emnlp emnlp2010 emnlp2010-91 knowledge-graph by maker-knowledge-mining

91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding


Source: pdf

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have investigated the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out a large number of synonyms, and thus maintains a reasonable embedding capacity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One of the major transformations used in Linguistic Steganography is synonym substitution. [sent-6, score-0.496]

2 In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. [sent-8, score-0.676]

3 First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. [sent-9, score-0.732]

4 We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. [sent-11, score-0.634]

5 Steganography is concerned with hiding information in a cover medium, in order to facilitate covert communication, such that the presence of the information is imperceptible to a user (human or computer). [sent-13, score-0.24]

6 Much of the existing research in steganography has used images as cover media; however, given the ubiquitous nature of electronic text, interest is growing in using natural language as the cover medium. [sent-14, score-0.467]

7 In terms of security, a linguistic stegosystem should impose minimal embedding distortion on the cover text, so that the resulting stegotext in which a message is camouflaged is inconspicuous, resulting in high imperceptibility. [sent-23, score-0.432]

8 In addition, since steganography aims at covert communication, a linguistic stegosystem should allow sufficient embedding capacity, known as the payload. [sent-24, score-0.576]

9 However, the current state-of-the-art in language technology is arguably not good enough for secure linguistic steganography based on sophisticated semantic transformations, and the level of robustness required to perform practical experiments has only just become available. [sent-39, score-0.363]

10 Synonym substitution is a relatively straightforward linguistic steganography method. [sent-42, score-0.668]

11 There are two practical difficulties associated with hiding bits using synonym substitution. [sent-45, score-0.666]

12 Our solution to this problem is a novel vertex colouring method which ensures that words are always assigned the same bit string, even when they appear in different synsets. [sent-49, score-0.418]

13 The resulting precision of our lexical substitution system can be seen as an indirect measure of the imperceptibility of the stegosystem, whereas the recall can be seen as an indirect measure of the payload. [sent-56, score-0.23]

14 Then the vertex colouring method is presented, and finally we show how the contextual check can be integrated with the vertex colouring coding method to give a complete stegosystem. [sent-58, score-0.959]

15 Also, Chang and Clark (2010) is a recent NLP paper which describes the general linguistic steganography framework. [sent-60, score-0.34]

16 For example, the secret bitstring can be divided into two codewords, 0 and 10, and the information carriers in the cover text are the words finished and project (Figure 2 gives an example of applying the basic algorithm to overlapping synsets). [sent-71, score-0.46]

17 This algorithm requires synonym sets to be disjoint. [sent-76, score-0.438]

18 No word may appear in more than one synonym set, since overlapping synsets may cause ambiguities during the decoding stage. [sent-78, score-0.789]

19 Figure 2 shows what happens when the basic algorithm is applied to two overlapping synonym sets. [sent-79, score-0.508]

20 As can be seen from the example, composition is represented by two different codewords and thus the secret bitstring cannot be reliably recovered, since the receiver does not know the original cover word or the sense of the word. [sent-80, score-0.466]
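To make the basic algorithm concrete, here is a minimal Python sketch of block coding over a single, disjoint synset. The fixed-length codeword scheme and the function name are illustrative assumptions; the example above uses variable-length codewords (0 and 10), so this is only one possible instantiation, not the paper's exact coding.

```python
import math
from typing import Dict, List

def block_code(synset: List[str]) -> Dict[str, str]:
    """Assign fixed-length bit strings to the members of one disjoint synset.

    With m synonyms, the largest k with 2**k <= m fixes the codeword length;
    surplus words beyond 2**k are simply left unused in this sketch.
    """
    if len(synset) < 2:
        return {}  # a single-entry synset cannot carry any hidden bits
    k = int(math.log2(len(synset)))
    usable = synset[: 2 ** k]
    return {word: format(i, f"0{k}b") for i, word in enumerate(usable)}

# e.g. block_code(["finished", "complete", "done", "over"])
# -> {'finished': '00', 'complete': '01', 'done': '10', 'over': '11'}
```

Under this reading, a synset of size m can carry roughly log2(m) bits per substitution, which is why single-entry synsets are useless to the stegosystem, and why overlapping synsets break decoding: the receiver cannot tell which codeword a shared word stands for.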

21 In order to solve this problem, we propose a novel coding method based on vertex colouring, described in Section 4. [sent-81, score-0.355]

22 In addition to the basic algorithm, Winstein proposed the T-Lex system using synonym substitution as the text transformation. [sent-82, score-0.602]

23 In order to solve the problem of words appearing in more than one synonym set, Winstein defines interchangeable words as words that belong to the same synsets, and only uses these words for substitution. [sent-83, score-0.477]

24 Another stegosystem based on synonym substitution was proposed by Bolshakov (2004). [sent-88, score-0.812]

25 In order to ensure both sender and receiver use the same synsets, Bolshakov applied transitive closure to overlapping synsets to avoid the decoding ambiguity. [sent-89, score-0.588]

26 Applying transitive closure leads to a merger of all the overlapping synsets into one set which is then seen as the synset of a target word. [sent-90, score-0.653]
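As a rough illustration of this transitive-closure step, the sketch below merges every pair of synsets that share a word until no overlaps remain; the function name and data layout are assumptions for illustration only.

```python
from typing import Iterable, List, Set

def transitive_closure(synsets: Iterable[Set[str]]) -> List[Set[str]]:
    """Merge overlapping synsets until each word belongs to exactly one set."""
    merged: List[Set[str]] = []
    for synset in synsets:
        group = set(synset)
        rest = []
        for existing in merged:
            if existing & group:      # shared word: absorb the existing set
                group |= existing
            else:
                rest.append(existing)
        rest.append(group)
        merged = rest
    return merged

# e.g. transitive_closure([{"finished", "complete"}, {"complete", "concluded"}])
# -> [{"finished", "complete", "concluded"}]
```

The merged sets can grow very large, which is exactly the disadvantage discussed below.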

27 Consider the overlapping synsets in Figure 2 as an example. [sent-91, score-0.351]

28 Finally, the collocationally verified synonyms are encoded by using the block coding method. [sent-94, score-0.355]

29 The disadvantage of Bolshakov’s system is that all words in a synonym transitive closure chain need to be considered, which can lead to very large sets of synonyms, many of which are not synonymous with the original target word. [sent-96, score-0.538]

30 In contrast, our proposed method operates on the original synonym sets without extending them unnecessarily. [sent-97, score-0.461]

31 Since the purpose of using WordNet is to find possible substitutes for a target word, those synsets containing only one entry are not useful and are ignored by our stegosystem. [sent-100, score-0.425]

32 In addition, our stegosystem only considers single-word substitutions, in order to avoid confusion when locating information-carrying words during the decoding phase. [sent-101, score-0.374]

33 For example, if the cover word ‘complete’ is replaced by ‘all over’, the receiver would not know whether the secret message is embedded in the word ‘over’ or the phrase ‘all over’ . [sent-102, score-0.315]

34 Table 1 shows the statistics of synsets used in our stegosystem. [sent-103, score-0.281]

35 For the contextual check we use the Google Web 1T 5-gram Corpus (Brants and Franz, 2006), which contains counts for n-grams from unigrams through to five-grams, obtained from over 1 trillion word tokens of English Web text. [sent-108, score-0.568]

36 In order to measure the degree of acceptability of a substitution, the proposed filter calculates a substitution score for a synonym using the observed frequency counts in the Web n-gram corpus. [sent-113, score-0.624]

37 The method first extracts contextual n-grams around the synonym and queries the n-gram frequency counts from the corpus. [sent-114, score-0.502]

38 The main purpose of taking the max is to score each word relative to the most likely synonym in the group, so that even in less frequent contexts, which lead to smaller frequency counts, the score of each synonym can still indicate its degree of feasibility. [sent-119, score-0.876]
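One possible reading of this contextual check, sketched in Python: sum the Web 1T counts of every n-gram that covers the substituted position and normalise by the best-scoring synonym in the group. The exact n-gram windows and the scoring formula here are assumptions; only the use of contextual n-gram counts and the max normalisation are stated above.

```python
from typing import Dict, List

def substitution_scores(sentence: List[str], position: int, synonyms: List[str],
                        ngram_counts: Dict[str, int], max_n: int = 5) -> Dict[str, float]:
    """Score each candidate synonym by the contextual n-gram counts it attracts,
    relative to the most likely synonym in the group (a sketch, not the exact formula)."""
    raw: Dict[str, int] = {}
    for syn in synonyms:
        words = sentence[:position] + [syn] + sentence[position + 1:]
        total = 0
        for n in range(2, max_n + 1):                  # 2- to 5-grams, as in Web 1T
            for start in range(position - n + 1, position + 1):
                if start < 0 or start + n > len(words):
                    continue                           # window falls off the sentence
                total += ngram_counts.get(" ".join(words[start:start + n]), 0)
        raw[syn] = total
    best = max(raw.values()) if raw else 0
    return {syn: (count / best if best else 0.0) for syn, count in raw.items()}
```

A synonym would then be accepted when its normalised score exceeds a threshold tuned on held-out data, trading imperceptibility against payload.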

39 Figure 3 gives an example of using the proposed synonym checking method, computing the substitution score for the synonym ‘pole’ given the cover sentence “This is not a very high bar.” [sent-125, score-1.253]

40 In order to evaluate the proposed synonym checking method, we need data to test whether our method can pick out acceptable substitutions. [sent-133, score-0.6]

41 We use the sentences in this gold standard as the cover text in our experiments, so that the substitutes provided by the annotators can be the positive data for evaluating the proposed synonym checking method. [sent-137, score-0.684]

42 Hence we assume that, if a word in the correct synset for a target word is not in the set produced by the human annotators, then it is inappropriate for that context and is a suitable negative example. [sent-147, score-0.303]

43 This method is appropriate because our steganography system has to distinguish between good and bad synonyms from WordNet, given a particular context. [sent-148, score-0.462]

44 For the above reasons, we extract the negative data for our experiments by first matching positive substitutes of a target word to all the synsets that contain the target word in WordNet. [sent-149, score-0.521]

45 The synset that includes the most positive substitutes is used to represent the meaning of the target word. [sent-150, score-0.371]

46 If there is more than one synset containing the highest number of positives, all the synsets are taken into consideration. [sent-151, score-0.483]

47 We then randomly select up to six single-word synonyms other than positive substitutes from the chosen synset(s) as negative instances of the target word. [sent-152, score-0.345]
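The negative-data procedure described above might be sketched as follows; the tie-breaking behaviour and the single-word filter details are illustrative assumptions.

```python
import random
from typing import List, Set

def extract_negatives(target: str, positives: Set[str],
                      synsets: List[Set[str]], max_negatives: int = 6) -> Set[str]:
    """Select negative substitutes for one target word from the synset(s)
    that contain the most human-annotated positives (a sketch)."""
    containing = [s for s in synsets if target in s]
    if not containing:
        return set()
    best = max(len(s & positives) for s in containing)
    chosen = [s for s in containing if len(s & positives) == best]   # keep all ties
    candidates = set().union(*chosen) - positives - {target}
    candidates = {w for w in candidates if " " not in w}             # single-word synonyms only
    k = min(max_negatives, len(candidates))
    return set(random.sample(sorted(candidates), k))
```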

48 We assume the selected synset represents the meaning of the original word, and those synonyms in the synset which are not annotated as positives must have a certain degree of mismatch to the context. [sent-155, score-0.542]

49 Therefore, from this example, ‘balance’, ‘residue’, ‘residuum’ and ‘rest’ are extracted as negatives to test whether our synonym checking method can pick out bad substitutions from a set of words sharing similar or the same meaning. [sent-156, score-0.685]

50 Since the main purpose of the data set is to test whether the proposed synonym checking method can guard against inappropriate synonym substitutions and be integrated in the stegosystem, it is reasonable to have a few false negatives in our experimental data. [sent-160, score-1.153]

51 Missing an acceptable substitution is a less serious problem for a stegosystem, in terms of security, than including an inappropriate replacement. [sent-166, score-0.808]

52 Precision is the percentage of substitutions judged acceptable by the method that are also determined to be suitable synonyms by the human judges. [sent-172, score-0.259]

53 Table 4 gives the results for the synonym checking method and the average threshold values over the 5 folds. [sent-177, score-0.613]

54 Therefore, the practical threshold setting would depend on how steganography users want to trade off imperceptibility for payload. [sent-184, score-0.435]

55 Figure 6 gives an example of a coloured synonym graph. [sent-185, score-0.551]

56 In this section, we propose a novel coding method based on vertex colouring, by which each synonym is assigned a unique codeword, so that the use of overlapping synsets is not problematic for data embedding and extraction. [sent-186, score-1.635]

57 A vertex colouring is a labelling of the graph’s vertices with colours subject to the condition that no two adjacent vertices share the same colour. [sent-187, score-0.573]

58 The smallest number of colours required to colour a graph G is called its chromatic number χ(G), and a graph G having chromatic number χ(G) = k is called a k-chromatic graph. [sent-188, score-0.462]

59 The main idea of the proposed coding method is to represent overlapping synsets as an undirected k-chromatic graph called a synonym graph which has a vertex for each word and an edge for every pair of words that share the same meaning. [sent-189, score-1.292]

60 A synonym is then encoded by a codeword that represents the colour assigned by the vertex colouring of the synonym graph. [sent-190, score-1.573]

61 Figure 6 shows the use of four different colours, represented by ‘00’, ‘01’, ‘10’ and ‘11’, to colour the 4-chromatic synonym graph of the two overlapping synsets in Figure 2. [sent-191, score-0.942]

62 Most synsets in WordNet have size less than 8, which means most of the synsets cannot exhaust more than a 2-bit coding space. [sent-194, score-0.756]

63 Therefore, we restrict the chromatic number of a synonym graph G to 1 < χ(G) ≤ 4, which implies the maximum size of a synset is 4. [sent-197, score-0.753]

64 When χ(G) = 2, each vertex is assigned a single-bit codeword, either ‘0’ or ‘1’, as shown in Figure 7(a) (Figure 7 gives examples of 2-, 3- and 4-chromatic synonym graphs). [sent-198, score-0.862]

65 When χ(G) = 3, the overlapping set’s size is either 2 or 3, which cannot exhaust the 2-bit coding space although codewords ‘00’, ‘01’ and ‘10’ are initially assigned to each vertex. [sent-199, score-0.344]

66 Therefore, only the most significant bits are used to represent the synonyms, which we call codeword reduction. [sent-200, score-0.31]

67 After the codeword reduction, if a vertex has the same codeword, say ‘0’, as all of its neighbors, the vertex’s codeword must be changed to ‘1’ so that the vertex would be able to accommodate either secret bit ‘0’ or ‘1’, which we call codeword correction. [sent-201, score-1.174]

68 Figure 7(b) shows an example of the process of codeword reduction and codeword correction for χ(G) = 3. [sent-202, score-0.519]

69 For the case of χ(G) = 4, codeword reduction is applied to those vertices that, together with their neighboring vertices, do not have access to all the codewords ‘00’, ‘01’, ‘10’ and ‘11’. [sent-203, score-0.374]

70 For example, vertices a, b, c, e and f in Figure 7(c) meet the requirement of needing codeword reduction. [sent-204, score-0.277]

71 The codeword correction process is then further applied to vertex f to rectify its accessibility. [sent-205, score-0.447]
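The following Python sketch gives one illustrative reading of codeword reduction and correction, assuming a colour index (0 to 3) per word, for example as produced by the greedy colouring described just below. The neighbourhood test and the order of the two passes are assumptions, not the paper's exact procedure.

```python
from typing import Dict, Set

def reduce_and_correct(colour: Dict[str, int],
                       adjacency: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map colour indices to 2-bit codewords, then apply codeword reduction
    and codeword correction (one illustrative reading of the procedure)."""
    full = {w: format(c, "02b") for w, c in colour.items()}   # '00'..'11'
    code = dict(full)
    for w, neigh in adjacency.items():
        reachable = {full[w]} | {full[v] for v in neigh}
        if len(reachable) < 4:         # neighbourhood cannot exhaust the 2-bit space
            code[w] = full[w][0]       # codeword reduction: keep the most significant bit
    for w, neigh in adjacency.items():
        if neigh and len(code[w]) == 1 and all(code[v][0] == code[w] for v in neigh):
            code[w] = "1" if code[w] == "0" else "0"          # codeword correction
    return code
```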

72 Figure 8 describes a greedy algorithm for constructing a coded synonym graph using at most 4 colours, given n synonyms w1, w2, ..., wn. [sent-206, score-0.65]

73 Let us define a function E(wi, wj) which returns an edge between wi and wj if wi and wj are in the same synset, and returns false otherwise. [sent-210, score-0.298]

74 Another function C(wi) returns the colour of the synonym wi. [sent-211, score-0.517]

75 For each iteration, the procedure first finds available colours for the target synonym wi. [sent-213, score-0.628]

76 If there is no colour available, namely all the four colours have already been given to wi’s neighbors, wi is randomly assigned one of the four colours; otherwise, wi is assigned one of the available colours. [sent-214, score-0.44]

77 After adding wi to the graph G, the procedure checks whether adding an edge of wi to graph G would violate the vertex colouring. [sent-215, score-0.468]

78 After constructing the coloured graph, codeword reduction and codeword correction as previously described are applied to revise improper codewords. [sent-216, score-0.558]
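A minimal Python sketch of the greedy construction described above: the neighbour computation and the random fallback when all four colours are taken follow the description, while the iteration order and data structures are assumptions.

```python
import random
from typing import Dict, List, Set

def colour_synonym_graph(words: List[str], synsets: List[Set[str]]) -> Dict[str, int]:
    """Greedily colour the synonym graph with at most four colours.

    Vertices are words; two words are adjacent when they share a synset.
    Each colour index (0 to 3) is later mapped to a two-bit codeword.
    """
    def neighbours(w: str) -> Set[str]:
        return {v for s in synsets if w in s for v in s if v != w}

    colour: Dict[str, int] = {}
    for w in words:                                    # fixed processing order
        used = {colour[v] for v in neighbours(w) if v in colour}
        free = [c for c in range(4) if c not in used]
        # if all four colours are already taken by neighbours, fall back to a random one
        colour[w] = random.choice(free) if free else random.randrange(4)
    return colour
```

The codeword reduction and correction pass sketched earlier would then be applied to the returned colour assignment to produce the final codewords.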

79 We define a possible information carrier as a word in the cover sentence that belongs to at least one synset in WordNet. [sent-221, score-0.351]

80 The synsets containing the target word, and all other synsets which can be reached via the synonym relation, are then extracted from WordNet. [sent-222, score-1.033]

81 That is, we build the connected component of WordNet which contains the target word according to the synonym relation. [sent-224, score-0.471]

82 If there is more than one word left, and the words which pass the filter all belong to the same synset, the block coding method is used to encode the words; otherwise the vertex colouring coding is applied. [sent-226, score-0.718]

83 Finally, according to the secret bitstring, the system selects the synonym that shares an edge with the target word and has as its codeword the longest potential match with the secret bitstring. [sent-227, score-0.909]
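The selection of the substitute by longest codeword match might look like the following sketch; treating the cover word itself as a candidate, and the helper names, are assumptions.

```python
from typing import Dict, List, Optional, Tuple

def choose_substitute(target: str, secret_bits: str, code: Dict[str, str],
                      neighbours: List[str]) -> Optional[Tuple[str, str]]:
    """Pick the word whose codeword is the longest prefix of the remaining secret
    bitstring, and return it together with the still-unembedded bits (a sketch)."""
    best: Optional[str] = None
    for word in neighbours + [target]:     # keeping the cover word is also allowed here
        cw = code.get(word, "")
        if cw and secret_bits.startswith(cw) and (best is None or len(cw) > len(code[best])):
            best = word
    if best is None:
        return None                        # no codeword matches the next secret bits
    return best, secret_bits[len(code[best]):]
```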

84 We use the connected component of WordNet containing the target word as a simple method to ensure that both sender and receiver colour-code the same synonym graph. [sent-228, score-0.664]

85 For the decoding process, the receiver does not need the original text for extracting secret data. [sent-233, score-0.232]

86 An information carrier can be found in the stegotext by referring to WordNet, from which the related synonyms are extracted. [sent-234, score-0.243]

87 Those words in the related sets undergo the synonym checking method and are then encoded by either the block coding or the vertex colouring coding scheme, depending on whether the remaining words are in the same synset. [sent-235, score-1.287]

88 Finally, the secret bitstring is implicit in the codeword of the information carrier and therefore can be extracted. [sent-236, score-0.469]
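Decoding then reduces to reading off codewords, as in this simplified sketch, which assumes sender and receiver reconstruct identical coded graphs from WordNet and that every information carrier is correctly identified.

```python
from typing import Dict, List

def extract_bits(carriers: List[str], code: Dict[str, str]) -> str:
    """Recover the secret bitstring by concatenating the codewords of the
    information carriers found in the stegotext, in sentence order (a sketch)."""
    return "".join(code[w] for w in carriers if w in code)
```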

89 We demonstrate how to embed the secret bit 1 in the sentence “it is a shame that we could not reach the next stage” (Figure 9 shows the framework of the proposed lexical stegosystem). [sent-237, score-0.475]

90 Table 5 lists the related synsets extracted from WordNet. [sent-239, score-0.281]

91 The score of each word calculated by the synonym checking method using the Web 1T Corpus is given as a subscript. [sent-240, score-0.568]

92 The output of the synonym checking method is shown at the right side of Table 5. [sent-243, score-0.568]

93 Since the remaining words do not belong to the same synset, the vertex colouring coding method is then used to encode the words. [sent-244, score-0.525]

94 Figure 10(a) is the original synset graph in which each vertex is assigned one of the four colours; Figure 10(b) is the graph after applying codeword reduction. [sent-245, score-0.774]

95 The vertex colouring coding method represents synonym substitution as a synonym graph so the relations between words can be clearly observed. [sent-251, score-1.639]

96 In addition, an automatic system for checking synonym acceptability in context is integrated in our stegosystem to ensure information security. [sent-252, score-0.777]

97 For future work, we would like to explore more linguistic transformations that can meet the requirements of linguistic steganography, namely retaining the meaning, grammaticality and style of the original text. [sent-253, score-0.437]

98 In addition, it is crucial to have a full evaluation of the linguistic stegosystem in terms of imperceptibility and payload capacity, so we can know how much data can be embedded before the cover text reaches the maximum distortion tolerated by a human judge. [sent-254, score-0.447]

99 A method of linguistic steganography based on collocationally-verified synonym. [sent-281, score-0.363]

100 The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. [sent-352, score-0.713]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('synonym', 0.438), ('steganography', 0.301), ('synsets', 0.281), ('codeword', 0.236), ('stegosystem', 0.21), ('synset', 0.202), ('colouring', 0.17), ('coding', 0.168), ('substitution', 0.164), ('vertex', 0.164), ('colours', 0.157), ('watermarking', 0.144), ('synonyms', 0.138), ('hiding', 0.131), ('receiver', 0.131), ('substitutes', 0.111), ('security', 0.107), ('checking', 0.107), ('topkara', 0.105), ('secret', 0.101), ('pole', 0.092), ('cover', 0.083), ('atallah', 0.079), ('multimedia', 0.079), ('payload', 0.079), ('colour', 0.079), ('wi', 0.078), ('wordnet', 0.075), ('bits', 0.074), ('graph', 0.074), ('wj', 0.071), ('overlapping', 0.07), ('substitutions', 0.066), ('bitstring', 0.066), ('carrier', 0.066), ('imperceptibility', 0.066), ('shame', 0.066), ('spie', 0.066), ('embed', 0.061), ('embedding', 0.061), ('transformations', 0.058), ('codewords', 0.056), ('mikhail', 0.056), ('bolshakov', 0.052), ('mercan', 0.052), ('winstein', 0.052), ('negatives', 0.051), ('correction', 0.047), ('threshold', 0.045), ('vertices', 0.041), ('contextual', 0.041), ('chromatic', 0.039), ('coloured', 0.039), ('interchangeable', 0.039), ('nounverbadjadv', 0.039), ('pity', 0.039), ('raskin', 0.039), ('sender', 0.039), ('stegotext', 0.039), ('umut', 0.039), ('linguistic', 0.039), ('negative', 0.038), ('closure', 0.037), ('bit', 0.037), ('capacity', 0.036), ('contents', 0.036), ('check', 0.036), ('google', 0.034), ('victor', 0.034), ('target', 0.033), ('jose', 0.033), ('acceptable', 0.032), ('murphy', 0.03), ('transitive', 0.03), ('substitute', 0.03), ('inappropriate', 0.03), ('composition', 0.029), ('annotators', 0.027), ('web', 0.027), ('volume', 0.026), ('carriers', 0.026), ('covert', 0.026), ('exhaust', 0.026), ('hempelmann', 0.026), ('meral', 0.026), ('residue', 0.026), ('residuum', 0.026), ('taskiran', 0.026), ('vybornova', 0.026), ('san', 0.025), ('positive', 0.025), ('thresholds', 0.025), ('block', 0.025), ('encoded', 0.024), ('assigned', 0.024), ('chang', 0.023), ('method', 0.023), ('practical', 0.023), ('cam', 0.022), ('acceptability', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have investigated the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out a large number of synonyms, and thus maintains a reasonable embedding capacity.

2 0.10206504 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

3 0.068863101 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. More over, the method that uses bilingual data for learning hash functions gives the best performance.

4 0.055369478 77 emnlp-2010-Measuring Distributional Similarity in Context

Author: Georgiana Dinu ; Mirella Lapata

Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.

5 0.048932761 59 emnlp-2010-Identifying Functional Relations in Web Text

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Determining whether a textual phrase denotes a functional relation (i.e., a relation that maps each domain element to a unique range element) is useful for numerous NLP tasks such as synonym resolution and contradiction detection. Previous work on this problem has relied on either counting methods or lexico-syntactic patterns. However, determining whether a relation is functional, by analyzing mentions of the relation in a corpus, is challenging due to ambiguity, synonymy, anaphora, and other linguistic phenomena. We present the LEIBNIZ system that overcomes these challenges by exploiting the synergy between the Web corpus and freelyavailable knowledge resources such as Freebase. It first computes multiple typedfunctionality scores, representing functionality of the relation phrase when its arguments are constrained to specific types. It then aggregates these scores to predict the global functionality for the phrase. LEIBNIZ outperforms previous work, increasing area under the precisionrecall curve from 0.61 to 0.88. We utilize LEIBNIZ to generate the first public repository of automatically-identified functional relations.

6 0.047940161 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

7 0.043499622 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

8 0.041214023 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

9 0.039406355 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

10 0.03886148 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

11 0.036633104 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

12 0.034425624 95 emnlp-2010-SRL-Based Verb Selection for ESL

13 0.033640478 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

14 0.033220809 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization

15 0.03230913 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

16 0.031919654 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

17 0.031228865 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

18 0.029628776 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

19 0.029208653 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

20 0.029047746 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.113), (1, 0.049), (2, -0.058), (3, 0.033), (4, 0.024), (5, 0.028), (6, -0.054), (7, -0.032), (8, -0.015), (9, 0.02), (10, -0.018), (11, 0.037), (12, -0.043), (13, -0.031), (14, -0.008), (15, -0.041), (16, 0.025), (17, 0.006), (18, 0.003), (19, -0.051), (20, -0.013), (21, -0.135), (22, -0.2), (23, 0.013), (24, -0.107), (25, -0.162), (26, -0.164), (27, 0.119), (28, 0.116), (29, 0.058), (30, -0.084), (31, 0.171), (32, 0.168), (33, -0.004), (34, -0.121), (35, -0.189), (36, 0.096), (37, -0.269), (38, 0.075), (39, -0.154), (40, -0.096), (41, -0.019), (42, -0.321), (43, 0.051), (44, -0.185), (45, -0.132), (46, 0.075), (47, 0.079), (48, 0.294), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96216953 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have investigated the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out a large number of synonyms, and thus maintains a reasonable embedding capacity.

2 0.37495926 59 emnlp-2010-Identifying Functional Relations in Web Text

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Determining whether a textual phrase denotes a functional relation (i.e., a relation that maps each domain element to a unique range element) is useful for numerous NLP tasks such as synonym resolution and contradiction detection. Previous work on this problem has relied on either counting methods or lexico-syntactic patterns. However, determining whether a relation is functional, by analyzing mentions of the relation in a corpus, is challenging due to ambiguity, synonymy, anaphora, and other linguistic phenomena. We present the LEIBNIZ system that overcomes these challenges by exploiting the synergy between the Web corpus and freelyavailable knowledge resources such as Freebase. It first computes multiple typedfunctionality scores, representing functionality of the relation phrase when its arguments are constrained to specific types. It then aggregates these scores to predict the global functionality for the phrase. LEIBNIZ outperforms previous work, increasing area under the precisionrecall curve from 0.61 to 0.88. We utilize LEIBNIZ to generate the first public repository of automatically-identified functional relations.

3 0.29023448 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

Author: Ioannis Klapaftis ; Suresh Manandhar

Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs sig- nificantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.

4 0.24811105 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
Separating path from emission helps ensure that topics are consistent across languages. Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment.

1.2 The MLSLDA Model

We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA; Blei and McAuliffe, 2007), which makes predictions based on the topics expressed in a document; this can be thought of as projecting the words in a document into a low-dimensional space whose dimension equals the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a real-valued score to each document. The score comes from a normal distribution whose mean is the dot product between a regression parameter η, which encodes the influence of each topic on the observation, and the document’s empirical topic frequencies, and whose variance is σ². With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows:

1. For each topic i = 1 . . . K, draw a topic distribution {βi, ωi, φi} from MULTDIRHIER(τ, κ, π).
2. For each document d = 1 . . . M with language ld:
   (a) Choose a distribution over topics θd ∼ Dir(α).
   (b) For each word in the document n = 1 . . . Nd, choose a topic assignment zd,n ∼ Mult(θd) and a path λd,n ending at word wd,n according to Equation 1, using {βzd,n, ωzd,n, φzd,n}.
3. Choose a response variable y ∼ Norm(η⊤ z̄d, σ²), where z̄d ≡ (1/Nd) Σ_{n=1}^{Nd} zd,n.

Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document.

2 Inference

Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics.
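As a minimal illustration of this generative story (not the authors' code), the sketch below samples one document and its real-valued response, replacing the synset hierarchy with a flat per-language word distribution per topic; all sizes, vocabularies, regression weights, and other parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4                                   # number of topics (assumed)
vocab = {"en": ["good", "bad", "plot", "boring"],
         "de": ["gut", "schlecht", "handlung", "langweilig"]}
alpha = np.full(K, 0.3)                 # document-topic Dirichlet prior (assumed)
eta = np.array([1.5, -1.0, 0.2, -0.4])  # regression weights, one per topic (assumed)
sigma = 0.5                             # response standard deviation (assumed)

# Stand-in for MULTDIRHIER: one word distribution per (topic, language).
phi = {lang: rng.dirichlet(np.ones(len(words)), size=K)
       for lang, words in vocab.items()}

def generate_document(lang, n_words=20):
    theta = rng.dirichlet(alpha)                    # (a) topic distribution for the document
    z = rng.choice(K, size=n_words, p=theta)        # (b) topic assignment per word position
    words = [vocab[lang][rng.choice(len(vocab[lang]), p=phi[lang][k])] for k in z]
    z_bar = np.bincount(z, minlength=K) / n_words   # empirical topic frequencies z-bar
    y = rng.normal(eta @ z_bar, sigma)              # (3) response ~ Norm(eta . z_bar, sigma^2)
    return words, y

words, y = generate_document("de")
print(words[:5], round(y, 3))
```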
After randomly initializing the topics, we alternate between sampling the topic and path of a word (zd,n, λd,n) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path, conditioning on all of the other path and topic assignments in the corpus, selecting a path and topic with probability

p(zn = k, λn = r | z−n, λ−n, wn, η, σ, Θ) = p(yd | z, η, σ) · p(λn = r | zn = k, λ−n, wn, τ, κ, π) · p(zn = k | z−n, α).   (2)

Each of these three terms reflects a different influence on the topics: from the response variable, from the vocabulary structure, and from the document’s topics. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution. As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over the possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, Tk,i,j; stop counts in synset i in topic k, Ok,i,0; continue counts in synset i in topic k, Ok,i,1; and emission counts in synset i in language l in topic k, Fk,i,l. [Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies.] The probability of taking a path r is then

p(λn = r | zn = k, λ−n) = [ ∏_{(i,j)∈r} ( (T_{k,i,j} + τ_{i,j}) / (Σ_{j′} (T_{k,i,j′} + τ_{i,j′})) ) · ( (O_{k,i,1} + ω_{i,1}) / (Σ_{s∈{0,1}} (O_{k,i,s} + ω_{i,s})) ) ] · ( (O_{k,r_end,0} + ω_{r_end,0}) / (Σ_{s∈{0,1}} (O_{k,r_end,s} + ω_{r_end,s})) ) · ( (F_{k,r_end,wn} + π_{r_end,l,wn}) / (Σ_{w′} (F_{k,r_end,w′} + π_{r_end,l,w′})) ),   (3)

where the bracketed product is the transition part and the remaining factors are the emission part. Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009). Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷd toward the true yd will be favored. Dropping terms that are constant across all topics, the effect of the response variable is

p(yd | z, η, σ) ∝ exp( (ηk / (Nd σ²)) ( yd − Σ_{k′} Nd,k′ ηk′ / Nd ) − ηk² / (2 Nd² σ²) ),

where Nd,k′ counts the document’s other words currently assigned to topic k′, so the subtracted sum captures the other words’ influence on the prediction.
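Putting Equation 2 together in code form, the sketch below assembles an unnormalized weight for one candidate (topic, path) pair; the path factor is a deliberately fake placeholder rather than Equation 3, and all names, values, and the simplified count bookkeeping are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def doc_topic_term(k, doc_topic_counts, alpha):
    """p(z_n = k | z_{-n}, alpha): standard collapsed document-topic factor."""
    return (doc_topic_counts[k] + alpha) / (doc_topic_counts.sum() + alpha * len(doc_topic_counts))

def response_term(k, y_d, eta, doc_topic_counts, n_d, sigma):
    """p(y_d | z, eta, sigma) up to a constant: favour topics that move the
    predicted response toward the observed y_d (see the expression above)."""
    others = eta @ doc_topic_counts / n_d          # other words' influence on the prediction
    return np.exp((eta[k] / (n_d * sigma ** 2)) * (y_d - others)
                  - eta[k] ** 2 / (2 * n_d ** 2 * sigma ** 2))

def path_term(path, k):
    """Placeholder for the transition/emission product of Equation 3."""
    return 1.0 / (len(path) * (k + 1))             # dummy score, NOT the real model

def joint_weight(k, path, y_d, eta, doc_topic_counts, alpha, sigma):
    """Unnormalized weight for assigning the current word to topic k with path r."""
    n_d = doc_topic_counts.sum() + 1               # document length including the current word
    return (response_term(k, y_d, eta, doc_topic_counts, n_d, sigma)
            * path_term(path, k)
            * doc_topic_term(k, doc_topic_counts, alpha))

counts = np.array([3.0, 0.0, 1.0])                 # topic counts for the rest of the document
print(joint_weight(k=0, path=["root", "dog"], y_d=0.8,
                   eta=np.array([1.0, -1.0, 0.0]), doc_topic_counts=counts,
                   alpha=0.3, sigma=0.5))
```

In an actual sampler, weights of this kind would be computed for every (topic, path) candidate and normalized before drawing the new assignment.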

5 0.22416595 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user's profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query's model and finally, (iv) the refinement of the view. The proposed approach was consolidated by carrying out an experiment on an LMF normalized Arabic dictionary.

6 0.20630251 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

7 0.20291732 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

8 0.18870784 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

9 0.18537538 77 emnlp-2010-Measuring Distributional Similarity in Context

10 0.18077901 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

11 0.17080531 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

12 0.1533704 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

13 0.1512253 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

14 0.13933137 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

15 0.1392933 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

16 0.13420701 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

17 0.12751281 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

18 0.12637508 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

19 0.1249522 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

20 0.12457153 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.011), (12, 0.046), (29, 0.061), (30, 0.02), (52, 0.02), (56, 0.031), (66, 0.622), (72, 0.052), (76, 0.016), (89, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99626535 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

same-paper 2 0.98880041 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have studied the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out large number of synonyms, and thus maintaining a reasonable embedding capacity.

3 0.98019701 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

Author: Xin Zhao ; Jing Jiang ; Hongfei Yan ; Xiaoming Li

Abstract: Discovering and summarizing opinions from online reviews is an important and challenging task. A commonly-adopted framework generates structured review summaries with aspects and opinions. Recently topic models have been used to identify meaningful review aspects, but existing topic models do not identify aspect-specific opinion words. In this paper, we propose a MaxEnt-LDA hybrid model to jointly discover both aspects and aspect-specific opinion words. We show that with a relatively small amount of training data, our model can effectively identify aspect and opinion words simultaneously. We also demonstrate the domain adaptability of our model.

4 0.97894847 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

Author: Peter McClanahan ; George Busby ; Robbie Haertel ; Kristian Heal ; Deryle Lonsdale ; Kevin Seppi ; Eric Ringger

Abstract: We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.

5 0.97597444 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

6 0.88794744 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

7 0.88712752 104 emnlp-2010-The Necessity of Combining Adaptation Methods

8 0.88704091 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

9 0.88101149 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

10 0.87247032 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

11 0.86985821 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

12 0.85852724 114 emnlp-2010-Unsupervised Parse Selection for HPSG

13 0.85149395 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

14 0.84024495 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

15 0.83531272 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

16 0.83504725 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

17 0.83286953 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

18 0.82855576 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

19 0.82036465 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

20 0.81576401 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions