emnlp emnlp2010 emnlp2010-123 knowledge-graph by maker-knowledge-mining

123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules


Source: pdf

Author: Yves Scherrer ; Owen Rambow

Abstract: We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a character-n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Word-based dialect identification with georeferenced rules. Yves Scherrer, LATL, Université de Genève, Genève, Switzerland. [sent-1, score-0.916]

2 We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. [sent-3, score-0.761]

3 We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. [sent-4, score-0.977]

4 In comparison with a character-n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. [sent-5, score-1.454]

5 Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data. [sent-6, score-0.703]

6 Dialect identification comes in two flavors: spoken dialect ID and written dialect ID. [sent-11, score-1.522]

7 Spoken dialect ID relies on speech recognition techniques which may not cope well with dialectal diversity. [sent-13, score-0.773]

8 Written dialect ID has to deal with non-standardized spellings that may obscure real dialectal differences. [sent-17, score-0.773]

9 Moreover, some phonetic distinctions cannot be expressed in orthographic writing systems and limit the input cues in comparison with spoken dialect ID. [sent-18, score-0.838]

10 This paper deals with written dialect ID, applied to the Swiss German dialect area. [sent-19, score-1.446]

11 An important aspect of our model is its conception of the dialect area as a continuum without clear-cut borders. [sent-20, score-0.703]

12 Our dialect ID model follows a bag-of-words approach based on the assumption that every dialectal word form is defined by a probability with which it may occur in each geographic area. [sent-21, score-0.883]

13 The main challenge is to create a lexicon of dialect word forms and their associated probability maps. [sent-23, score-0.805]

14 This linguistic atlas of Swiss German dialects is the result of decades-long empirical fieldwork. [sent-26, score-0.296]

15 We start with an overview of relevant research (Section 2) and present the characteristics of the Swiss German dialect area (Section 3). [sent-28, score-0.703]

16 Section 4 deals with the implementation of word transformation rules and the corresponding extraction of probability maps from the linguistic atlas of German-speaking Switzerland. [sent-29, score-0.3]

17 We present our dialect ID model in Section 5 and discuss its performance in Section 6 by relating it to a baseline n-gram model. [sent-30, score-0.703]

18 Biadsy et al. (2009) classify speech material from four Arabic dialects plus Modern Standard Arabic. [sent-43, score-0.25]

19 An original approach to the identification of Swiss German dialects has been taken by the ChochichästliOrakel. [sent-46, score-0.288]

20 By specifying the pronunciation of ten predefined words, the web site creates a probability map that shows the likelihood of these pronunciations in the Swiss German dialect area. [sent-47, score-0.835]

21 Its derivation from a Standard German lexicon can be viewed as a case of lexicon induction. [sent-50, score-0.168]

22 This data is available on an interactive web page (Scherrer, 2010), and we have proposed ideas for reusing this data for machine translation and dialect parsing (Scherrer and Rambow, 2010). [sent-56, score-0.734]

23 The German-speaking area of Switzerland encompasses the northeastern two thirds of the Swiss territory, and about two thirds of the Swiss population define (any variety of) German as their first language. [sent-60, score-0.317]

24 In German-speaking Switzerland, dialects are used in speech, while Standard German is used nearly exclusively in written contexts (diglossia). [sent-61, score-0.27]

25 It follows that all (adult) Swiss Germans are bidialectal: they master their local dialect and Standard German. [sent-62, score-0.703]

26 In addition, they usually have no difficulties understanding Swiss German dialects other than their own. [sent-63, score-0.23]

27 Despite the preference for spoken dialect use, written dialect data has been produced in the form of dialect literature and transcriptions of speech recordings made for scientific purposes. [sent-64, score-2.167]

28 More recently, written dialect has been used in electronic media like blogs, SMS, e-mail and chatrooms. [sent-65, score-0.743]

29 However, all this data is very heterogeneous in terms of the dialects used, spelling conventions and genre. [sent-67, score-0.342]

30 4 Georeferenced word transformation rules The key component of the proposed dialect ID model is an automatically generated list of Swiss German word forms, each of which is associated with a map that specifies its likelihood of occurrence over German-speaking Switzerland. [sent-68, score-0.897]

31 Besides Swiss German, the Alemannic dialect group encompasses Alsatian, South-West German Alemannic and the Vorarlberg dialects of Austria. [sent-74, score-0.933]

32 Our system generates written dialect words according to the Dieth spelling conventions without diacritics (Dieth, 1986). [sent-76, score-0.872]

33 These are characterized by a transparent grapheme-phone correspondence and are widely used by dialect writers. [sent-77, score-0.703]

34 This lack of standardization is problematic for dialect ID. [sent-79, score-0.703]

35 First, Standard German orthography may unduly influence dialect spelling. [sent-81, score-0.723]

36 Second, dialect writers do not always distinguish short and long vowels, while the Dieth conventions always use letter doubling to indicate vowel lengthening. [sent-83, score-0.804]

37 Future work will incorporate these fluctuations directly into the dialect ID model. [sent-84, score-0.703]

38 Our work is based on the assumption that many words show predictable phonetic differences between Standard German and the different Swiss German dialects. [sent-87, score-0.165]

39 Hence, in many cases, it is not necessary to explicitly model word-to-word correspondences, but a set of phonetic rules suffices to correctly transform words. [sent-88, score-0.165]

40 The treatment of Standard German nd (which becomes, for instance, nt [nt] in Valais and Uri) is captured in our system by four georeferenced rules: nd → nd, nd → ng, nd → nn and nd → nt. [sent-92, score-0.199]

41 There is another variant of the Dieth conventions that uses additional diacritics for finer-grained phonetic distinctions. [sent-95, score-0.182]

42 Each rule is associated with a probability map that specifies its validity in every geographic point. [sent-97, score-0.197]
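
To make this concrete, here is a minimal Python sketch of a georeferenced rule as a string substitution paired with a probability map over survey points. All point names, probabilities and the example word are invented for illustration and are not taken from the paper's implementation.

```python
from dataclasses import dataclass

# Illustrative survey points only; the real SDS grid has several hundred points.
POINTS = ["Bern", "Zurich", "Visp"]

@dataclass
class GeoRule:
    source: str     # Standard German substring, e.g. "nd"
    target: str     # dialectal substring, e.g. "ng"
    prob_map: dict  # survey point -> probability that this outcome holds there

    def apply(self, word):
        """Return the transformed word, or None if the rule does not match."""
        return word.replace(self.source, self.target) if self.source in word else None

# The four competing outcomes for Standard German "nd" (probabilities are made up).
nd_rules = [
    GeoRule("nd", "nd", {"Bern": 0.1, "Zurich": 0.2, "Visp": 0.1}),
    GeoRule("nd", "ng", {"Bern": 0.7, "Zurich": 0.1, "Visp": 0.0}),
    GeoRule("nd", "nn", {"Bern": 0.2, "Zurich": 0.7, "Visp": 0.0}),
    GeoRule("nd", "nt", {"Bern": 0.0, "Zurich": 0.0, "Visp": 0.9}),
]

print(nd_rules[1].apply("Hund"))  # -> "Hung"
```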

43 Some rules apply uniformly to all Swiss German dialects. [sent-99, score-0.294]

44 These rules do not immediately contribute to the dialect identification task, but they help to obtain correct Swiss German forms that contain other phonemes with better localization potential. [sent-102, score-0.9]

45 Some differences at the word level cannot be accounted for by pure phonetic alternations. [sent-107, score-0.165]

46 Standard German und ‘and’ is reduced to u in Bern dialect, where the phonetic rules would rather suggest *ung. [sent-110, score-0.165]

47 One of the largest research projects in Swiss German dialectology has been the elaboration of the Sprachatlas der deutschen Schweiz (SDS), a linguistic atlas that covers phonetic, morphological and lexical differences of Swiss German dialects. [sent-121, score-0.203]

48 Second, a set of maps may illustrate the same phenomenon with different words and slightly different geographic distributions. [sent-129, score-0.197]

49 As a result, our rule base contains about 300 phonetic rules covering 130 phenomena, 540 lexical rules covering 250 phenomena and 130 morphological rules covering 60 phenomena. [sent-131, score-0.36]

50 We believe this coverage to be sufficient for the dialect ID task. [sent-132, score-0.703]

51 Recall the nd-example used to illustrate the phonetic rules above. [sent-134, score-0.184]

52 We also collapse minor phonetic variants which cannot be distinguished in the Dieth spelling system. [sent-142, score-0.181]

53 A testing procedure splits a sentence into words, looks up their geographical extensions in the lexicon, and condenses the word-level maps into a sentence-level map (Section 5). [sent-162, score-0.185]

54 The Swiss German word form lexicon is created with the help of the georeferenced transfer rules presented above. [sent-166, score-0.241]

55 These rules require a lemmatized, POS-tagged and morphologically disambiguated Standard German word as an input and generate a set of dialect word/map tuples: each resulting dialect word is associated with a probability map that specifies its likelihood in each geographic point. [sent-167, score-1.667]

56 The notation w0 →* wn represents an iterative derivation leading from a Standard German word w0 to a dialectal word form wn by the application of n transfer rules of the type wi → wi+1. [sent-174, score-0.223]

57 The probability of a derivation corresponds to the joint probability of the rules it consists of. [sent-175, score-0.198]

58 Hence, the probability map of a derivation is defined as the pointwise product of all rule maps it consists of: ∀t ∈ GSS: p(w0 →* wn | t) = ∏_{i=0}^{n−1} p(wi → wi+1 | t). Note that in dialectological transition zones, there may be several valid outcomes for a given w0. [sent-176, score-0.369]
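
A small sketch of this formula, reusing the map-as-dict representation from the GeoRule sketch above; the example numbers are arbitrary.

```python
from math import prod

def derivation_map(rule_maps):
    """Pointwise product over survey points t:
    p(w0 ->* wn | t) = prod_i p(wi -> wi+1 | t)."""
    points = rule_maps[0].keys()
    return {t: prod(m[t] for m in rule_maps) for t in points}

m1 = {"Bern": 0.7, "Zurich": 0.1, "Visp": 0.0}   # e.g. the nd -> ng rule
m2 = {"Bern": 0.9, "Zurich": 0.9, "Visp": 0.8}   # e.g. a nearly uniform vowel rule
print(derivation_map([m1, m2]))  # ≈ {'Bern': 0.63, 'Zurich': 0.09, 'Visp': 0.0}
```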

59 At test time, the goal is to compute a probability map for a text segment of unknown origin. [sent-183, score-0.921]

60 The probability map of a text segment depends on the probability maps of the words contained in the segment. [sent-186, score-0.264]

61 The probability map of a word depends on the probability maps of the derivations that yield the word. [sent-188, score-0.299]

62 The probability map of a derivation depends on the probability maps of the rules it consists of. [sent-194, score-0.362]

63 The lexicon already contains the probability maps of the derivations (see Section 5.1). [sent-198, score-0.247]

64 A dialectal word form may originate in different Standard German words. [sent-202, score-0.186]

65 Again, this is not true: a derivation that is valid in only 10% of the Swiss German dialect area is much more informative than a derivation that is valid in 95% of the dialect area. [sent-220, score-1.582]

66 The probability of a text segment s can be defined as the joint probability of all words w contained in the segment. [sent-224, score-0.237]
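
A sketch of the lookup-and-combine step, under two assumptions made explicit here: word maps are combined by a pointwise product (the joint-probability reading of the sentence above), and words missing from the lexicon are simply skipped, as Section 7 reports happens for roughly a third of the tokens.

```python
from math import prod

def segment_map(words, lexicon):
    """Condense word-level probability maps into one segment-level map.
    `lexicon` maps a dialect word form to its probability map over survey points;
    out-of-lexicon words are skipped."""
    maps = [lexicon[w] for w in words if w in lexicon]
    if not maps:
        return {}
    points = maps[0].keys()
    return {t: prod(m[t] for m in maps) for t in points}
```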

67 Table 1: The six dialect regions selected for our tests, with their annotation on Wikipedia and our abbreviation. [sent-236, score-0.803]

68 Figure 4: The localization of the six dialect regions used in our study. [sent-238, score-0.857]

69 Eight dialect categories contained more than 10 articles; we selected six dialects for our experiments (see Table 1 and Figure 4). [sent-240, score-0.96]

70 We compiled a test set consisting of 291 sentences, distributed across the six dialects according to their population size. [sent-241, score-0.304]

71 The gold dialect of these texts could be identified through metadata. [sent-247, score-0.703]

72 The Web data set contains 144 sentences (again distributed across the six dialects). We mainly chose websites of local sports and music clubs, whose localization allowed us to determine the dialect of their content. [sent-249, score-0.757]

73 The average is weighted by the relative population sizes of the dialect regions. [sent-252, score-0.75]

74 To provide a point of comparison for our dialect ID model, we created a baseline system that uses a character-n-gram approach. [sent-258, score-0.703]

75 This approach is fairly common for language ID and has also been successfully applied to dialect ID (Biadsy et al., 2009). [sent-259, score-0.703]

76 We trained 2-gram to 6-gram models for each dialect with the SRILM toolkit (Stolcke, 2002), using the Wikipedia development corpus. [sent-262, score-0.703]

77 We scored each sentence of the Wikipedia test set with each dialect model. [sent-263, score-0.703]

78 The predicted dialect was the one which obtained the lowest perplexity. [sent-264, score-0.703]
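
The paper trains 2- to 6-gram models with SRILM; the following is only a back-of-the-envelope substitute, a character-bigram scorer with add-one smoothing (the smoothing choice and the toy training strings are my assumptions, not the paper's), to illustrate the lowest-perplexity decision.

```python
import math
from collections import Counter

def train_char_bigrams(sentences):
    """Count character bigrams (with boundary markers) for one dialect's training data."""
    bigrams, unigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
    return bigrams, unigrams

def perplexity(sentence, model, vocab_size=100):
    """Per-character perplexity under an add-one-smoothed bigram model."""
    bigrams, unigrams = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    log_p, n = 0.0, 0
    for a, b in zip(chars[:-1], chars[1:]):
        log_p += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        n += 1
    return math.exp(-log_p / n)

def predict_dialect(sentence, models):
    """The predicted dialect is the one whose model assigns the lowest perplexity."""
    return min(models, key=lambda d: perplexity(sentence, models[d]))

models = {"BE": train_char_bigrams(["ds isch es Bispiu"]),
          "ZH": train_char_bigrams(["das isch es Biispil"])}
print(predict_dialect("das isch guet", models))  # picks whichever toy model fits better
```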

79 In all our evaluations, the average F-measures for the different dialects are weighted according to the relative population sizes of the dialect regions because the size of the test corpus is proportional to population size (see Section 6.1). [sent-278, score-0.703]
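
As a worked example of this weighting (region names and all figures invented):

```python
def weighted_average_f(f_by_region, population_by_region):
    """F-measures averaged with weights proportional to each region's population."""
    total = sum(population_by_region.values())
    return sum(f * population_by_region[r] / total for r, f in f_by_region.items())

print(weighted_average_f({"BE": 0.60, "ZH": 0.80},
                         {"BE": 1_000_000, "ZH": 1_500_000}))  # ≈ 0.72
```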

80 We acknowledge that a training corpus of only 100 sentences per dialect provides limited insight into the performance of the n-gram approach. [sent-280, score-0.703]

81 In contrast, our model yields probability maps of German-speaking Switzerland. Roughly, this weighting can be viewed as a prior (the probability of the text being constant): p(dialect | text) = p(text | dialect) ∗ p(dialect). [sent-290, score-0.177]

82 In order to evaluate its performance, we thus had to determine the geographic localization of the six dialect regions defined by the Wikipedia authors (see Table 1). [sent-291, score-0.935]

83 The predicted dialect region of a sentence s is defined as the region in which the most probable point has a higher value than the most probable point in any other region: Region(s) = argmax_R max_{t ∈ R} p(s | t).
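
In code, this decision rule looks roughly as follows (region membership lists and probabilities are invented):

```python
def predict_region(segment_probs, regions):
    """Region(s) = argmax over regions R of max_{t in R} p(s | t):
    pick the region whose best point beats every other region's best point."""
    return max(regions, key=lambda r: max(segment_probs[t] for t in regions[r]))

regions = {"BE": ["Bern", "Thun"], "VS": ["Visp"]}
segment_probs = {"Bern": 0.04, "Thun": 0.06, "Visp": 0.02}
print(predict_region(segment_probs, regions))  # -> "BE"
```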

84 The reason for this discrepancy probably lies in the spelling conventions assumed in the transformation rules: it seems that Web writers are closer to these (implicit) spelling conventions than Wikipedia authors. [sent-306, score-0.287]

85 Table 4: Performances of the word-based model using derivation maps weighted by word frequency. [sent-309, score-0.165]

86 However, one should note that the word-based dialect ID model is not limited to the six dialect regions used for evaluation here. [sent-311, score-1.506]

87 It can be used with any size and number of dialect regions of German-speaking Switzerland. [sent-312, score-0.776]

88 This contrasts with the n-gram model which has to be trained specifically on every dialect region; in this case, the Swiss German Wikipedia only contains two additional dialect regions with an equivalent amount of data. [sent-313, score-1.479]

89 In the previous section, we have defined the predicted dialect region as the one in which the most probable point (maximum) has a higher probability than the most probable point of any other region. [sent-315, score-0.87]

90 Therefore, we tested another approach: we defined the predicted dialect region as the one in which the average probability is higher than in any other region. Table 5: Performances of the word-based model using derivation maps weighted by their discriminative potential. [sent-318, score-0.985]
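
This variation amounts to a one-line change to the predict_region sketch given earlier (same toy data assumptions):

```python
def predict_region_by_mean(segment_probs, regions):
    """Pick the region with the highest average probability over its survey points."""
    return max(regions,
               key=lambda r: sum(segment_probs[t] for t in regions[r]) / len(regions[r]))
```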

91 Geographically large regions like BE tend to have internal dialect variation, and averaging over all dialects in the region leads to low figures. [sent-326, score-1.091]

92 In contrast, small regions show a quite homogeneous dialect landscape that may protrude over adjacent regions. [sent-327, score-0.776]

93 In our experiments, the word-based dialect identification model skipped about one third of all words (34% on the Wikipedia test set, 39% on the Web test set) because they could not be found in the lexicon. [sent-335, score-0.761]

94 In the evaluation presented above, the task consisted of identifying the dialect of single sentences. [sent-342, score-0.703]

95 Testing our dialect identification system on the paragraph or document level could thus provide more realistic results. [sent-345, score-0.761]

96 In this paper, we have compared two empirical methods for the task of dialect identification. [sent-346, score-0.703]

97 We therefore analyze this data in order to extract empirically grounded knowledge for more general use (the creation of the georeferenced rules), and then use this knowledge to perform the dialect ID task in conjunction with an unrelated data source (the Standard German corpus). [sent-350, score-0.763]

98 Structural analysis of dialect maps using methods from spatial statistics. [sent-404, score-0.798]

99 Natural language processing for the Swiss German dialect area. [sent-412, score-0.703]

100 Adaptive string distance measures for bilingual dialect lexicon induction. [sent-416, score-0.752]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dialect', 0.703), ('swiss', 0.43), ('german', 0.272), ('dialects', 0.23), ('phonetic', 0.101), ('id', 0.098), ('maps', 0.095), ('wikipedia', 0.091), ('region', 0.085), ('geographic', 0.078), ('regions', 0.073), ('derivations', 0.071), ('derivation', 0.07), ('dialectal', 0.07), ('dieth', 0.07), ('sds', 0.07), ('map', 0.069), ('conventions', 0.064), ('rules', 0.064), ('georeferenced', 0.06), ('scherrer', 0.06), ('identification', 0.058), ('localization', 0.054), ('alemannic', 0.05), ('switzerland', 0.05), ('lexicon', 0.049), ('spelling', 0.048), ('population', 0.047), ('atlas', 0.046), ('dialectological', 0.043), ('transformation', 0.043), ('written', 0.04), ('nd', 0.039), ('tiger', 0.039), ('segment', 0.036), ('der', 0.036), ('gss', 0.034), ('digitized', 0.034), ('probability', 0.032), ('variants', 0.032), ('yves', 0.031), ('web', 0.031), ('surface', 0.03), ('abreviation', 0.03), ('biadsy', 0.03), ('geographically', 0.03), ('six', 0.027), ('western', 0.027), ('articles', 0.027), ('rule', 0.026), ('performances', 0.026), ('probable', 0.025), ('phenomenon', 0.024), ('inquiry', 0.023), ('lookup', 0.023), ('phenomena', 0.023), ('geographical', 0.021), ('rambow', 0.021), ('forms', 0.021), ('linguistic', 0.02), ('material', 0.02), ('argr', 0.02), ('cavnar', 0.02), ('cherle', 0.02), ('cmb', 0.02), ('hotzenk', 0.02), ('hughes', 0.02), ('immer', 0.02), ('interdialectal', 0.02), ('megaioxn', 0.02), ('orthography', 0.02), ('rudolf', 0.02), ('rumpf', 0.02), ('schafer', 0.02), ('sprachatlas', 0.02), ('symbolized', 0.02), ('thirds', 0.02), ('writers', 0.02), ('transfer', 0.019), ('interpolation', 0.019), ('morphological', 0.018), ('weighting', 0.018), ('spoken', 0.018), ('valid', 0.018), ('specifies', 0.018), ('diacritics', 0.017), ('ccls', 0.017), ('deutschen', 0.017), ('vowel', 0.017), ('gen', 0.017), ('rek', 0.017), ('zh', 0.017), ('nt', 0.017), ('pointwise', 0.016), ('columbia', 0.016), ('cues', 0.016), ('columns', 0.016), ('mann', 0.015), ('originate', 0.015), ('irregular', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

Author: Yves Scherrer ; Owen Rambow

Abstract: We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a character-n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

2 0.080472127 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

3 0.057993911 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

4 0.054590061 84 emnlp-2010-NLP on Spoken Documents Without ASR

Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church

Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.

5 0.039839894 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignmentmodel. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteri- ors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-totarget and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

6 0.034101028 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

7 0.032570019 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

8 0.031495381 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

9 0.030262413 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

10 0.030033728 39 emnlp-2010-EMNLP 044

11 0.028341701 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

12 0.027588693 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

13 0.027514212 94 emnlp-2010-SCFG Decoding Without Binarization

14 0.026258012 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

15 0.026189575 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

16 0.025724996 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

17 0.025607789 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

18 0.025246644 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

19 0.025108062 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

20 0.024522584 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.099), (1, 0.031), (2, -0.021), (3, -0.006), (4, 0.035), (5, -0.009), (6, -0.057), (7, -0.041), (8, -0.04), (9, -0.054), (10, 0.005), (11, -0.019), (12, -0.016), (13, 0.028), (14, 0.01), (15, -0.022), (16, -0.098), (17, -0.032), (18, -0.02), (19, -0.027), (20, 0.017), (21, -0.134), (22, 0.011), (23, 0.106), (24, -0.058), (25, -0.001), (26, -0.012), (27, 0.065), (28, -0.051), (29, -0.144), (30, 0.055), (31, -0.174), (32, 0.09), (33, 0.144), (34, -0.222), (35, 0.042), (36, -0.254), (37, -0.136), (38, 0.13), (39, 0.429), (40, 0.163), (41, 0.005), (42, 0.106), (43, -0.004), (44, -0.327), (45, -0.038), (46, 0.074), (47, 0.329), (48, -0.002), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97016656 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

Author: Yves Scherrer ; Owen Rambow

Abstract: We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a character-n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

2 0.38004911 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

3 0.27282485 84 emnlp-2010-NLP on Spoken Documents Without ASR

Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church

Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.

4 0.20042957 94 emnlp-2010-SCFG Decoding Without Binarization

Author: Mark Hopkins ; Greg Langmead

Abstract: Conventional wisdom dictates that synchronous context-free grammars (SCFGs) must be converted to Chomsky Normal Form (CNF) to ensure cubic time decoding. For arbitrary SCFGs, this is typically accomplished via the synchronous binarization technique of (Zhang et al., 2006). A drawback to this approach is that it inflates the constant factors associated with decoding, and thus the practical running time. (DeNero et al., 2009) tackle this problem by defining a superset of CNF called Lexical Normal Form (LNF), which also supports cubic time decoding under certain implicit assumptions. In this paper, we make these assumptions explicit, and in doing so, show that LNF can be further expanded to a broader class of grammars (called “scope3”) that also supports cubic-time decoding. By simply pruning non-scope-3 rules from a GHKM-extracted grammar, we obtain better translation performance than synchronous binarization.

5 0.17052802 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured representation of multilingual knowledge; we discuss some alternatives in Section 3.

WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for "synonym sets." A synset is a child of another synset if it satisfies a hyponymy relationship; each child "is a" more specific instantiation of its parent concept (thus, hyponymy is often called an "isa" relationship). For example, a "dog" is a "canine" is an "animal" is a "living thing," etc. As an approximation, it is not unreasonable to assume that WordNet's structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fišer, 2008; Isahara et al., 2008).

Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows:

1. Draw transition probabilities β_h ∼ Dir(τ_h).
2. Draw stop probabilities ω_h ∼ Dir(κ_h).
3. For each language l, draw emission probabilities for that synset φ_{h,l} ∼ Dir(π_{h,l}).

For conciseness in the rest of the paper, we will refer to this generative process as the multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π).[2]

Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node i with probability ω_{i,1}, or it can continue to a child synset with probability ω_{i,0}. If the path continues to another child synset, it visits child j with probability β_{i,j}. If the path ends at a synset, it generates word k with probability φ_{i,l,k}.[3] The probability of a word being emitted from a path with visited synsets r and final synset h in language l is therefore

p(w, λ = r, h | l, β, ω, φ) = [ ∏_{(i,j)∈r} β_{i,j} ω_{i,0} ] (1 − ω_{h,1}) φ_{h,l,w}.   (1)

Note that the stop probability ω_h is independent of language, but the emission φ_{h,l} is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1 attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. Topic 1 is about baseball in English and about travel in German). Separating path from emission helps ensure that topics are consistent across languages.

[2] Variables τ_h, π_{h,l}, and κ_h are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. ∑_k τ_{h,k} is constant, but τ_{h,i} is not). For the bushier bridges (e.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wörterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). We took these frequencies and propagated them through the multilingual hierarchy, following LDAWN's (Boyd-Graber et al., 2007) formulation of information content (Resnik, 1995) as a Bayesian prior. The variance of the priors was initialized to be 1.0, but could be sampled during inference.

[3] Note that the language and word are taken as given, but the path through the semantic hierarchy is a latent random variable.

Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment.

1.2 The MLSLDA Model

We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA) (Blei and McAuliffe, 2007), which makes predictions based on the topics expressed in a document; this can be thought of as projecting the words in a document into a low-dimensional space whose dimension equals the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009).

The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a real-valued score to each document. The score comes from a normal distribution whose mean is the dot product between a regression parameter η, which encodes the influence of each topic on the observation, and the document's empirical topic frequencies, with variance σ². With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data.

The generative model is as follows:

1. For each topic i = 1 ... K, draw a topic distribution {β_i, ω_i, φ_i} from MULTDIRHIER(τ, κ, π).
2. For each document d = 1 ... M with language l_d:
   (a) Choose a distribution over topics θ_d ∼ Dir(α).
   (b) For each word in the document n = 1 ... N_d, choose a topic assignment z_{d,n} ∼ Mult(θ_d) and a path λ_{d,n} ending at word w_{d,n} according to Equation 1 using {β_{z_{d,n}}, ω_{z_{d,n}}, φ_{z_{d,n}}}.
3. Choose a response variable y_d ∼ Norm(η^T z̄_d, σ²), where z̄_d ≡ (1/N_d) ∑_{n=1}^{N_d} z_{d,n}.

Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document.

2 Inference

Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics.
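Before turning to the details of the sampler, the sketch below gives a rough end-to-end illustration of the generative model above (steps 1-3): it builds a tiny hand-made "synset tree", draws MULTDIRHIER-style transition, stop, and emission distributions per topic, generates a document in one language, and draws the response from Norm(η^T z̄_d, σ²). The tree, vocabularies, hyperparameter values, and sizes are all invented, and the path/emission step is simplified relative to Equation 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "synset tree": node 0 is the root with two leaf children, 1 and 2.
children = {0: [1, 2], 1: [], 2: []}
vocab = {"en": ["good", "bad", "ball", "trip"], "de": ["gut", "schlecht", "ball", "reise"]}

K, alpha, tau, kappa, pi, sigma = 2, 0.5, 1.0, 1.0, 0.5, 0.3
eta = np.array([1.5, -1.0])            # hypothetical per-topic regression weights

def draw_topic():
    """Step 1: one MULTDIRHIER-style draw of {beta, omega, phi} over the toy tree."""
    beta_ = {h: rng.dirichlet([tau] * len(kids)) for h, kids in children.items() if kids}
    omega = {h: rng.dirichlet([kappa] * 2) for h in children}       # [continue, stop]
    phi = {h: {l: rng.dirichlet([pi] * len(v)) for l, v in vocab.items()} for h in children}
    return beta_, omega, phi

def emit_word(topic, lang):
    """Walk down the tree (continue vs. stop), then emit a word in the given language."""
    beta_, omega, phi = topic
    h = 0
    while children[h] and rng.random() < omega[h][0]:               # continue with prob omega_{h,0}
        h = rng.choice(children[h], p=beta_[h])                     # branch to child j with prob beta_{h,j}
    return rng.choice(vocab[lang], p=phi[h][lang])                  # emit with prob phi_{h,l,w}

topics = [draw_topic() for _ in range(K)]                           # step 1, once per topic

def generate_document(lang, n_words):
    theta_d = rng.dirichlet([alpha] * K)                            # step 2(a)
    z_d = rng.choice(K, size=n_words, p=theta_d)                    # step 2(b): topic assignments
    words = [emit_word(topics[z], lang) for z in z_d]
    z_bar = np.bincount(z_d, minlength=K) / n_words                 # empirical topic proportions
    y_d = rng.normal(eta @ z_bar, sigma)                            # step 3: response
    return words, y_d

print(generate_document("de", 8))
```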
After randomly initializing the topics, we alternate between sampling the topic and path of a word (z_{d,n}, λ_{d,n}) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path, conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability

p(z_n = k, λ_n = r | z_{−n}, λ_{−n}, w_n, η, σ, Θ) = p(y_d | z, η, σ) p(λ_n = r | z_n = k, λ_{−n}, w_n, τ, κ, π) p(z_n = k | z_{−n}, α).   (2)

Each of these three terms reflects a different influence on the topics, from the vocabulary structure, the document's topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution.

As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over the possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, T_{k,i,j}; stop counts in synset i in topic k, O_{k,i,0}; continue counts in synset i in topic k, O_{k,i,1}; and emission counts in synset i in language l in topic k, F_{k,i,l}.

[Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies. Plates: Multilingual Topics, Text Documents, Sentiment Prediction.]

The probability of taking a path r is then

p(λ_n = r | z_n = k, λ_{−n}) = [ ∏_{(i,j)∈r} (T_{k,i,j} + τ_{i,j}) / (∑_{j'} T_{k,i,j'} + τ_{i,j'}) · (O_{k,i,1} + ω_{i,1}) / (∑_{s∈{0,1}} O_{k,i,s} + ω_{i,s}) ]   (transition)
    × (O_{k,r_end,0} + ω_{r_end,0}) / (∑_{s∈{0,1}} O_{k,r_end,s} + ω_{r_end,s}) · (F_{k,r_end,w_n} + π_{r_end,l,w_n}) / (∑_{w'} F_{k,r_end,w'} + π_{r_end,w'})   (emission)   (3)

Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009).

Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷ_d toward the true y_d will be favored. Dropping terms that are constant across all topics, the effect of the response variable is

p(y_d | z, η, σ) ∝ exp[ (1/σ²) ( y_d − (∑_{k'} N_{d,k'} η_{k'}) / N_d ) ( η_k / N_d ) ] exp[ … ],

where the first parenthesized factor captures the other words' influence on the response.
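To make the E-step concrete, here is a hypothetical, unnormalized scoring of candidate topics for one word in the spirit of Equation 2: a document-topic factor from counts, a path factor, and the response-likelihood factor. The count tables, paths, and hyperparameter values below are toy stand-ins, and the path factor is deliberately simplified relative to the transition/emission products of Equation 3.

```python
import numpy as np

# Toy counts for one topic k and one document d (stand-ins, not real sufficient statistics).
N_dk = np.array([4.0, 2.0, 1.0])        # topic counts in document d, excluding the current word
alpha = 0.5                             # document-topic Dirichlet hyperparameter
eta = np.array([2.0, -1.0, 0.5])        # hypothetical per-topic regression weights
sigma = 0.3
y_d = 1.2                               # observed response for document d
N_d = N_dk.sum() + 1                    # document length including the word being resampled

def doc_topic_factor(k):
    """p(z_n = k | z_{-n}, alpha): the usual collapsed Dirichlet-multinomial term."""
    return (N_dk[k] + alpha) / (N_dk.sum() + alpha * len(N_dk))

def path_factor(k, path):
    """Simplified stand-in for the smoothed transition/emission term of Equation 3."""
    # In the real model this multiplies transition, stop, and emission ratios along the
    # path; here each candidate path simply has a fixed, made-up score per topic.
    toy_scores = {("root", "good"): [0.6, 0.1, 0.3]}
    return toy_scores[path][k]

def response_factor(k):
    """Gaussian likelihood of y_d if the current word were assigned to topic k."""
    z_bar = (N_dk + np.eye(len(N_dk))[k]) / N_d
    mean = eta @ z_bar
    return np.exp(-((y_d - mean) ** 2) / (2 * sigma ** 2))

path = ("root", "good")
scores = np.array([doc_topic_factor(k) * path_factor(k, path) * response_factor(k)
                   for k in range(len(N_dk))])
probs = scores / scores.sum()           # normalize over the candidate set before sampling
print(probs)
```

In this sketch the response factor pulls probability mass toward topics whose regression weights move the predicted response closer to the observed y_d, mirroring the shift described above.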

6 0.15526523 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

7 0.15034312 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

8 0.13889571 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

9 0.13471654 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

10 0.1268957 59 emnlp-2010-Identifying Functional Relations in Web Text

11 0.11922247 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

12 0.11104647 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

13 0.10896178 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

14 0.10365929 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

15 0.097694807 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

16 0.095342435 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

17 0.09426342 39 emnlp-2010-EMNLP 044

18 0.092893042 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

19 0.091269746 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

20 0.091144688 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.013), (4, 0.012), (10, 0.032), (12, 0.039), (29, 0.093), (30, 0.012), (32, 0.015), (52, 0.032), (55, 0.31), (56, 0.053), (62, 0.014), (66, 0.09), (72, 0.078), (76, 0.044), (87, 0.012), (89, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71416301 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

Author: Yves Scherrer ; Owen Rambow

Abstract: We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a charactern-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

2 0.45724455 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

3 0.45210379 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

4 0.44968233 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

5 0.44580844 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

6 0.44521514 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

7 0.44457638 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

8 0.44193539 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

9 0.44142649 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

10 0.44116461 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

11 0.44010848 84 emnlp-2010-NLP on Spoken Documents Without ASR

12 0.43919632 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

13 0.43811655 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

14 0.4376252 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

15 0.4371227 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

16 0.43596438 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

17 0.43555441 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

18 0.43535486 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

19 0.43503046 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

20 0.43451324 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors