emnlp emnlp2010 emnlp2010-6 knowledge-graph by maker-knowledge-mining

6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation


Source: pdf

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. [sent-5, score-0.409]

2 In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. [sent-6, score-0.587]

3 High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. [sent-7, score-0.843]

4 Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. [sent-8, score-0.934]

5 The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models. [sent-9, score-0.741]

6 1 Introduction Sociolinguistics and dialectology study how language varies across social and regional contexts. [sent-10, score-0.502]

7 One challenge in the study of lexical variation is that term frequencies are influenced by a variety of factors, such as the topic of discourse. [sent-15, score-0.367]

8 We address this issue by adding latent variables that allow us to model topical variation explicitly. [sent-16, score-0.253]

9 We hypothesize that geography and topic interact, as “pure” topical lexical distributions are corrupted by geographical factors; for example, a sports-related topic will be rendered differently in New York and California. [sent-17, score-1.186]

10 Each author is imbued with a latent “region” indicator, which both selects the regional variant of each topic, and generates the author’s observed geographical location. [sent-18, score-0.891]

11 The regional corruption of topics is modeled through a cascade of logistic normal priors—a general modeling approach which we call cascading topic models. [sent-19, score-1.06]

12 The resulting system has multiple capabilities, including: (i) analyzing lexical variation by both topic and geography; (ii) segmenting geographical space into coherent linguistic communities; (iii) predicting author location based on text alone. [sent-20, score-0.977]

13 Many users of Twitter also supply exact geographical coordinates from GPS-enabled devices (e. [sent-23, score-0.383]

14 Text in computer-mediated communication is often more vernacular (Tagliamonte and Denis, 2008), and as such it is more likely to reveal the influence of geographic factors than text written in a more formal genre, such as news text (Labov, 1966). [sent-26, score-0.289]

15 We aggressively filter this stream, using only messages that are tagged with physical (latitude, longitude) coordinate pairs from a mobile client, and whose authors wrote at least 20 messages over this period. [sent-38, score-0.27]

16 Every message is tagged with a location, but most messages from a single individual tend to come from nearby locations (as they go about their day); for modeling purposes we use only a single geographic location for each author, simply taking the location of the first message in the sample. [sent-60, score-0.718]
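A minimal sketch of the preprocessing described in the last two sentences: keep only GPS-tagged messages, keep only authors with at least 20 messages, and assign each author the location of their first message. The message fields ("user", "lat", "lon", "text") and the in-memory list of dicts are illustrative assumptions, not the authors' actual pipeline.

```python
from collections import defaultdict

def build_author_corpus(messages, min_messages=20):
    """Group geotagged messages by author and keep prolific authors only."""
    by_author = defaultdict(list)
    for m in messages:
        # keep only messages carrying (latitude, longitude) coordinates
        if m.get("lat") is None or m.get("lon") is None:
            continue
        by_author[m["user"]].append(m)

    corpus = {}
    for user, msgs in by_author.items():
        if len(msgs) < min_messages:
            continue
        first = msgs[0]  # as above: use the first message's location for each author
        corpus[user] = {
            "location": (first["lat"], first["lon"]),
            "text": " ".join(m["text"] for m in msgs),
        }
    return corpus
```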

17 3 Model We develop a model that incorporates two sources of lexical variation: topic and geographical region. [sent-64, score-0.583]

18 We treat the text and geographic locations as outputs from a generative process that incorporates both topics and regions as latent variables. [sent-65, score-0.736]

19 During inference, we seek to recover the topics and regions that best explain the observed data. [sent-66, score-0.361]

20 At the base level of the model are “pure” topics (such as “sports”, “weather”, or “slang”); these topics are rendered differently in each region. [sent-67, score-0.454]

21 We call this general modeling approach a cascading topic model; we describe it first in general terms before moving to the specific application to geographical variation. [sent-68, score-0.701]

22 1 Cascading Topic Models Cascading topic models generate text from a chain of random variables. [sent-70, score-0.239]

23 At the beginning of the chain are the priors, followed by unadulterated base topics, which may then be corrupted by other factors (such as geography or time). [sent-74, score-0.336]

24 For example, consider a base “food” topic (a footnote notes that the region could instead be observed by using a predefined geographical decomposition) [sent-75, score-0.851]

25 that emphasizes words like dinner and delicious; the corrupted “food-California” topic would place weight on these words, but might place extra emphasis on other words like sprouts. [sent-79, score-0.298]

26 As in latent Dirichlet allocation (Blei et al., 2003), the base topics are selected by a per-token hidden variable z. [sent-82, score-0.284]

27 In the geographical topic model, the next level corresponds to regions, which are selected by a per-author latent variable r. [sent-83, score-0.7]

28 Formally, we draw each level of the cascade from a normal distribution centered on the previous level; the final multinomial distribution over words is obtained by exponentiating and normalizing. [sent-84, score-0.347]
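As a concrete illustration of this sentence, a short numpy sketch of one step of the cascade: a region-topic is a Gaussian draw centered on the base topic, and the word distribution is obtained by exponentiating and normalizing (a softmax). All dimensions and parameter values are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 5                                   # toy vocabulary size
a, b = 0.0, 1.0                         # illustrative prior mean and standard deviation

mu_k = rng.normal(a, b, size=W)         # base topic k, in log space
sigma_k = 0.5                           # regional standard deviation for topic k
eta_jk = rng.normal(mu_k, sigma_k)      # region j's corrupted version of topic k

def softmax(x):
    e = np.exp(x - x.max())             # exponentiate (stably) ...
    return e / e.sum()                  # ... and normalize

beta_jk = softmax(eta_jk)               # multinomial over words for topic k in region j
```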

29 2 The Geographic Topic Model The application of cascading topic models to geographical variation is straightforward. [sent-89, score-0.789]

30 For each author, the latent variable r corresponds to the geographical region of the author, which is not observed. [sent-91, score-0.657]

31 As described above, r selects a corrupted version of each topic: the kth basic topic has mean µk, with uniform diagonal covariance σk²I; for region j, we can draw the regionally-corrupted topic from N(µk, σk²I). [sent-92, score-0.744]

32 Given a vocabulary size W, the generative story is as follows: • Generate base topics: for each topic k < K, draw the base topic from a normal distribution with uniform diagonal covariance, µk ∼ N(a, b²I), and draw the regional variance from a Gamma distribution, σk² ∼ G(c, d). [sent-100, score-1.221]

33 Generate regional variants: for each region j < J, draw the region-topic ηjk from a normal distribution with uniform diagonal covariance: ηjk ∼ N(µk, σk²I). [sent-101, score-0.705]

34 Generate text and locations: for each document d, Draw topic proportions from a symmetric Dirichlet prior, θ ∼ Dir(α1). [sent-110, score-0.277]

35 Draw the region r from the multinomial distribution ϑ. [sent-111, score-0.266]
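Putting sentences 32 to 35 together, here is a hedged end-to-end sketch of the generative story. The Gamma shape/scale convention, the Dirichlet prior on the region proportions ϑ, and the isotropic location noise are assumptions made to keep the example runnable; the paper uses region-specific bivariate Gaussians for locations.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J, W, N_d = 3, 2, 8, 25            # toy sizes: topics, regions, vocabulary, tokens per author
a, b, c, d, alpha = 0.0, 1.0, 2.0, 2.0, 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Base topics and their regional variances
mu = rng.normal(a, b, size=(K, W))                    # mu_k ~ N(a, b^2 I)
sigma2 = rng.gamma(c, d, size=K)                      # sigma_k^2 ~ G(c, d), shape/scale assumed

# Regional variant of every topic
eta = np.array([[rng.normal(mu[k], np.sqrt(sigma2[k])) for k in range(K)]
                for _ in range(J)])                   # eta_jk ~ N(mu_k, sigma_k^2 I)

# Region proportions and region centers (stand-ins for vartheta and nu)
vartheta = rng.dirichlet(np.ones(J))
nu = rng.uniform(-1.0, 1.0, size=(J, 2))

def generate_author():
    theta = rng.dirichlet(alpha * np.ones(K))         # topic proportions, theta ~ Dir(alpha 1)
    r = rng.choice(J, p=vartheta)                     # latent region, r ~ Mult(vartheta)
    y = rng.normal(nu[r], 0.2)                        # observed (lat, lon); noise level is a stand-in
    z = rng.choice(K, size=N_d, p=theta)              # per-token topic assignments
    w = np.array([rng.choice(W, p=softmax(eta[r, zn])) for zn in z])
    return w, y, r
```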

36 4 Inference We apply mean-field variational inference: a fully-factored variational distribution Q is chosen to minimize the Kullback-Leibler divergence from the true distribution. [sent-115, score-0.411]

37 Mean-field variational inference with conjugate priors is described in detail elsewhere (Bishop, 2006; Wainwright and Jordan, 2008); we restrict our focus to the issues that are unique to the geographic topic model. [sent-116, score-0.708]

38 We place variational distributions over all latent variables of interest: µ, θ, z, r, ϑ, η, σ², ν, and Λ, updating each of these distributions in turn, until convergence. [sent-118, score-0.369]

39 The variational distributions over θ and ϑ are Dirichlet, and have closed form updates: each can be set to the sum of the expected counts, plus a term from the prior (Blei et al. [sent-119, score-0.341]
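As a concrete instance of the closed-form update this sentence describes (the symbol γ and the q(·) notation here are mine, not the paper's), the Dirichlet parameter for q(θ_d) in an LDA-style model is

\[ \gamma_{d,k} \;=\; \alpha \;+\; \sum_{n} q(z_{d,n} = k), \]

i.e. the prior pseudo-count plus the expected number of tokens in document d assigned to topic k; the update for the distribution over regions ϑ has the same form with \( \sum_d q(r_d = j) \) in place of the token counts.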

40 The variational distributions q(z) and q(r) are categorical, and can be set proportional to the expected joint likelihood—to set q(z) we marginalize over r, and vice versa. [sent-121, score-0.264]
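A runnable numpy sketch of these coupled categorical updates for a single document, with random stand-in values for the expected log-parameters. For brevity the q(r) update below omits the expected log-likelihood of the author's observed location, which the full model would also include.

```python
import numpy as np

rng = np.random.default_rng(4)
J, K, W, N = 3, 4, 50, 30                    # toy sizes: regions, topics, vocabulary, tokens
words = rng.integers(W, size=N)              # word ids of one document

# Stand-ins for the required expectations under q
Elog_beta = np.log(rng.dirichlet(np.ones(W), size=(J, K)))   # E[log p(w | r=j, z=k)]
Elog_theta = np.log(rng.dirichlet(np.ones(K)))               # E[log theta_k]
Elog_vartheta = np.log(rng.dirichlet(np.ones(J)))            # E[log vartheta_j]

qz = np.full((N, K), 1.0 / K)                # q(z_n = k)
qr = np.full(J, 1.0 / J)                     # q(r = j)

for _ in range(20):
    # q(z_n = k) proportional to exp(E[log theta_k] + sum_j q(r=j) E[log beta_{j,k,w_n}])
    log_qz = Elog_theta + np.einsum("j,jkn->nk", qr, Elog_beta[:, :, words])
    qz = np.exp(log_qz - log_qz.max(axis=1, keepdims=True))
    qz /= qz.sum(axis=1, keepdims=True)

    # q(r = j) proportional to exp(E[log vartheta_j] + sum_n sum_k q(z_n=k) E[log beta_{j,k,w_n}])
    log_qr = Elog_vartheta + np.einsum("nk,jkn->j", qz, Elog_beta[:, :, words])
    qr = np.exp(log_qr - log_qr.max())
    qr /= qr.sum()
```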

41 1 Regional Word Distributions The variational region-topic distribution ηjk is normal, with uniform diagonal covariance for tractability. [sent-124, score-0.349]

42 As in previous work on logistic normal topic models, we use a Taylor approximation for this term (Blei and Lafferty, 2006a). [sent-130, score-0.395]
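The awkward term is the expected log normalizer E[log Σ_i exp η_i], which has no closed form under a Gaussian q(η). The bound used in the correlated and dynamic topic model line of work (shown here as a sketch; the paper's exact treatment may differ in detail) follows from the first-order expansion of the logarithm around an auxiliary point ζ > 0:

\[ \log \sum_i \exp(\eta_i) \;\le\; \zeta^{-1} \sum_i \exp(\eta_i) + \log \zeta - 1, \]

which is tight when ζ = Σ_i exp(η_i), and whose expectation is available in closed form because \( \mathbb{E}[\exp \eta_i] = \exp(m_i + v_i/2) \) for a Gaussian with mean m_i and variance v_i.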

43 We introduce the following notation for expected counts: N(i, j, k) indicates the expected count of term i in region j and topic k, and N(j, k) = Σ_i N(i, j, k). [sent-132, score-0.549]

44 The first two terms represent the difference in expected counts for term i under the variational distributions q(z, r) and q(z, r, β); this difference goes to zero when β_jk^(i) perfectly matches N(i, j, k)/N(j, k). [sent-134, score-0.304]

45 2 Base Topics The base topic parameters are µk and σk²; in the variational distribution, q(µk) is normally distributed and q(σk²) is Gamma distributed. [sent-138, score-0.5]

46 Note that µk and σk2 affect only the regional word distributions ηjk. [sent-139, score-0.422]

47 An advantage of the logistic normal is that the variational parameters over µk are available in closed form. [sent-140, score-0.305]

48 The expectation of the base topic incorporates the prior and the average of the generated region-topics; these two components are weighted respectively by the expected variance of the region-topics ⟨σk²⟩ and the prior topical variance b². [sent-142, score-0.54]

49 The posterior variance V(µ) is a harmonic combination of the prior variance b² and the expected variance of the region topics. [sent-143, score-0.384]
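Concretely, this is the standard conjugate normal update. Writing ⟨σk⁻²⟩ for the expected precision of the region-topics under q and assuming the J regions are treated symmetrically (notation mine, not the paper's), each coordinate of the base topic gets

\[ V(\mu_k) = \Big(\tfrac{1}{b^2} + J\,\langle \sigma_k^{-2} \rangle\Big)^{-1}, \qquad \mathbb{E}[\mu_k] = V(\mu_k)\Big(\tfrac{a}{b^2} + \langle \sigma_k^{-2} \rangle \sum_{j} \mathbb{E}[\eta_{jk}]\Big), \]

so the posterior variance is indeed a harmonic combination of the prior variance b² and the expected variance of the region-topics, as stated above.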

50 The variational distribution over the region-topic variance σk² has Gamma parameters. [sent-144, score-0.279]

51 In our implementation, the variational updates are scheduled as follows: given expected counts, we iteratively update the variational parameters on the region-topics η and the base topics until convergence. [sent-148, score-0.704]

52 We then update the geographical parameters ν and Λ, as well as the distribution over regions ϑ. [sent-149, score-0.569]

53 Finally, for each document we iteratively update the variational parameters over θ, z, and r until convergence, obtaining expected counts that are used in the next iteration of updates for the topics and their regional variants. [sent-150, score-0.827]

54 First we train a Dirichlet process mixture model on the locations y, using variational inference on the truncated stick-breaking approximation (Blei and Jordan, 2006). [sent-153, score-0.295]

55 This automatically selects the number of regions J, and gives a distribution over each region indicator rd from geographical information alone. [sent-154, score-0.765]
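The authors' implementation is their own, but the same initialization can be approximated with an off-the-shelf truncated Dirichlet process Gaussian mixture; the sketch below uses scikit-learn's BayesianGaussianMixture purely as an illustration, with random stand-in coordinates.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

locations = np.random.default_rng(3).normal(size=(500, 2))   # stand-in (lat, lon) pairs

dpgmm = BayesianGaussianMixture(
    n_components=30,                                 # truncation level of the stick-breaking approximation
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
)
dpgmm.fit(locations)

resp = dpgmm.predict_proba(locations)                # initial q(r_d = j) from geography alone
effective_J = int((dpgmm.weights_ > 1e-2).sum())     # regions that actually receive mass
```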

56 The prior a is the expected mean of each topic µ; for each term i, we set a(i) = log N(i) − log N, where N(i) is the total count of i in the corpus and N = Σ_i N(i). [sent-157, score-0.396]
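A direct transcription of this empirical log-frequency prior with a toy corpus (the tokenization is illustrative only):

```python
import numpy as np
from collections import Counter

tokens = "the food in austin is delicious the food".split()   # toy corpus
counts = Counter(tokens)
N = sum(counts.values())

vocab = sorted(counts)
a = np.array([np.log(counts[w]) - np.log(N) for w in vocab])   # a(i) = log N(i) - log N
```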

57 Finally, the geographical model takes priors that are linked to the data: for each region, the mean is very weakly encouraged to be near the overall mean, and the covariance prior is set by the average covariance of clusters obtained by running K-means. [sent-160, score-0.618]

58 6 Evaluation For a quantitative evaluation of the estimated relationship between text and geography, we assess our model’s ability to predict the geographic location of unlabeled authors based on their text alone. [sent-161, score-0.397]

59 Alternatively, one might evaluate the attributed regional memberships of the words themselves. [sent-166, score-0.384]

60 To predict the unseen location yd, we iterate until convergence on the variational updates for the hidden topics zd, the topic proportions θd, and the region rd. [sent-171, score-1.03]
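Once these local variational updates have converged, q(r_d) has to be turned into a point estimate of the location. The exact decision rule is not spelled out in this summary, so the sketch below shows two natural choices, using stand-in region centers ν_j:

```python
import numpy as np

rng = np.random.default_rng(5)
J = 5
nu = rng.uniform(-1.0, 1.0, size=(J, 2))     # learned region centers (stand-ins)
q_r = rng.dirichlet(np.ones(J))              # fitted q(r_d) after the local updates converge

y_hat_mean = q_r @ nu                        # posterior-weighted average of region centers
y_hat_map = nu[np.argmax(q_r)]               # or: center of the most probable region
```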

61 Mixture of Unigrams A core premise of our approach is that modeling topical variation will improve our ability to understand geographical variation. [sent-174, score-0.527]

62 This is equivalent to a Bayesian mixture of unigrams in which each author is assigned a single, regional unigram language model that generates all of his or her text. [sent-176, score-0.556]

63 This model is equivalent to supervised latent Dirichlet allocation (Blei and McAuliffe, 2007): each topic is associated with equivariant Gaussian distributions over the latitude and longitude, and these topics must explain both the text and the observed geographical locations. [sent-180, score-0.988]

64 For unlabeled authors, we estimate latitude and longitude by estimating the topic proportions and then applying the learned geographical distributions. [sent-181, score-0.758]

65 This is a linear prediction f(z̄d; a) = (z̄dᵀ a_lat, z̄dᵀ a_lon) for an author’s topic proportions z̄d and topic-geography weights a ∈ R^{2K}. [sent-182, score-0.439]
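A sketch of this linear predictor with random stand-in values for the learned weights; in the actual baseline the weights would be learned from the training authors rather than drawn at random.

```python
import numpy as np

K = 10
rng = np.random.default_rng(2)
z_bar = rng.dirichlet(np.ones(K))       # author's topic proportions, z_bar_d
a = rng.normal(size=(2, K))             # weights: one row for latitude, one for longitude

lat_pred, lon_pred = a @ z_bar          # f(z_bar; a) = (z_bar . a_lat, z_bar . a_lon)
```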

66 Both the geographic topic model and supervised LDA use the best number of topics from the development set (10 and 5, respectively). [sent-205, score-0.654]

67 3 Results As shown in Table 1, the geographic topic model achieves the strongest performance on all metrics. [sent-212, score-0.485]

68 Note that the geographic topic model and the mixture of unigrams use identical code and parametrization; the only difference is that the geographic topic model accounts for topical variation, while the mixture of unigrams sets K = 1. [sent-216, score-1.197]

69 These results validate our basic premise that it is important to model the interaction between topical and geographical variation. [sent-217, score-0.439]

70 Figure 2: The effect of varying the number of topics on the median regression error (lower is better). [sent-226, score-0.443]

71 each word in the document: in text regression, each word is directly multiplied by a feature weight; in supervised LDA the word is associated with a latent topic first, and then multiplied by a weight. [sent-227, score-0.313]

72 Of course it is always possible to optimize classification accuracy directly, but such an approach would be incapable of predicting the exact geographical location, which is the focus of our evaluation (given that the desired geographical partition is unknown). [sent-231, score-0.688]

73 Note that the geographic topic model is also not trained to optimize classification accuracy. [sent-232, score-0.485]

74 Table 2: Example base topics (top line) and regional variants. [sent-233, score-0.625]

75 The regional variants show words that are strong compared to both the base topic and the background. [sent-235, score-0.695]

76 See Table 3 for definitions of slang terms; see Section 7 for more explanation and details. Figure 3: Regional clustering of the training set obtained by one randomly-initialized run of the geographical topic model. [sent-238, score-0.65]

77 7 Analysis Our model permits analysis of geographical variation in the context of topics that help to clarify the significance of geographically-salient terms. [sent-242, score-0.601]

78 Table 2 shows a subset of the results of one randomly-initialized run, including five hand-chosen topics (of 50 total) and five regions (of 13, as chosen automatically during initialization). [sent-243, score-0.361]

79 For the base topics we show the ten strongest terms in each topic as compared to the background word distribution. [sent-245, score-0.48]

80 For the regional variants, we show terms that are strong both regionally and topically: specifically, we select terms that are in the top 100 compared to both the background distribution and to the base topic. [sent-246, score-0.489]

81 The names for the topics and regions were chosen by the authors. [sent-247, score-0.361]

82 Spanish-language terms (papi, pues, nada, ese) tend to appear in regions with large Spanish-speaking populations—it is also telling that these terms appear in topics with emoticons and slang abbreviations, which may transcend linguistic barriers. [sent-251, score-0.463]

83 A large number of slang terms are found to have strong regional biases, suggesting that slang may depend on geography more than standard English does. [sent-253, score-0.68]

84 The terms af and hella display especially strong regional affinities, appearing in the regional variants of multiple topics (see Table 3 for definitions). [sent-254, score-0.989]

85 We caution that our findings are merely suggestive, and a more rigorous analysis must be undertaken before making definitive statements about the regional membership of individual terms. [sent-263, score-0.384]

86 We view the geographic topic model as an exploratory tool that may be used to facilitate such investigations. [sent-264, score-0.519]

87 Figure 3 shows the regional clustering on the training set obtained by one run of the model. [sent-265, score-0.384]

88 There are nine compact regions for major metropolitan areas, two slightly larger regions that encompass Florida and the area around Lake Erie, and two large regions that partition the country roughly into north and south. [sent-267, score-0.576]

89 8 Related Work The relationship between language and geography has been a topic of interest to linguists since the nineteenth century (Johnstone, 2010). [sent-268, score-0.401]

90 This research identifies the geographic distribution of individual queries and tags, but does not attempt to induce any structural organization of either the text or geographical space, which is the focus of our research. [sent-279, score-0.623]

91 A related approach is that of Mei et al. (2006), in which the distribution over latent topics in blog posts is conditioned on the geographical location of the author. [sent-281, score-0.771]

92 This is somewhat similar to the supervised LDA model that we consider, but their approach assumes that a partitioning of geographical space into regions is already given. [sent-282, score-0.536]

93 Methodologically, our cascading topic model is designed to capture multiple dimensions of variability: topics and geography. [sent-283, score-0.526]

94 Mei et al. (2007) include sentiment as a second dimension in a topic model, using a switching variable so that individual word tokens may be selected from either the topic or the sentiment. [sent-285, score-0.521]

95 However, our hypothesis is that individual word tokens reflect both the topic and the geographical aspect. [sent-286, score-0.583]

96 The use of cascading logistic normal distributions in topic models follows earlier work on dynamic topic models (Blei and Lafferty, 2006b; Xing, 2005). [sent-290, score-0.712]

97 9 Conclusion This paper presents a model that jointly identifies words with high regional affinity, geographicallycoherent linguistic regions, and the relationship between regional and topic variation. [sent-291, score-1.042]

98 The key modeling assumption is that regions and topics interact to shape observed lexical frequencies. [sent-292, score-0.361]

99 We validate this assumption on a prediction task in which our model outperforms strong alternatives that do not distinguish regional and topical variation. [sent-293, score-0.445]

100 Indeed, in a study of morphosyntactic variation, Szmrecsanyi (2010) finds that by the most generous measure, geographical factors account for only 33% of the observed variation. [sent-295, score-0.387]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('regional', 0.384), ('geographical', 0.344), ('geographic', 0.246), ('topic', 0.239), ('region', 0.196), ('regions', 0.192), ('jk', 0.19), ('variational', 0.189), ('topics', 0.169), ('geography', 0.162), ('location', 0.151), ('ki', 0.119), ('cascading', 0.118), ('messages', 0.115), ('regression', 0.105), ('twitter', 0.092), ('author', 0.089), ('variation', 0.088), ('blei', 0.088), ('draw', 0.08), ('covariance', 0.08), ('longitude', 0.08), ('normal', 0.078), ('latent', 0.074), ('dirichlet', 0.074), ('base', 0.072), ('geotagged', 0.069), ('tagliamonte', 0.069), ('slang', 0.067), ('topical', 0.061), ('corrupted', 0.059), ('dialectology', 0.059), ('gamma', 0.059), ('xd', 0.059), ('social', 0.059), ('spatial', 0.057), ('latitude', 0.057), ('variance', 0.057), ('locations', 0.055), ('backstrom', 0.052), ('exponentiating', 0.052), ('hella', 0.052), ('labov', 0.052), ('mixture', 0.051), ('updates', 0.048), ('diagonal', 0.047), ('rendered', 0.044), ('variable', 0.043), ('factors', 0.043), ('mean', 0.043), ('mei', 0.041), ('lda', 0.04), ('mobile', 0.04), ('term', 0.04), ('united', 0.039), ('users', 0.039), ('distributions', 0.038), ('logistic', 0.038), ('proportions', 0.038), ('sports', 0.038), ('expected', 0.037), ('prior', 0.037), ('connor', 0.037), ('yd', 0.037), ('multinomial', 0.037), ('gaussian', 0.037), ('gradient', 0.036), ('linguistic', 0.035), ('bivariate', 0.034), ('bucholtz', 0.034), ('cassidy', 0.034), ('compass', 0.034), ('crandall', 0.034), ('ellipses', 0.034), ('equivariant', 0.034), ('erie', 0.034), ('exploratory', 0.034), ('forthcoming', 0.034), ('goldvarb', 0.034), ('hlogp', 0.034), ('interviews', 0.034), ('jik', 0.034), ('jki', 0.034), ('kwak', 0.034), ('musician', 0.034), ('perceptual', 0.034), ('premise', 0.034), ('rap', 0.034), ('sankoff', 0.034), ('sociolinguistic', 0.034), ('xdtalat', 0.034), ('xdtalon', 0.034), ('cascade', 0.034), ('contiguous', 0.034), ('priors', 0.034), ('allocation', 0.033), ('distribution', 0.033), ('unigrams', 0.032), ('coherent', 0.031), ('variables', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

2 0.20720686 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

3 0.16352212 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

4 0.11891365 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

5 0.10581649 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay

Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1

6 0.095819883 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

7 0.090178981 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

8 0.089278832 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

9 0.084001467 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

10 0.082706489 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

11 0.080472127 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

12 0.080145769 84 emnlp-2010-NLP on Spoken Documents Without ASR

13 0.073213361 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

14 0.072496586 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

15 0.065672517 77 emnlp-2010-Measuring Distributional Similarity in Context

16 0.061052695 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

17 0.056667075 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

18 0.056580335 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

19 0.050421331 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

20 0.049592115 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.193), (1, 0.176), (2, -0.178), (3, -0.175), (4, 0.188), (5, 0.026), (6, -0.085), (7, -0.063), (8, -0.027), (9, -0.111), (10, -0.009), (11, -0.001), (12, -0.127), (13, 0.15), (14, 0.039), (15, 0.041), (16, -0.085), (17, 0.042), (18, 0.069), (19, -0.091), (20, 0.101), (21, -0.092), (22, 0.055), (23, 0.121), (24, 0.001), (25, 0.065), (26, 0.095), (27, 0.038), (28, -0.027), (29, -0.099), (30, -0.007), (31, -0.051), (32, 0.117), (33, -0.005), (34, -0.052), (35, 0.015), (36, -0.148), (37, -0.016), (38, -0.005), (39, 0.047), (40, 0.155), (41, -0.071), (42, 0.101), (43, 0.029), (44, -0.146), (45, 0.106), (46, 0.081), (47, 0.0), (48, -0.053), (49, -0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96120995 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

2 0.70664006 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

3 0.67948204 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
Separating path from emission helps ensure that topics are consistent across languages. Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment.

1.2 The MLSLDA Model

We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA; Blei and McAuliffe, 2007), which makes predictions based on the topics expressed in a document; this can be thought of as projecting the words in a document into a low-dimensional space whose dimension equals the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above.

We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a real-valued score to each document. The score comes from a normal distribution whose mean is the dot product between a regression parameter η, which encodes the influence of each topic on the observation, and the document's mean topic assignments, with variance σ². With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows:

1. For each topic i = 1 ... K, draw a topic distribution {β_i, ω_i, φ_i} from MULTDIRHIER(τ, κ, π).
2. For each document d = 1 ... M with language l_d:
   (a) Choose a distribution over topics θ_d ∼ Dir(α).
   (b) For each word in the document n = 1 ... N_d, choose a topic assignment z_{d,n} ∼ Mult(θ_d) and a path λ_{d,n} ending at word w_{d,n} according to Equation 1, using {β_{z_{d,n}}, ω_{z_{d,n}}, φ_{z_{d,n}}}.
3. Choose a response variable y_d ∼ Norm(η^T z̄_d, σ²), where z̄_d ≡ (1/N_d) Σ_{n=1}^{N_d} z_{d,n}.

Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document.

2 Inference

Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics.
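Both step 3 of the generative story above and the Gibbs conditional below rely on the document's mean topic-assignment vector z̄_d and the regression mean η·z̄_d. A small sketch of that computation follows, under the assumption that z_assignments holds the sampled topic indices for one document; the function and argument names are illustrative, not the authors' code.

```python
import numpy as np

def empirical_topic_proportions(z_assignments, K):
    """z-bar_d from step 3: the document's mean topic-assignment vector."""
    counts = np.bincount(z_assignments, minlength=K)
    return counts / counts.sum()

def predict_and_draw_response(z_assignments, eta, sigma, K,
                              rng=np.random.default_rng(0)):
    """Step 3 of the generative story: y_d ~ Normal(eta . z-bar_d, sigma^2).
    Returns both the regression mean and a sampled response."""
    z_bar = empirical_topic_proportions(z_assignments, K)
    mean = float(eta @ z_bar)                        # dot product eta^T z-bar_d
    return mean, rng.normal(loc=mean, scale=sigma)   # scale is the standard deviation
```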
After randomly initializing the topics, we alternate between sampling the topic and path of a word (z_{d,n}, λ_{d,n}) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path, conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability

p(z_n = k, λ_n = r | z_{−n}, λ_{−n}, w_n, η, σ, Θ) = p(y_d | z, η, σ) p(λ_n = r | z_n = k, λ_{−n}, w_n, τ, κ, π) p(z_n = k | z_{−n}, α).   (2)

Each of these three terms reflects a different influence on the topics, from the vocabulary structure, the document's topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution.

As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over the possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, T_{k,i,j}; stop counts in synset i in topic k, O_{k,i,0}; continue counts in synset i in topic k, O_{k,i,1}; and emission counts in synset i in language l in topic k, F_{k,i,l}.

(Figure 1: Graphical model representing MLSLDA; the panels show the multilingual topics, the text documents, and the sentiment prediction. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies.)

The probability of taking a path r is then

p(λ_n = r | z_n = k, λ_{−n}) = { ∏_{(i,j) ∈ r} [ (T_{k,i,j} + τ_{i,j}) / Σ_{j'} (T_{k,i,j'} + τ_{i,j'}) ] [ (O_{k,i,1} + ω_{i,1}) / Σ_{s ∈ {0,1}} (O_{k,i,s} + ω_{i,s}) ] }  (transition)
    × [ (O_{k,r_end,0} + ω_{r_end,0}) / Σ_{s ∈ {0,1}} (O_{k,r_end,s} + ω_{r_end,s}) ] [ (F_{k,r_end,w_n} + π_{r_end,l,w_n}) / Σ_{w'} (F_{k,r_end,w'} + π_{r_end,l,w'}) ]  (emission).   (3)

Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009). Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷ_d toward the true y_d will be favored. We drop terms that are constant across all topics; for the effect of the response variable,

p(y_d | z, η, σ) ∝ exp[ (1/σ²) ( y_d − Σ_{k'} N_{d,k'} η_{k'} / N_d ) … ],

where the term subtracted from y_d reflects the influence of the document's other words on the prediction.
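The Gibbs step in Equation 2 multiplies three factors and samples from the normalized result. The sketch below shows only that outer loop; the three factor callables stand in for the response, path, and topic terms whose exact forms are given by the equations above, and their implementations are assumed here rather than taken from the paper.

```python
import numpy as np

def gibbs_step_for_token(candidates, response_factor, path_factor, topic_factor,
                         rng=np.random.default_rng(0)):
    """One Gibbs update for a single token, following Equation 2: score every
    candidate (topic k, path r) by the product of the response, path, and
    topic factors, normalize, and sample one candidate.

    The factor callables stand in for p(y_d | z, eta, sigma),
    p(lambda_n = r | z_n = k, ...), and p(z_n = k | z_-n, alpha)."""
    scores = np.array([response_factor(k) * path_factor(k, r) * topic_factor(k)
                       for k, r in candidates], dtype=float)
    probs = scores / scores.sum()
    choice = rng.choice(len(candidates), p=probs)
    return candidates[choice]
```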

4 0.53087044 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

Author: Yves Scherrer ; Owen Rambow

Abstract: We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a character n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

5 0.52720118 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

Author: Zhiyuan Liu ; Wenyi Huang ; Yabin Zheng ; Maosong Sun

Abstract: Existing graph-based ranking methods for keyphrase extraction compute a single importance score for each word via a single random walk. Motivated by the fact that both documents and words can be represented by a mixture of semantic topics, we propose to decompose traditional random walk into multiple random walks specific to various topics. We thus build a Topical PageRank (TPR) on word graph to measure word importance with respect to different topics. After that, given the topic distribution of the document, we further calculate the ranking scores of words and extract the top ranked ones as keyphrases. Experimental results show that TPR outperforms state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.
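A rough sketch of the Topical PageRank idea described in this abstract: run one topic-biased PageRank per topic over a word graph, then mix the per-topic scores with the document's topic distribution. It assumes networkx's pagerank with a personalization vector whose weights cover the graph's words; the argument names and graph construction are illustrative, not the authors' code.

```python
import networkx as nx

def topical_pagerank_keyphrases(word_graph, topic_word_prefs, doc_topic_dist, top_n=5):
    """Sketch of TPR: one topic-biased PageRank per topic over the word graph,
    then a final score per word that mixes the per-topic ranks by the
    document's topic distribution.

    Assumed (illustrative) inputs:
      word_graph       networkx graph over candidate words (e.g. co-occurrence edges)
      topic_word_prefs dict: topic -> {word: weight}, the personalization
                       (preference) vector biasing the walk toward that topic
      doc_topic_dist   dict: topic -> p(topic | document)
    """
    per_topic_rank = {
        t: nx.pagerank(word_graph, alpha=0.85, personalization=prefs)
        for t, prefs in topic_word_prefs.items()
    }
    scores = {
        w: sum(doc_topic_dist[t] * per_topic_rank[t].get(w, 0.0)
               for t in doc_topic_dist)
        for w in word_graph.nodes
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```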

6 0.52113205 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

7 0.51942301 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

8 0.39304769 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

9 0.3613812 84 emnlp-2010-NLP on Spoken Documents Without ASR

10 0.35764521 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

11 0.3488473 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

12 0.33856764 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

13 0.32437888 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

14 0.31152064 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

15 0.29569447 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

16 0.24645388 77 emnlp-2010-Measuring Distributional Similarity in Context

17 0.21591817 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

18 0.20115247 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

19 0.19427255 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

20 0.18385249 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.019), (10, 0.015), (12, 0.019), (17, 0.05), (29, 0.111), (30, 0.066), (32, 0.022), (52, 0.027), (56, 0.069), (62, 0.019), (66, 0.082), (72, 0.037), (76, 0.032), (77, 0.031), (79, 0.013), (82, 0.016), (83, 0.016), (87, 0.025), (89, 0.247)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88182855 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.
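The retrieval side of the idea in this abstract can be pictured as a nearest-neighbor lookup in Hamming space over the learned binary codewords. The sketch below assumes the codes have already been learned (the learning of the hash functions is the paper's contribution and is not shown here); all names and structures are illustrative.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codewords stored as integers."""
    return bin(a ^ b).count("1")

def spelling_candidates(query_code: int, name_codes: dict, radius: int = 2):
    """Return lexicon names whose binary codewords lie within a small Hamming
    radius of the query's codeword, ranked by distance.  `name_codes` maps
    each known name to its previously learned integer codeword."""
    hits = sorted((hamming(query_code, code), name)
                  for name, code in name_codes.items())
    return [name for dist, name in hits if dist <= radius]
```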

same-paper 2 0.83443367 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

3 0.75241709 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

Abstract: We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRF) and its application to part-of-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar n-grams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain. Standard inference can be used at test time. Our approach is able to scale to very large problems and yields significantly improved target domain accuracy.

4 0.57285249 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language's data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning "topics," probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and future work, respectively.

5 0.56673443 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

Author: Ioannis Klapaftis ; Suresh Manandhar

Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs significantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.

6 0.5562132 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

7 0.55604076 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

8 0.55413818 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

9 0.55266279 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

10 0.55226201 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

11 0.54851681 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

12 0.54416221 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

13 0.54273343 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

14 0.54123688 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

15 0.53878462 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.537664 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

17 0.53733987 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

18 0.53503877 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

19 0.53372055 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

20 0.53220189 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation