emnlp emnlp2010 emnlp2010-100 knowledge-graph by maker-knowledge-mining

100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective


Source: pdf

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias at a topical level. In this paper we address the problem of modeling ideological perspective at a topical level using a factored topic model. We develop efficient inference algorithms using collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally, we give a Metropolis-Hastings inference algorithm for a semi-supervised extension, with decent results.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 amahmed@cs.cmu.edu Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. [sent-3, score-0.205]

2 While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias at a topical level. [sent-4, score-0.473]

3 In this paper we address the problem of modeling ideological perspective at a topical level using a factored topic model. [sent-5, score-0.7]

4 We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. [sent-6, score-0.309]

5 In this paper, we follow the notion of ideology as defined by Van Dijk (Dijk, 1998): “a set of general abstract beliefs commonly shared by a group of people.” [sent-9, score-0.475]

6 In other words, an ideology is a set of ideas that directs one’s goals, expectations, and actions. [sent-10, score-0.509]

7 We can attribute the lexical variations of the word content of a document to three factors, the first of which is the writer’s ideological belief. [sent-15, score-0.242]

8 om and choice regardless of the topical content of the document. [sent-17, score-0.208]

9 These words define the abstract notion of belief held by the writer, and their frequency in the document largely depends on the writer’s style. [sent-18, score-0.273]

10 For instance, a document about abortion is more likely to have facts related to abortion, health, marriage and relationships. [sent-22, score-0.273]

11 Given a collection of ideologically-labeled documents, our goal is to develop a computer model that factors the document collection into a representation that reflects the aforementioned three sources of lexical variations. [sent-26, score-0.269]

12 By visualizing the abstract belief in each ideology, and the way each ideology approaches and views mainstream topics, the user can view and contrast each ideology side-by-side and build the right mental landscape that acts as the basis for his/her future decision making. [sent-30, score-1.151]

13 Classification: Given a document, we would like to tell the user from which side it was written, and explain the ideological bias in the document at a topical level. [sent-34, score-0.704]

14 Given a document written from perspective A, we would like the model to provide the user with other documents that represent alternative views about the same topic addressed in the original document. [sent-36, score-0.61]

15 We introduce a factored topic model that we call multi-view Latent Dirichlet Allocation or mview-LDA for short. [sent-39, score-0.275]

16 Our model views the word content of each document as the result of the interaction between the document’s ideological and topical dimensions. [sent-40, score-0.553]

17 In contrast, in modeling ideology we aim at contrasting two or more ideological perspectives, each of which is subjective in nature. [sent-50, score-0.314]

18 The research goal of sentiment analysis and classification is to identify language used to convey positive and negative opinions, which differs from contrasting two ideological perspectives. [sent-57, score-0.281]

19 While ideology can be expressed in the form of a sentiment toward a given topic, like abortion, ideological perspectives are reflected in many ways other than sentiments, as we will illustrate later in the paper. [sent-58, score-0.796]

20 However, this work still addresses ideology at an abstract level as opposed to our approach of modeling ideology at a topical level. [sent-62, score-1.121]

21 3 Multi-View Topic Models In this section we introduce multi-view topic models, or mview-LDA for short. [sent-64, score-0.275]

22 Our model, mviewLDA, views each document as the result of the interaction between its topical and ideological dimensions. [sent-65, score-0.516]

23 The model seeks to explain lexical variabilities in the document by attributing these variabilities to one of those dimensions or to their interactions. [sent-66, score-0.409]

24 Topic models, like LDA, define a generative process for a document collection based on a set of parameters. [sent-67, score-0.266]

25 LDA employs a semantic entity known as topic to drive the generation of the document in question. [sent-68, score-0.48]

26 Each topic is represented by a topic-specific word distribution which is modeled as a multinomial distribution over words, denoted by Multi(β). [sent-69, score-0.333]

27 For each word: (a) draw a topic zn | θd ∼ Mult(θd). [sent-73, score-0.569]

28 The k-th component of this vector defines how likely topic k will appear in document d. [sent-77, score-0.48]

29 For each word in the document wn, a topic indicator zn is sampled from θd, and then the word itself is sampled from a topic-specific word distribution specified by this indicator. [sent-78, score-0.774]
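
As a concrete illustration of the LDA generative process just described, the sketch below samples one document. It is a minimal, assumed implementation written here for clarity; the array names alpha and beta and the use of numpy are choices made for this sketch, not code from the paper.

```python
import numpy as np

def generate_lda_document(n_words, alpha, beta):
    """Sample one document from the LDA generative process.

    alpha : (K,) Dirichlet prior over the document-specific topic proportions theta_d.
    beta  : (K, V) array; each row is a topic-specific word distribution Multi(beta_k).
    """
    K, V = beta.shape
    theta_d = np.random.dirichlet(alpha)        # document-specific topic mixture
    words, indicators = [], []
    for _ in range(n_words):
        z_n = np.random.choice(K, p=theta_d)    # topic indicator sampled from theta_d
        w_n = np.random.choice(V, p=beta[z_n])  # word sampled from the chosen topic
        words.append(w_n)
        indicators.append(z_n)
    return words, indicators
```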

30 Thus LDA can capture and represent lexical variabilities via the components of θd, which represents the topical content of the document. [sent-79, score-0.31]

31 In the next section we will explain how our new model mview-LDA can capture other sources of lexical variabilities beyond topical content. [sent-80, score-0.273]

32 1 Multi-View LDA As we noted earlier, LDA captures lexical variabilities due to topical content via θd and the set of topics β1:K. [sent-82, score-0.413]

33 In mview-LDA each document d is tagged with the ideological view it represents via the observed variable vd which takes values in the discrete range: {1, 2, · · · , V } as shown in Fig. [sent-83, score-0.813]

34 In addition, we utilize an ideology-specific topic Ωv, which is again a multinomial distribution over the same vocabulary. [sent-88, score-0.384]

35 For example, topic φv,k represents how ideology v addresses topic k. [sent-92, score-1.025]

36 The generative process of a document d with ideological view vd proceeds as follows: 1. [sent-93, score-0.842]

37 The bias of this coin determines the proportion of words in the document that are generated from its ideology background topic Ωvd. [sent-108, score-1.179]

38 As in LDA, we draw the document-specific topic proportion θd from a Dirichlet prior. [sent-109, score-0.366]

39 θd thus controls the lexical variabilities due to topical content inside the document. [sent-110, score-0.31]

40 To generate a word wn, we first generate a coin flip xn,1 from the coin ξd. [sent-111, score-0.238]

41 If it comes up heads, then we proceed to generate this word from the ideology-specific topic associated with the document’s ideological view vd. [sent-112, score-0.662]

42 In this case, the word is drawn independently of the topical content of the document, and thus accounts for the lexical variation due to the ideology associated with the document. [sent-113, score-0.712]

43 Now, we have two choices: either to generate this word directly from the ideology-independent portion of the topic βzn , or to draw the word from the ideology-specific portion φvd,zn . [sent-116, score-0.482]

44 The choice here is not document specific, but rather depends on the interaction between the ideology and the specific topic in question. [sent-117, score-0.99]

45 If the ideology associated with the document holds a strong opinion or view with regard to this topic, then we expect that most of the time we will take the second choice, and generate wn from φvd,zn ; and vice versa. [sent-118, score-0.954]

46 Finally, it is worth mentioning that the decision to model λzn at the topic-ideology level rather than at the document level, as we have done with ξd, stems from our goal to capture ideology-specific behavior on a corpus level rather than capturing document-specific writing style. [sent-123, score-0.239]

47 Moreover, computing the frequency of the event xn,2 = 0 and zn = k gives the document’s bias toward topic k per se. [sent-126, score-0.714]

48 Finally, it is worth mentioning that all multinomial topics in the model (β and Ω) are generated once for the whole collection from a symmetric Dirichlet prior; similarly, all bias variables λ1:K are sampled from a Beta distribution, also once, at the beginning of the generative process. [sent-127, score-0.361]
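
Putting the pieces above together, the following is a rough sketch of how a single document could be generated under mview-LDA. It is not the authors' code: the coin conventions (heads on xn,1 routes a word to the ideology background topic Ω, and xn,2 = 0 selects the ideology-specific portion φ rather than the shared portion β) follow the surrounding description, and all variable names and shapes are assumptions made for illustration.

```python
import numpy as np

def generate_mview_lda_document(n_words, v_d, alpha, a, b, beta, omega, phi, lam):
    """Sketch of the mview-LDA generative process for one document.

    v_d   : observed ideological view of the document (an index in 0 .. V-1)
    alpha : (K,) Dirichlet prior over the topic proportions theta_d
    a, b  : Beta prior parameters for the document-level coin xi_d
    beta  : (K, W) ideology-independent topic distributions
    omega : (V, W) ideology-specific background topics (Omega_v)
    phi   : (V, K, W) ideology-specific portions of each topic (phi_{v,k})
    lam   : (K,) ideology-topic bias coins (lambda_k)
    """
    K, W = beta.shape
    xi_d = np.random.beta(a, b)              # coin controlling how often Omega is used
    theta_d = np.random.dirichlet(alpha)     # topical content of the document
    words = []
    for _ in range(n_words):
        x1 = np.random.rand() < xi_d         # first coin flip ("heads" -> background topic)
        if x1:
            w_n = np.random.choice(W, p=omega[v_d])
        else:
            z_n = np.random.choice(K, p=theta_d)          # topic indicator
            x2 = np.random.rand() < lam[z_n]              # second coin; assumed: True -> shared beta
            dist = beta[z_n] if x2 else phi[v_d, z_n]     # x2 == 0 -> ideology-specific view of topic
            w_n = np.random.choice(W, p=dist)
        words.append(w_n)
    return words
```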

49 • Inference: Given a new document and a point estimate of the model parameters, find the posterior distribution of the latent variables associated with the document at hand: (θd, {xn,1}, {zn}, {xn,2}). [sent-132, score-0.285]

50 Under the generative process and hyperparameter choices outlined in Section 3, we seek to compute: P(d1:D, β1:K, Ω1:V, φ1:V,1:K, λ1:K | α, a, b, w, v), where d is a shorthand for the hidden variables (θd, ξd, z, x1, x2) in document d. [sent-134, score-0.264]

51 We integrate out, by collapsing, the following hidden variables: the topic-mixing vectors θd and the ideology bias ξd for each document, as well as all the multinomial topic distributions (β, Ω and φ), in addition to the ideology-topic biases given by the set of λ random variables. [sent-138, score-0.913]

52 Therefore, the state of the sampler at each iteration contains only the following topic indicators and coin flips for each document: (z, x1, x2). [sent-139, score-0.426]

53 At convergence, we can calculate expected values for all the parameters that were integrated out, especially for the topic distributions, for each document’s latent representation (mixing-vector) and for all coin biases. [sent-141, score-0.394]

54 For example, C^{WK}_{w,k} gives the number of times word w was sampled from the ideology-independent portion of topic k. [sent-143, score-0.333]
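
To illustrate how the integrated-out parameters can be recovered from such count matrices, the snippet below computes a smoothed posterior-mean estimate of the shared topics β. The symmetric smoothing constant eta and the (word, topic) layout of the count matrix are assumptions made here, not values taken from the paper.

```python
import numpy as np

def estimate_beta(word_topic_counts, eta=0.01):
    """Posterior-mean estimate of the shared topics beta from Gibbs counts.

    word_topic_counts : (W, K) array whose [w, k] entry counts how often word w
        was sampled from the ideology-independent portion of topic k.
    eta : assumed symmetric Dirichlet smoothing constant.
    Returns a (K, W) array whose rows are normalized topic distributions.
    """
    smoothed = word_topic_counts + eta
    return (smoothed / smoothed.sum(axis=0)).T
```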

55 As we mentioned in Section 3, to compute the ideology bias in addressing a given topic, say k, in a given document, say d, we can simply compute the expected value of the event xn,2 = 0 and zn = k across posterior samples. [sent-153, score-0.652]
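
A minimal sketch of that bias computation is shown below. It assumes the posterior samples for a document are stored as lists of (z_n, x_{n,2}) pairs for the words routed through a topic; that storage format is a choice made for this sketch rather than anything prescribed by the paper.

```python
def topic_bias(samples, k):
    """Estimate a document's ideological bias toward topic k.

    samples : list of posterior samples; each sample is a list of (z_n, x2_n)
        pairs for the words of the document that were routed through a topic.
    Returns the posterior frequency of the event (x2_n == 0 and z_n == k).
    """
    hits, total = 0, 0
    for sample in samples:
        for z_n, x2_n in sample:
            total += 1
            if z_n == k and x2_n == 0:
                hits += 1
    return hits / total if total else 0.0
```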

56 5 Data Sets We evaluated our model over three datasets: the bitterlemons corpus and two political blog datasets. [sent-154, score-0.208]

57 1 The Bitterlemons dataset The bitterlemons corpus consists of the articles published on the website http://bitterlemons. [sent-157, score-0.224]

58 Overall, the dataset contains 297 documents written from the Israeli point of view and 297 documents written from the Palestinian point of view. [sent-164, score-0.216]

59 (Yano et al., 2009) collected blog posts from blog sites focusing on American politics during the period November 2007 to October 2008. [sent-172, score-0.235]

60 We selected three blog sites from this dataset: the Right Wing News (right ideology), and the Carpetbagger and Daily Kos as representatives of the opposite ideology. [Figure 2: Illustrating the big-picture overview over the bitterlemons dataset using a few topics; panel labels: Israeli View, US role, Palestinian View.] [sent-173, score-0.316]

61 Each box lists the top words in the corresponding multinomial topic distribution. [sent-174, score-0.333]

62 The second dataset, referred to as Blog-2, is similar to Blog-1 in its topical content and time frame but larger in its blog coverage (Eisenstein and Xing, 2010). [sent-182, score-0.398]

63 1 Visualization and Browsing One advantage of our approach is its ability to create a “big-picture” overview of the interaction between ideology and topics. [sent-195, score-0.51]

64 In figure 2 we show a portion of that diagram over the bitterlemons dataset. [sent-196, score-0.218]

65 First note how the ideology-specific topics in both ideologies share the top three words, which highlights that the two ideologies seek peace even though they still disagree on other issues. [sent-197, score-0.68]

66 The figure gives examples of three topics: the US role, the Roadmap peace process, and the Arab involvement in the conflict (the names of these topics were hand-crafted). [sent-198, score-0.224]

67 For each topic, we display the top words in the ideology-independent part of the topic (β), along with the top words in each ideology’s view of the topic (φ). [sent-199, score-0.697]
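
Producing such a chart amounts to reading off the most probable words of each estimated multinomial. The helper below is a small, assumed utility; vocab, beta, and phi are hypothetical variables holding the vocabulary and the estimated distributions, and the ideology indices in the usage comment are placeholders.

```python
import numpy as np

def top_words(dist, vocab, n=10):
    """Return the n most probable words of a topic distribution (length-W array)."""
    return [vocab[i] for i in np.argsort(dist)[::-1][:n]]

# Hypothetical usage for one topic k of a chart like Figure 2:
#   top_words(beta[k], vocab)       # ideology-independent part of the topic
#   top_words(phi[v0, k], vocab)    # one ideology's view of the topic
#   top_words(phi[v1, k], vocab)    # the other ideology's view
```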

68 As we can see, the ideology-specific portion of the topic needn’t always represent a sentiment shared by its members toward a given topic; rather, it might include extra important dimensions that need to be taken into consideration when addressing this topic. [sent-202, score-0.464]

69 Another interesting topic addresses the involvement of the neighboring Arab countries in the conflict. [sent-203, score-0.309]

70 The user can use the above chart as an entry point to retrieve various documents pertinent to a given topic or to a given view over a specific topic. [sent-206, score-0.498]

71 For instance, if the user asks for a representative sample of the Israeli (Palestinian) view with regard to the roadmap process, the system can first retrieve documents tagged with the Israeli (Palestinian) view and having a high topical value in their latent representation θ over this topic. [sent-207, score-0.592]

72 As we discussed in Section 4, this can be done by computing the expected value of the event xn,2 = 0 and zn = k where k is the topic under consideration. [sent-209, score-0.569]

73 Then, given a test document, we predict its ideology using the following equation: vd = argmax_{v ∈ V} P(wd | v) (1). We use three baselines. [sent-214, score-0.747]
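
A hedged sketch of this decision rule is given below. How P(wd | v) is approximated is deliberately left abstract: log_likelihood is an assumed callable (for instance, one that scores the held-out document under posterior samples with vd clamped to each candidate view); it is not a routine defined in the paper.

```python
import numpy as np

def predict_view(doc, views, log_likelihood):
    """Predict the ideology of a document as argmax_v P(w_d | v), per Eq. (1).

    views          : list of candidate ideological views
    log_likelihood : assumed callable returning an estimate of log P(w_d | v)
    """
    scores = [log_likelihood(doc, v) for v in views]
    return views[int(np.argmax(scores))]
```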

74 discLDA is a conditional model that divides the available number of topics into class-specific topics and shared-topics. [sent-225, score-0.206]

75 1K topics across ideologies and then divide the rest of the topics between ideologies4. [sent-227, score-0.257]

76 We should note from this figure that mview-LDA peaks at a small number of topics; however, each topic is represented by three multinomials. [sent-232, score-0.275]

77 Moreover, it is evident from the figure that the experiment over the Blog-2 dataset, which measures each model’s ability to generalize to a totally unseen new blog, is a harder task than generalizing to unseen posts from the same blog. [sent-233, score-0.241]

78 , 2008) gave an optimization algorithm for learning the topic structure (transformation matrix); however, since the code is not available, we resorted to one of the fixed splitting strategies mentioned in the paper. [sent-236, score-0.275]

79 not necessarily all) words from the ideology-specific parts of each topic when addressing this topic. [sent-239, score-0.308]

80 Finally, it should be noted that the bitterlemons dataset is a multi-author dataset, and thus the models were tested on some authors that were not seen during training; however, two factors contributed to the good performance of all models over this dataset. [sent-240, score-0.288]

81 The first is the larger size of each document (740 words per document, as compared to 200 words per post in Blog-2), and the second is the more formal writing style in the bitterlemons dataset. [sent-241, score-0.57]

82 Full refers to the full model; No-Ω refers to a model in which the ideology-specific background topic Ω is turned off; and No-φ refers to a model in which the ideology-specific portions of the topics are turned off. [sent-246, score-0.378]

83 In fact, without φ, the model has little power to discriminate between ideologies beyond the ideology-specific background topic Ω. [sent-248, score-0.326]

84 In this corpus each document is associated with a meta-topic that highlights the issues addressed in the document, like “A possible Jordanian role”. [Figure 4: An ablation study over the bitterlemons dataset.] [sent-251, score-0.57]

85 We then used each document in the training set as a query to retrieve documents from the test set that address the same meta-topic as the query document but from the other side’s perspective. [sent-257, score-0.582]

86 Note that we have access to the view of the query document but not the view of the test document. [sent-258, score-0.462]

87 Given a query document d, we rank documents in the test set. [Figure 5: Evaluating the performance of the view-retrieval task.] [sent-262, score-0.356]

88 Intuitively, we would like the shared (ideology-independent) part of the mview-LDA representation θd to reflect variation due to the topical content, but not the ideological view, of the document. [sent-272, score-0.507]

89 Finally, we rank documents in the test set in descending order and evaluate the resulting ranking using three measures: the rank at full recall (lowest rank), the average rank, and the best rank of the ground-truth documents as they appear in the predicted ranking. [sent-275, score-0.284]
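
The snippet below sketches one way to produce and score such a ranking. Ranking by cosine similarity between shared topic vectors is an assumption made here for concreteness (the excerpt does not pin down the similarity function); the three measures follow the definitions in the sentence above.

```python
import numpy as np

def rank_by_shared_topics(theta_query, theta_test):
    """Rank test documents by cosine similarity of their shared topic vectors."""
    q = theta_query / np.linalg.norm(theta_query)
    T = theta_test / np.linalg.norm(theta_test, axis=1, keepdims=True)
    return np.argsort(-(T @ q))                   # document indices, best match first

def rank_measures(ranking, relevant):
    """Best rank, average rank, and rank at full recall of the ground-truth documents."""
    positions = np.where(np.isin(ranking, list(relevant)))[0] + 1   # 1-based ranks
    return int(positions.min()), float(positions.mean()), int(positions.max())
```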

90 Unfortunately, the above scheme does not mix well because the values of the integrals in (2) are very low for any view other than the view of the document in the current state of the sampler. [sent-297, score-0.482]

91 This happens because of the tight coupling between vd and the indicators (x1, x2, z). [sent-298, score-0.304]

92 To generate a sample from qv∗ (·), we run a few iterations of a restricted Gibbs scan over the document d, conditioned on fixing vd = v∗, and then take the last sample jointly with v∗ as our proposed new state. [sent-302, score-0.272]
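
The outline below sketches that Metropolis-Hastings update for a single document. restricted_gibbs and log_joint are assumed helper routines standing in for the restricted Gibbs scan and the collapsed joint probability, and the acceptance ratio shown omits the proposal-density correction a full implementation of the scheme would need; it is meant only to convey the control flow.

```python
import numpy as np

def mh_update_view(doc_state, views, restricted_gibbs, log_joint, n_inner=5):
    """One (simplified) Metropolis-Hastings update of a document's view v_d.

    doc_state        : dict holding the current hidden state, with key "v" for the view
    restricted_gibbs : assumed routine running n_inner Gibbs sweeps over (z, x1, x2)
                       with v_d clamped to the proposed view
    log_joint        : assumed routine scoring a (state, view) pair under the model
    """
    v_old = doc_state["v"]
    v_new = np.random.choice([v for v in views if v != v_old])
    proposed = restricted_gibbs(doc_state, v_new, n_inner)
    log_ratio = log_joint(proposed, v_new) - log_joint(doc_state, v_old)
    if np.log(np.random.rand()) < log_ratio:      # simplified accept/reject step
        proposed["v"] = v_new
        return proposed
    return doc_state
```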

93 8 Discussion and Future Work In this paper, we addressed the problem of modeling ideological perspective at a topical level. [sent-320, score-0.425]

94 We developed a factored topic model that we called multiView-LDA or mview-LDA for short. [sent-321, score-0.275]

95 mviewLDA factors a document collection into three sets of topics: ideology-specific, topic-specific, and ideology-topic ones. [sent-322, score-0.237]

96 We showed that the resulting representation can be used to give a bird’s-eye view of where each ideology stands with regard to mainstream topics. [sent-323, score-0.622]

97 Moreover, we illustrated how the latent structure induced by the model can be used to perform bias detection at the document and topic level, and to retrieve documents that represent alternative views. [sent-324, score-0.59]

98 It is important to mention that our model induces a hierarchical structure over the topics, and thus it is interesting to contrast it with hierarchical topic models like hLDA (Blei et al. [sent-325, score-0.275]

99 Second, the semantics of the hierarchical structure in our model is different from that induced by those models, since documents in our model are constrained to use a specific portion of the topic structure while in those models documents can freely sample words from any topic. [sent-329, score-0.485]

100 Hierarchical topic models and the nested Chinese restaurant process. [sent-477, score-0.275]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ideology', 0.475), ('zn', 0.294), ('topic', 0.275), ('vd', 0.272), ('ideological', 0.223), ('document', 0.205), ('topical', 0.171), ('bitterlemons', 0.16), ('wn', 0.128), ('coin', 0.119), ('palestinian', 0.113), ('view', 0.113), ('israeli', 0.107), ('bias', 0.105), ('topics', 0.103), ('variabilities', 0.102), ('blog', 0.092), ('draw', 0.091), ('lda', 0.087), ('roadmap', 0.085), ('documents', 0.076), ('abortion', 0.068), ('disclda', 0.068), ('mviewlda', 0.068), ('yano', 0.068), ('writer', 0.068), ('dataset', 0.064), ('gibbs', 0.059), ('liberal', 0.058), ('sentiment', 0.058), ('multinomial', 0.058), ('portion', 0.058), ('views', 0.054), ('sampling', 0.054), ('multi', 0.053), ('titov', 0.052), ('subjective', 0.051), ('idealogical', 0.051), ('ideologies', 0.051), ('ideologyspecific', 0.051), ('integrals', 0.051), ('peace', 0.051), ('staying', 0.051), ('posts', 0.051), ('blei', 0.05), ('posterior', 0.05), ('political', 0.048), ('collapsed', 0.046), ('blogs', 0.046), ('svm', 0.046), ('wd', 0.044), ('rank', 0.044), ('settlements', 0.044), ('eisenstein', 0.043), ('toward', 0.04), ('arab', 0.039), ('content', 0.037), ('dirichlet', 0.037), ('conflict', 0.036), ('bernoulli', 0.036), ('interaction', 0.035), ('pang', 0.034), ('retrieve', 0.034), ('absorbed', 0.034), ('cddkdkd', 0.034), ('ckk', 0.034), ('cwwkk', 0.034), ('dijk', 0.034), ('directs', 0.034), ('gelman', 0.034), ('ideologyindependent', 0.034), ('involvement', 0.034), ('mainstream', 0.034), ('pachinko', 0.034), ('palestinians', 0.034), ('refereed', 0.034), ('mentioning', 0.034), ('ling', 0.034), ('totally', 0.034), ('ablation', 0.034), ('opinion', 0.033), ('addressing', 0.033), ('wiebe', 0.033), ('indicators', 0.032), ('mining', 0.032), ('collection', 0.032), ('perspective', 0.031), ('query', 0.031), ('mei', 0.03), ('xing', 0.03), ('informed', 0.03), ('variables', 0.03), ('israelis', 0.029), ('fortuna', 0.029), ('trimming', 0.029), ('epx', 0.029), ('lebanon', 0.029), ('quantity', 0.029), ('accounts', 0.029), ('generative', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias at a topical level. In this paper we address the problem of modeling ideological perspective at a topical level using a factored topic model. We develop efficient inference algorithms using collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally, we give a Metropolis-Hastings inference algorithm for a semi-supervised extension, with decent results.

2 0.23095496 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
We took these frequencies and propagated them through the multilingual hierarchy, following LDAWN’s (Boyd-Graber et al., 2007) formulation of information content (Resnik, 1995) as a Bayesian prior. The variance of the priors was initialized to be 1.0, but could be sampled during inference. 3Note that the language and word are taken as given, but the path through the semantic hierarchy is a latent random variable. 47 Topic 1 is about baseball in English and about travel in German). Separating path from emission helps ensure that topics are consistent across languages. Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment. 1.2 The MLSLDA Model We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA, (Blei and McAuliffe, 2007)), which makes predictions based on the topics expressed in a document; this can be thought of projecting the words in a document to low dimensional space of dimension equal to the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a realvalued score to each document. The score comes from a normal distribution whose sum is the dot product between a regression parameter η that encodes the influence of each topic on the observation and a variance σ2. With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows: 1. For each topic i= 1. . . K, draw a topic distribution {βi, ωi, φi} from MULTDIRHIER(τ, κ, π). 2. {Foβr each do}cuf mroemn tM Md = 1. . . M with language ld: (a) CDihro(oαse). a distribution over topics θd ∼ (b) For each word in the document n = 1. . . Nd, choose a topic assignment zd,n ∼ Mult (θd) and a path λd,n ending at word wd,n according to Equation 1using {βzd,n , ωzd,n , φzd,n }. 3. Choose a re?sponse variable from y Norm ?η> z¯, σ2?, where z¯ d ≡ N1 PnN=1 zd,n. ∼ Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document. 2 Inference Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics. 
After randomly initializing the topics, we alternate between sampling the topic and path of a word (zd,n, λd,n) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability p(zn = k, λn = r|z−n , λ−n, wn , η, σ, Θ) = p(yd|z, η, σ)p(λn = r|zn = k, λ−n, wn, τ, p(zn = k|z−n, α) . κ, π) (2) Each of these three terms reflects a different influence on the topics from the vocabulary structure, the document’s topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution. As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over possible multinomial distributions β, ω, and φ, using the observed transitions from ito j in topic k; Tk,i,j, stop counts in synset iin topic k, Ok,i,0; continue counts in synsets iin topic k, Ok,i,1 ; and emission counts in synset iin language lin topic k, Fk,i,l. The 48 Multilingual Topics Text Documents Sentiment Prediction Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies. probability of taking a path r is then p(λn = r|zn = k, λ−n) = (iY,j)∈r PBj0Bk,ik,j,i,+j0 τ+i,j τi,jPs∈0O,1k,Oi,1k,+i,s ω+i ωi,s! |(iY,j)∈rP{zP} Tran{szitiPon Ok,rend,0 + ωrend Fk,rend,wn + πrend,}l Ps∈0,1Ok,rend,s+ ωrend,sPw0Frend,w0+ πrend,w0 |PEmi{szsiPon} (3) Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009). Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response yˆd toward the true yd will be favored. We drop terms that are constant across all topics for the effect of the response variable, p(yd|z, η, σ) ∝ exp?σ12?yd−PPk0kN0Nd,dk,0kη0k0?Pkη0Nzkd,k0? |??PP{z?P?} . Other wPord{zs’ influence exp

3 0.16695549 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay

Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1

4 0.16352212 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

5 0.13412751 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

6 0.11333405 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

7 0.10965269 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

8 0.10641949 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

9 0.10568001 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

10 0.098230034 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

11 0.096464925 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

12 0.087578051 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

13 0.082634017 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

14 0.08186008 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification

15 0.077284828 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

16 0.076791696 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text

17 0.072472833 77 emnlp-2010-Measuring Distributional Similarity in Context

18 0.069634773 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.061594639 61 emnlp-2010-Improving Gender Classification of Blog Authors

20 0.061231084 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.197), (2, -0.272), (3, -0.193), (4, 0.252), (5, 0.016), (6, 0.013), (7, -0.088), (8, -0.036), (9, -0.116), (10, 0.001), (11, 0.029), (12, -0.085), (13, 0.103), (14, 0.015), (15, 0.085), (16, -0.027), (17, 0.105), (18, 0.051), (19, -0.064), (20, 0.083), (21, 0.016), (22, 0.06), (23, 0.047), (24, 0.05), (25, 0.047), (26, 0.053), (27, 0.035), (28, -0.004), (29, 0.022), (30, -0.024), (31, -0.006), (32, 0.064), (33, -0.051), (34, 0.054), (35, 0.035), (36, 0.005), (37, 0.022), (38, -0.051), (39, -0.141), (40, 0.006), (41, 0.048), (42, 0.007), (43, -0.049), (44, -0.008), (45, 0.042), (46, -0.002), (47, -0.038), (48, -0.012), (49, -0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97098368 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias at a topical level. In this paper we address the problem of modeling ideological perspective at a topical level using a factored topic model. We develop efficient inference algorithms using collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally, we give a Metropolis-Hastings inference algorithm for a semi-supervised extension, with decent results.

2 0.84080982 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
ProcMe IdTi,n Mgsas ofsa tchehu 2se0t1t0s, C UoSnAfe,r 9e-n1ce1 o Onc Etombepri 2ic0a1l0 M. ?ec th2o0d1s0 i Ans Nsaotcuiartaioln La fonrg Cuaogmep Purtoatcieosnsainlg L,in pgagueis ti 4c5s–5 , 1 Predictions from Multilingual Topics As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentimentbearing words, and because topic distributions are a parsimonious way to represent a document.1 LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure the underlying topics is consistent across languages. We discuss multilingual topic modeling in Section 1. 1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language. — — 1.1 Capturing Semantic Correlations Topic models posit a straightforward generative process that creates an observed corpus. For each docu- ment d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and 1The latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). 46 p(h aˇo|z) all tend to be high at the same time, or low at hth ˇaeo same lti tmened. tMoo bree generally, et sheam structure oorf our model must encourage topics to be consistent across languages, and Dirichlet distributions cannot encode correlations between elements. One possible solution to this problem is to use the multivariate normal distribution, which can produce correlated multinomials (Blei and Lafferty, 2005), in place of the Dirichlet distribution. This has been done successfully in multilingual settings (Cohen and Smith, 2009). However, such models complicate inference by not being conjugate. Instead, we appeal to tree-based extensions of the Dirichlet distribution, which has been used to induce correlation in semantic ontologies (Boyd-Graber et al., 2007) and to encode clustering constraints (Andrzejewski et al., 2009). The key idea in this approach is to assume the vocabularies of all languages are organized according to some shared semantic structure that can be represented as a tree. 
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
We took these frequencies and propagated them through the multilingual hierarchy, following LDAWN’s (Boyd-Graber et al., 2007) formulation of information content (Resnik, 1995) as a Bayesian prior. The variance of the priors was initialized to be 1.0, but could be sampled during inference. 3Note that the language and word are taken as given, but the path through the semantic hierarchy is a latent random variable. 47 Topic 1 is about baseball in English and about travel in German). Separating path from emission helps ensure that topics are consistent across languages. Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment. 1.2 The MLSLDA Model We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA, (Blei and McAuliffe, 2007)), which makes predictions based on the topics expressed in a document; this can be thought of projecting the words in a document to low dimensional space of dimension equal to the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009). The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a realvalued score to each document. The score comes from a normal distribution whose sum is the dot product between a regression parameter η that encodes the influence of each topic on the observation and a variance σ2. With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data. The generative model is as follows: 1. For each topic i= 1. . . K, draw a topic distribution {βi, ωi, φi} from MULTDIRHIER(τ, κ, π). 2. {Foβr each do}cuf mroemn tM Md = 1. . . M with language ld: (a) CDihro(oαse). a distribution over topics θd ∼ (b) For each word in the document n = 1. . . Nd, choose a topic assignment zd,n ∼ Mult (θd) and a path λd,n ending at word wd,n according to Equation 1using {βzd,n , ωzd,n , φzd,n }. 3. Choose a re?sponse variable from y Norm ?η> z¯, σ2?, where z¯ d ≡ N1 PnN=1 zd,n. ∼ Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document. 2 Inference Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics. 
2 Inference

Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics. After randomly initializing the topics, we alternate between sampling the topic and path of a word (zd,n, λd,n) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path conditioning on all of the other path and topic assignments in the corpus, selecting a path and topic with probability

p(zn = k, λn = r | z−n, λ−n, wn, η, σ, Θ) = p(yd | z, η, σ) · p(λn = r | zn = k, λ−n, wn, τ, κ, π) · p(zn = k | z−n, α).   (2)

Each of these three terms reflects a different influence on the topics: from the response variable, the vocabulary structure, and the document's topics. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution.

As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, Tk,i,j; stop counts in synset i in topic k, Ok,i,0; continue counts in synset i in topic k, Ok,i,1; and emission counts in synset i in language l in topic k, Fk,i,l.

[Figure 1: Graphical model representing MLSLDA (plates: Multilingual Topics, Text Documents, Sentiment Prediction). Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies.]

The probability of taking a path r is then

p(λn = r | zn = k, λ−n) = ∏(i,j)∈r [ (Tk,i,j + τi,j) / Σj′ (Tk,i,j′ + τi,j′) · (Ok,i,1 + ωi,1) / Σs∈{0,1} (Ok,i,s + ωi,s) ] · (Ok,rend,0 + ωrend,0) / Σs∈{0,1} (Ok,rend,s + ωrend,s) · (Fk,rend,l,wn + πrend,l,wn) / Σw′ (Fk,rend,l,w′ + πrend,l,w′),   (3)

where rend denotes the final synset on the path; the product over (i,j) is the transition term and the last two factors form the emission term. Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009).

Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷd toward the true yd will be favored. Dropping terms that are constant across all topics, the effect of the response variable is

p(yd | z, η, σ) ∝ exp( −(yd − η⊤ z̄d)² / (2σ²) ),

where z̄d is computed with the proposed assignment zn = k in place.
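As a rough illustration of how these three terms combine in the E-step, the sketch below scores candidate (topic, path) pairs and samples one. The path terms are assumed to have been computed elsewhere from the T, O, and F count tables via Equation 3, and every name in the snippet is hypothetical rather than taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_topic_and_path(candidates, doc_counts, alpha, y_d, eta, sigma, n_d):
    """Sketch of the joint (topic, path) draw of Equation 2.
    candidates maps topic k -> list of (path, path_term) pairs, where
    path_term is the count-based quantity of Equation 3; doc_counts are the
    document's topic counts excluding the token being resampled."""
    pairs, scores = [], []
    for k, paths in candidates.items():
        doc_term = doc_counts[k] + alpha[k]        # p(z_n = k | z_-n, alpha)
        z_bar = doc_counts.astype(float)
        z_bar[k] += 1.0                            # include proposed z_n = k
        z_bar /= n_d
        resp_term = np.exp(-0.5 * (y_d - eta @ z_bar) ** 2 / sigma ** 2)
        for path, path_term in paths:              # path_term from Equation 3
            pairs.append((k, path))
            scores.append(doc_term * path_term * resp_term)
    p = np.asarray(scores)
    p /= p.sum()
    return pairs[rng.choice(len(pairs), p=p)]

# Toy call: two topics, one candidate path each (path terms made up).
choice = sample_topic_and_path(
    {0: [("r0", 0.02)], 1: [("r1", 0.05)]},
    doc_counts=np.array([3, 1]), alpha=np.array([0.1, 0.1]),
    y_d=0.7, eta=np.array([1.0, -1.0]), sigma=0.5, n_d=5)
```

The M-step, not shown, re-fits η; for a Gaussian response, maximizing the likelihood amounts to a least-squares regression of the observed yd on the current z̄d vectors.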

3 0.8077805 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

4 0.740996 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

Author: Zhiyuan Liu ; Wenyi Huang ; Yabin Zheng ; Maosong Sun

Abstract: Existing graph-based ranking methods for keyphrase extraction compute a single importance score for each word via a single random walk. Motivated by the fact that both documents and words can be represented by a mixture of semantic topics, we propose to decompose traditional random walk into multiple random walks specific to various topics. We thus build a Topical PageRank (TPR) on word graph to measure word importance with respect to different topics. After that, given the topic distribution of the document, we further calculate the ranking scores of words and extract the top ranked ones as keyphrases. Experimental results show that TPR outperforms state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.

5 0.67648107 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

6 0.66597128 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

7 0.57590503 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

8 0.53480047 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

9 0.42887571 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

10 0.39222422 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

11 0.36305276 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification

12 0.32790115 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

13 0.31732237 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

14 0.31485951 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text

15 0.30854401 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.29429254 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

17 0.28305072 77 emnlp-2010-Measuring Distributional Similarity in Context

18 0.27539566 61 emnlp-2010-Improving Gender Classification of Blog Authors

19 0.2627596 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

20 0.25563642 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.013), (12, 0.023), (29, 0.091), (30, 0.428), (32, 0.012), (52, 0.023), (56, 0.114), (62, 0.02), (66, 0.08), (72, 0.044), (76, 0.033), (79, 0.014), (87, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92649287 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

same-paper 2 0.81320822 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

3 0.787467 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

4 0.56087291 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint opti- mization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

5 0.5079422 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

6 0.5026052 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

7 0.49272564 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

8 0.49108577 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

9 0.49031079 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

10 0.46946368 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

11 0.46568447 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

12 0.46560383 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

13 0.45473975 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

14 0.45326701 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

15 0.45033753 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

16 0.4442406 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

17 0.44275069 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

18 0.44084483 51 emnlp-2010-Function-Based Question Classification for General QA

19 0.43352154 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

20 0.43149325 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars