emnlp emnlp2010 emnlp2010-48 knowledge-graph by maker-knowledge-mining

48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails


Source: pdf

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada. Abstract: This work concerns automatic topic segmentation of email conversations. [sent-4, score-0.758]

2 We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. [sent-5, score-0.526]

3 To our knowledge, this is the first such email corpus. [sent-6, score-0.414]

4 We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)), which are solely based on lexical information, can be applied to emails. [sent-7, score-0.344]

5 Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly. [sent-11, score-0.888]

6 Effective processing of the email contents can be of great strategic value. [sent-13, score-0.414]

7 In this paper, we study the problem of topic segmentation for emails, i.e., [sent-14, score-0.344]

8 grouping the sentences of an email thread into a set of coherent topical clusters. [sent-16, score-0.663]

9 Adapting the standard definition of topic (Galley et al. [sent-17, score-0.24]

10 , 2003) to conversations/emails, we consider a topic to be something about which the participant(s) discuss, argue, or express their opinions. [sent-18, score-0.24]

11 For example, in the email thread shown in Figure 1, according to the majority of our annotators, participants discuss three topics (e.g., ‘telecon cancellation’, ‘TAG document’, and ‘responding to I18N’). [sent-19, score-0.78]

12 While extensive research has been conducted in topic segmentation for monologues (e.g. [sent-34, score-0.41]

13 Also, it is our key hypothesis that, because of the asynchronous nature of email and the use of quotation (Crystal, 2001), topics in an email thread often do not change in a sequential way. [sent-45, score-0.977]

14 As a result, we do not expect models which have proved successful in monologue or dialog to be as effective when they are applied to email conversations. [sent-46, score-0.459]

15 First, we present an email corpus annotated with topics and evaluate annotator agreement. [sent-50, score-0.531]

16 Third, we show how the two state-of-the-art topic segmentation methods (i.e., [sent-52, score-0.344]

17 LCSeg and LDA), which are solely based on lexical information and make strong assumptions on the resulting topic models, can be effectively applied to emails by having them consider, in a principled way, a finer-level structure of the underlying conversations. [sent-54, score-0.289]

18 2 Related Work Three research areas are directly related to our study: a) text segmentation models, b) probabilistic topic models, and c) extracting and representing the conversation structure of emails. [sent-57, score-0.531]

19 For the topic level, they achieve results similar to those of (Galley et al. [sent-81, score-0.24]

20 The probabilistic generative topic models, such as LDA and its variants (e.g., (Blei et al. [sent-86, score-0.24]

21 , 2003), (Steyvers and Griffiths, 2007)), have proven to be successful for topic segmentation in both monologue (e.g. [sent-89, score-0.389]

22 , 2006) uses a variant of LDA for the tasks of segmenting meeting transcripts and extracting the associated topic labels. [sent-97, score-0.278]

23 In our work, we show how the general LDA performs when applied to email conversations and describe how it can be extended to exploit the conversation structure of emails. [sent-99, score-0.632]

24 Several approaches have been proposed to capture an email conversation. [sent-100, score-0.601]

25 , 2007) present a method to capture an email conversation at a finer level by analyzing the embedded quotations in emails. [sent-106, score-0.68]

26 A fragment quotation graph (FQG) is generated, which is shown to be beneficial for email summarization. [sent-107, score-0.664]

27 In this paper, we show that topic segmentation models can also benefit significantly from this fine-grained conversation structure of email threads. [sent-108, score-0.972]

28 3 Corpus and Evaluation Metrics There are no publicly available email corpora annotated with topics. [sent-109, score-0.414]

29 We have annotated the BC3 email corpus (Ulrich et al. [sent-111, score-0.414]

30 The BC3 corpus, previously annotated with sentence-level speech acts, meta sentences, subjectivity, and extractive and abstractive summaries, is one of a growing number of corpora being used for email research. [sent-113, score-0.414]

31 The corpus contains 40 email threads from the W3C corpus2. [sent-114, score-0.526]

32 In the example email thread shown in Figure 1, a schism takes place when people discuss ‘responding to I18N’. [sent-122, score-0.676]

33 Not all annotators agree that the topic about ‘responding to I18N’ swerves from the one about ‘TAG document’. [sent-123, score-0.351]

34 , some are specific and some are general), and on the topic assignment of the sentences. [sent-126, score-0.24]

35 For the pilot study we picked five email threads randomly from the corpus. [sent-128, score-0.587]

36 The BC3 corpus had already been annotated for email summarization, speech act recognition and subjectivity detection. [sent-131, score-0.414]

37 The annotators also disagree on the topic labels; however, in this work we are not interested in finding the topic labels. [sent-139, score-0.588]

38 BC3 contains three human-written abstract summaries for each email thread. [sent-144, score-0.414]

39 With each email thread, the annotators were also given an associated human-written summary giving a brief overview of the corresponding conversation. [sent-145, score-0.715]

40 In the first phase, the annotators read the conversation and the associated summary and list the topics discussed. [sent-147, score-0.382]

41 The target number of topics and the topic labels were not given in advance and they were instructed to find as many topics as needed to convey the overall content structure of the conversation. [sent-151, score-0.474]

42 In the second phase the annotators identify the most appropriate topic for each sentence. [sent-152, score-0.318]

43 If they find any sentence that does not fit into any topic, they are told to label it with the predefined topic ‘OFF-TOPIC’. [sent-154, score-0.24]

44 (e.g., ‘hi’, ‘hello’) signifies the section (usually at the beginning) that people use to open an email. [sent-158, score-0.414]

45 Table 1 shows some basic statistics computed on the three annotations of the 39 email threads. [sent-164, score-0.45]

46 The annotators in the pilot and in the actual study were different, so we could reuse the threads used in the pilot study. [sent-165, score-0.312]

47 However, one thread on which the pilot annotators agreed fully was used as an example in the instruction manual. [sent-166, score-0.421]

48 These statistics (number of topics and topic density) indicate that the dataset is suitable for topic segmentation. [sent-177, score-0.597]

49 Again, our problem of topic segmentation for emails is not sequential in nature. [sent-186, score-0.466]

50 Therefore, the standard metrics widely used in sequential topic segmentation for monologues and dialogs, such as Pk and WindowDiff (WD), are also not applicable. [sent-187, score-0.44]

51 To compute the loc3 metric for the m-th sentence in the two annotations, we consider the previous 3 sentences: m-1, m-2 and m-3, and mark them as either ‘same’ or ‘different’ depending on their topic assignment. [sent-195, score-0.24]
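
To make this concrete, here is a minimal sketch of such a local (loc-k) agreement computation between two annotations, assuming each annotation is simply a list mapping sentence index to topic id; the function name and data layout are illustrative, not from the paper:

```python
def loc_k_agreement(annot_a, annot_b, k=3):
    """For each sentence m, mark each of the k preceding sentences as
    'same' or 'different' by topic assignment; agreement is the fraction
    of (m, j) pairs on which the two annotations give the same mark."""
    agree, total = 0, 0
    for m in range(1, len(annot_a)):
        for j in range(max(0, m - k), m):
            same_a = annot_a[m] == annot_a[j]
            same_b = annot_b[m] == annot_b[j]
            agree += int(same_a == same_b)
            total += 1
    return agree / total if total else 1.0
```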

52 If we consider the topic of a randomly picked sentence as a random variable then its entropy measures the level of detail in an annotation. [sent-199, score-0.24]
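
A matching sketch of that entropy computation, under the same illustrative list-of-topic-ids representation:

```python
import math
from collections import Counter

def annotation_entropy(annotation):
    """Entropy (in bits) of the topic of a uniformly sampled sentence;
    a more fine-grained annotation yields higher entropy."""
    n = len(annotation)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(annotation).values())
```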

53 4 Topic Segmentation Models Developing automatic tools for segmenting an email thread is challenging. [sent-212, score-0.675]

54 The example email thread in Figure 1 demonstrates why. [sent-213, score-0.637]

55 One can notice that email conversations are different from written monologues (e.g. [sent-215, score-0.511]

56 As a communication medium, email is distributed (unlike face-to-face meetings) and asynchronous (unlike chat), meaning that different people from different locations can collaborate at different times. (Footnotes 6-7: zero uncertainty happens when there is only one topic found; hence we do not use it to compare our models.) [sent-220, score-0.319]

57 [Figure 1 caption: green represents topic 1 (‘telecon cancellation’), orange indicates topic 2 (‘TAG document’) and magenta represents topic 3 (‘responding to I18N’).] [sent-222, score-0.72]

58 Therefore, topics in an email thread may not change in a sequential way. [sent-223, score-0.754]

59 These properties of email limit the application of techniques that have been successful in monologues and dialogues. [sent-234, score-0.48]

60 LDA and LCSeg are the two state-of-the-art models for topic segmentation of multi-party conversation (e.g. [sent-235, score-0.531]

61 In this section, we first describe how the existing models of topic segmentation can be applied to emails. [sent-241, score-0.344]

62 We then point out where these methods fail and propose extensions of these basic models for email conversations. [sent-242, score-0.414]

63 This model relies on the fundamental idea that documents are mixtures of topics, and a topic is a multinomial distribution over words. [sent-245, score-0.268]

64 The generative topic model specifies the following distribution over words within a document: $P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$, where $T$ is the number of topics. [sent-246, score-0.24]

65 $P(w_i \mid z_i = j)$ is the probability of word $w_i$ under topic $j$, and $P(z_i = j)$ is the probability that the $j$th topic was sampled for the $i$th word token. [sent-247, score-0.524]

66 This framework can be directly applied to an email thread by considering each email as a document. [sent-256, score-1.051]

67 By assuming the words in a sentence occur independently, we can estimate the topic assignments for sentences as follows: $P(z_i = j \mid s_k) = \prod_{w_i \in s_k} P(z_i = j \mid w_i)$, where $s_k$ is the $k$th sentence, whose topic we assign by $j^{*} = \arg\max_j P(z_i = j \mid s_k)$. [sent-260, score-0.509]
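
A minimal sketch of this sentence-level assignment, assuming per-word posteriors P(z_i = j | w_i) are already available from a trained LDA model as a word-to-probability-vector mapping (that layout, and the function name, are assumptions for illustration):

```python
import math

def assign_sentence_topic(sentence_words, word_topic_post, num_topics):
    """j* = argmax_j prod_{w_i in s_k} P(z_i = j | w_i),
    computed in log space to avoid underflow from long products."""
    best_j, best_logp = 0, float("-inf")
    for j in range(num_topics):
        logp = sum(math.log(word_topic_post[w][j] + 1e-12)
                   for w in sentence_words)
        if logp > best_logp:
            best_j, best_logp = j, logp
    return best_j
```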

68 LCSeg assumes that topic shifts are likely to occur where strong term repetitions start and end. [sent-264, score-0.24]

69 Low similarity indicates low lexical cohesion, and a sharp change signals a high probability of an actual topic boundary. [sent-270, score-0.24]

70 In order to apply LCSeg to email threads, we arrange the emails based on their temporal relation (i.e., [sent-272, score-0.648]

71 arrival time) and apply the LCSeg algorithm to get the topic boundaries. [sent-274, score-0.24]
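
A deliberately simplified sketch of the boundary-detection idea: it scores each gap by the cosine similarity of raw term-frequency vectors over the adjacent windows, whereas LCSeg proper builds the vectors from lexical-chain scores; sharp drops in the resulting curve suggest topic boundaries:

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, window=3):
    """sentences: list of token lists, in temporal order.
    Returns (gap_index, similarity) pairs; low similarity means low
    lexical cohesion across the gap, i.e. a likely topic boundary."""
    scores = []
    for gap in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[gap - window:gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + window] for w in s)
        scores.append((gap, cosine(left, right)))
    return scores
```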

72 I'm prepared to decide by email so we can formally respond by email. [sent-654, score-0.451]

73 > I'm prepared to decide by email so we can formally respond by email. [sent-1038, score-0.451]

74 3 Limitation of Existing Approaches The main limitation of the two models discussed above is that they make the bag-of-words (BOW) assumption without considering the fact that an email thread is a multi-party, asynchronous conversation. [sent-1341, score-0.716]

75 However, we argue that these models are still inadequate for finding topics in emails, especially when topics are closely related (e.g. [sent-1353, score-0.356]

76 To better identify the topics in an email thread we need to consider the email-specific conversation features (e.g. [sent-1356, score-0.47]

77 Specifically, we need to capture the conversation structure at the fragment (quotation) level and to incorporate this structure into our models. [sent-1364, score-0.265]

78 In the next section, we describe how one can capture the conversation structure at the fragment level in the form of a Fragment Quotation Graph (henceforth, FQG). [sent-1365, score-0.265]

79 Then, in Sections 5 and 6 respectively, we show how the LDA and LCSeg models can be extended so that they take this conversation structure into account for topic segmentation. [sent-1368, score-0.427]

80 4 Extracting Conversation Structure We demonstrate how to build an FQG through the example email thread involving 7 emails shown in Figure 1. [sent-1370, score-0.759]

81 We first identify the new (i.e., quotation depth = 0) and quoted (i.e., quotation depth > 0) fragments based on the usage of quotation (‘>’) marks. [sent-1376, score-0.411]

82 For instance, email E3 contains two new fragments (f, g), and two quoted fragments (d, e) of depth 1. [sent-1377, score-0.697]

83 For example, fragment de in E2 is divided into the distinct fragments d and e when compared with the fragments of E3. [sent-1381, score-0.246]

84 If an email does not contain quotes then the fragments of that email are connected to the fragments of the source email to which it replies. [sent-1390, score-1.488]
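
The following is a rough sketch of this construction under simplifying assumptions: quotation depth is read off the leading ‘>’ marks, fragments are maximal runs of lines at one depth, and each new fragment is linked to the quoted fragments neighbouring it (reply edges for quote-free emails, and the splitting of overlapping quotations across emails, are omitted here):

```python
def quote_depth(line):
    """Quotation depth = number of leading '>' marks."""
    depth = 0
    for ch in line:
        if ch == '>':
            depth += 1
        elif ch not in ' \t':
            break
    return depth

def email_fragments(body):
    """Split an email body into maximal runs of lines of equal depth."""
    frags, run, run_depth = [], [], None
    for line in body.splitlines():
        d = quote_depth(line)
        if run and d != run_depth:
            frags.append((run_depth, '\n'.join(run)))
            run = []
        run_depth = d
        run.append(line)
    if run:
        frags.append((run_depth, '\n'.join(run)))
    return frags

def fqg_edges(emails):
    """Edges from each new (depth-0) fragment to the quoted (depth > 0)
    fragments that neighbour it within the same email."""
    edges = []
    for body in emails:
        frags = email_fragments(body)
        for i, (d, text) in enumerate(frags):
            if d > 0:
                continue
            for j in (i - 1, i + 1):
                if 0 <= j < len(frags) and frags[j][0] > 0:
                    edges.append((text, frags[j][1]))
    return edges
```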

85 The advantage of the FQG is that it captures the conversation at a finer level of granularity, in contrast to the structure found by the ‘reply-to’ relation at the email level, which would be merely a sequence from E1 to E7 in this example. [sent-1391, score-0.65]

86 Hidden fragments are quoted fragments (the shaded fragment m in Figure 2, which corresponds to the fragment shown in bold in Figure 1) whose original email is missing in the user’s inbox. [sent-1393, score-0.853]

87 , 2007) study this phenomenon and its impact on email summarization in detail. [sent-1395, score-0.414]

88 The first step towards this aim is to regularize the topic-word distribution with a word network such that two connected words get similar topic distributions. [sent-1399, score-0.323]

89 Implicitly, by doing this we want two sentences in the same or adjacent fragments to have similar topic distributions, and fall in the same topical cluster. [sent-1429, score-0.389]
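
A sketch of how such a word network could be assembled from the FQG, connecting word pairs that co-occur in one fragment or across two fragments joined by an FQG edge (the data layout, fragment ids, and function name are illustrative; the resulting network is then used to regularize the topic-word distributions):

```python
from itertools import combinations

def build_word_network(fragment_words, fqg_edges):
    """fragment_words: fragment id -> list of words in that fragment.
    fqg_edges: iterable of (fragment id, fragment id) pairs.
    Returns an undirected set of word pairs to be tied together."""
    net = set()
    for words in fragment_words.values():
        net.update(combinations(sorted(set(words)), 2))
    for a, b in fqg_edges:
        for u in set(fragment_words[a]):
            for v in set(fragment_words[b]):
                if u != v:
                    net.add(tuple(sorted((u, v))))
    return net
```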

90 6 LCSeg with FQG If we examine the FQG carefully, different paths (considering the fragments of the first email as root nodes) can be interpreted as subconversations. [sent-1431, score-0.537]
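
A small sketch of that interpretation, enumerating root-to-leaf paths of the (acyclic) FQG so that LCSeg can be run on each path as a separate subconversation; the adjacency-list layout is an assumption:

```python
def fqg_paths(children, roots):
    """children: fragment id -> list of child fragment ids.
    roots: fragments of the first email.
    Returns every root-to-leaf path, each one a subconversation."""
    paths = []

    def walk(node, path):
        kids = children.get(node, [])
        if not kids:
            paths.append(path)
        for kid in kids:
            walk(kid, path + [kid])

    for root in roots:
        walk(root, [root])
    return paths
```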

91 The ‘All same’ baseline is optimal for threads containing only one topic, but its performance rapidly degrades as the number of topics in a thread increases. [sent-1484, score-0.452]

92 For a fair comparison of the systems, we set the same topic number per thread for all of them. [sent-1488, score-0.463]

93 If at least two of the annotators agree on the topic number, we set that number; otherwise we set the floor value of the average topic number. [sent-1489, score-0.591]
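
A tiny sketch of this rule, given the per-annotator topic counts for one thread (names illustrative):

```python
import math
from collections import Counter

def thread_topic_number(counts):
    """counts: topic counts from the (three) annotators of a thread.
    Use a count at least two annotators agree on; otherwise the floor
    of the average count."""
    value, freq = Counter(counts).most_common(1)[0]
    return value if freq >= 2 else math.floor(sum(counts) / len(counts))
```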

94 The maximum value of 1 is due to the fact that for some threads some annotators found only one topic (max: 1, min: 0.

95 A comparison of the basic LCSeg with the basic LDA reveals that LCSeg is a better model for email topic segmentation (p=0. [sent-1506, score-0.758]

96 7 Conclusion In this paper we presented an email corpus annotated for topic segmentation. [sent-1538, score-0.654]

97 Empirical evaluation shows that the fragment quotation graph helps both these models to perform significantly better than their basic versions, with LCSeg+FQG being the best performer. [sent-1540, score-0.25]

98 Incorporating domain knowledge into topic modeling via Dirichlet forest priors. [sent-1548, score-0.292]

99 A comparative study of mixture models for automatic topic segmentation of multiparty dialogues. [sent-1636, score-0.378]

100 A publicly available annotated corpus for supervised email summarization. [sent-1707, score-0.414]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lcseg', 0.473), ('email', 0.414), ('fqg', 0.249), ('topic', 0.24), ('lda', 0.227), ('thread', 0.223), ('conversation', 0.187), ('telecon', 0.158), ('quotation', 0.144), ('fragments', 0.123), ('emails', 0.122), ('topics', 0.117), ('threads', 0.112), ('segmentation', 0.104), ('feb', 0.092), ('zi', 0.082), ('asynchronous', 0.079), ('fragment', 0.078), ('galley', 0.078), ('annotators', 0.078), ('hsueh', 0.066), ('monologues', 0.066), ('pilot', 0.061), ('carenini', 0.056), ('regularize', 0.056), ('responding', 0.056), ('cancellation', 0.053), ('elsner', 0.052), ('dirichlet', 0.052), ('finer', 0.049), ('brian', 0.048), ('griffiths', 0.048), ('participant', 0.047), ('dialogs', 0.045), ('monologue', 0.045), ('date', 0.045), ('wi', 0.044), ('cut', 0.041), ('georgescul', 0.039), ('intro', 0.039), ('lock', 0.039), ('nserc', 0.039), ('rdf', 0.039), ('schism', 0.039), ('thu', 0.039), ('wed', 0.039), ('segmenting', 0.038), ('jeremy', 0.037), ('quoted', 0.037), ('respond', 0.037), ('subject', 0.037), ('steyvers', 0.037), ('annotations', 0.036), ('chat', 0.035), ('min', 0.035), ('segmenter', 0.034), ('max', 0.034), ('deadline', 0.034), ('cohesion', 0.034), ('bow', 0.034), ('multiparty', 0.034), ('andrzejewski', 0.034), ('malioutov', 0.034), ('subtopic', 0.034), ('chain', 0.033), ('agreement', 0.033), ('agree', 0.033), ('speaker', 0.032), ('dt', 0.032), ('conversations', 0.031), ('wg', 0.03), ('quotations', 0.03), ('draft', 0.03), ('disagree', 0.03), ('repetition', 0.03), ('metrics', 0.03), ('ny', 0.029), ('sk', 0.029), ('chains', 0.029), ('charniak', 0.028), ('multinomial', 0.028), ('blei', 0.028), ('graph', 0.028), ('fine', 0.027), ('network', 0.027), ('response', 0.026), ('topical', 0.026), ('participants', 0.026), ('aoki', 0.026), ('assoc', 0.026), ('charmod', 0.026), ('crystal', 0.026), ('disentanglement', 0.026), ('freddy', 0.026), ('ihave', 0.026), ('instruction', 0.026), ('itell', 0.026), ('ithink', 0.026), ('iunderstood', 0.026), ('johanna', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

2 0.15037492 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

Author: Chen Zhang ; Joyce Chai

Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation taking into consideration of conversation structures. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings have shown that the augmented representation with conversation structures is important, which achieves the best performance when combined with explicit modeling of long distance relations.

3 0.14410406 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

4 0.13412751 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

5 0.11891365 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

6 0.094784841 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

7 0.089694925 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

8 0.088911556 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

9 0.084287204 84 emnlp-2010-NLP on Spoken Documents Without ASR

10 0.078666821 77 emnlp-2010-Measuring Distributional Similarity in Context

11 0.072871193 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

12 0.063987464 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

13 0.059025016 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

14 0.056505535 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

15 0.056437533 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

16 0.052239835 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

17 0.051180612 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

18 0.050699916 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

19 0.048458945 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

20 0.044199422 86 emnlp-2010-Non-Isomorphic Forest Pair Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, 0.156), (2, -0.154), (3, -0.121), (4, 0.082), (5, 0.009), (6, -0.035), (7, -0.134), (8, -0.057), (9, -0.093), (10, -0.133), (11, -0.176), (12, -0.137), (13, 0.142), (14, -0.009), (15, -0.017), (16, -0.008), (17, 0.023), (18, 0.09), (19, -0.097), (20, 0.139), (21, -0.0), (22, -0.018), (23, 0.076), (24, 0.108), (25, 0.097), (26, -0.04), (27, 0.136), (28, 0.214), (29, 0.057), (30, 0.077), (31, 0.038), (32, -0.067), (33, -0.187), (34, 0.022), (35, 0.007), (36, 0.108), (37, -0.064), (38, 0.121), (39, 0.039), (40, -0.006), (41, -0.068), (42, -0.062), (43, -0.012), (44, 0.18), (45, 0.046), (46, -0.071), (47, -0.01), (48, 0.008), (49, -0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95274484 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

2 0.62484217 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

Author: Zhiyuan Liu ; Wenyi Huang ; Yabin Zheng ; Maosong Sun

Abstract: Existing graph-based ranking methods for keyphrase extraction compute a single importance score for each word via a single random walk. Motivated by the fact that both documents and words can be represented by a mixture of semantic topics, we propose to decompose traditional random walk into multiple random walks specific to various topics. We thus build a Topical PageRank (TPR) on word graph to measure word importance with respect to different topics. After that, given the topic distribution of the document, we further calculate the ranking scores of words and extract the top ranked ones as keyphrases. Experimental results show that TPR outperforms state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.

3 0.60838544 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

Author: Chen Zhang ; Joyce Chai

Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation taking into consideration of conversation structures. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings have shown that the augmented representation with conversation structures is important, which achieves the best performance when combined with explicit modeling of long distance relations.

4 0.54071969 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

5 0.51043469 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Author: Daniel Walker ; William B. Lund ; Eric K. Ringger

Abstract: Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unpro- cessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.

6 0.4687703 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

7 0.45661888 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

8 0.3797361 84 emnlp-2010-NLP on Spoken Documents Without ASR

9 0.32752222 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

10 0.27608752 77 emnlp-2010-Measuring Distributional Similarity in Context

11 0.25085652 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

12 0.24221897 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

13 0.23630691 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

14 0.21601012 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

15 0.21572313 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

16 0.20451537 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

17 0.20018347 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text

18 0.1995122 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

19 0.18189323 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

20 0.17910975 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.015), (10, 0.014), (12, 0.026), (29, 0.097), (30, 0.035), (32, 0.01), (52, 0.027), (56, 0.093), (62, 0.013), (64, 0.01), (66, 0.072), (72, 0.058), (76, 0.024), (79, 0.385), (87, 0.019), (89, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77614892 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

Author: Shafiq Joty ; Giuseppe Carenini ; Gabriel Murray ; Raymond T. Ng

Abstract: This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly.

2 0.77419156 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Syntactic consistency is the preference to reuse a syntactic construction shortly after its appearance in a discourse. We present an analysis of the WSJ portion of the Penn Treebank, and show that syntactic consistency is pervasive across productions with various lefthand side nonterminals. Then, we implement a reranking constituent parser that makes use of extra-sentential context in its feature set. Using a linear-chain conditional random field, we improve parsing accuracy over the generative baseline parser on the Penn Treebank WSJ corpus, rivalling a similar model that does not make use of context. We show that the context-aware and the context-ignorant rerankers perform well on different subsets of the evaluation data, suggesting a combined approach would provide further improvement. We also compare parses made by models, and suggest that context can be useful for parsing by capturing structural dependencies between sentences as opposed to lexically governed dependencies.

3 0.41454878 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

Author: Chen Zhang ; Joyce Chai

Abstract: While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation taking into consideration of conversation structures. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings have shown that the augmented representation with conversation structures is important, which achieves the best performance when combined with explicit modeling of long distance relations.

4 0.41016707 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

5 0.40748599 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

Author: Wei Lu ; Hwee Tou Ng

Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.

6 0.39909151 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

7 0.39851594 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

8 0.39705619 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

9 0.39153621 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

10 0.39027599 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

11 0.38731506 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

12 0.38642722 13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation

13 0.38191685 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

14 0.38160527 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

15 0.37998682 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

16 0.37979469 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

17 0.37892786 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

18 0.37699255 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

19 0.37593582 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

20 0.37148634 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text