acl acl2011 acl2011-287 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
Reference: text
sentIndex sentText sentNum sentScore
1 However, existing topic models generally cannot capture the latent topical structures in documents. [sent-2, score-0.635]
2 Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. [sent-3, score-0.504]
3 In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. [sent-4, score-1.178]
4 Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. [sent-5, score-0.458]
5 1 Introduction A great amount of effort has recently been made in applying statistical topic models (Hofmann, 1999; Blei et al. [sent-6, score-0.309]
6 In general, topic models can discover word clustering patterns in documents and project each document to a latent topic space formed by such word clusters. [sent-14, score-0.846]
7 , the document generation probabilities are invariant to content permutation. [sent-22, score-0.16]
8 Ignoring such latent topical structures inside the documents means wasting valuable clues about topics and thus would lead to non-optimal topic modeling. [sent-24, score-0.829]
9 Taking apartment rental advertisements as an example, when people write advertisements for their apartments, it’s natural to first introduce “size ” and “address” of the apartment, and then “rent” and “contact”. [sent-25, score-0.195]
10 If this kind of topical structure is captured by a topic model, it would not only improve the topic mining results, but, more importantly, also help many other document analysis tasks, such as sentence annotation and sentence ordering. [sent-27, score-1.14]
11 Nevertheless, very few existing topic models attempted to model such structural dependency among topics. [sent-28, score-0.356]
12 However, Aspect HMM separately estimates the topics in the training set and depends on heuristics to infer the transitional relations between topics. [sent-30, score-0.237]
13 , 2007) extends the traditional topic models by assuming words in each sentence share the same topic assignment, and topics transit between adjacent sentences. [sent-32, score-0.88]
14 The regular topic transitions, i.e., how likely one topic would follow another topic, are not captured in this model. [sent-35, score-0.336]
15 In this paper, we propose a new topic model, named Structural Topic Model (strTM), to model and analyze both latent topics and topical structures in text documents. [sent-38, score-0.789]
16 To do so, strTM assumes: 1) words in a document are either drawn from a content topic or a functional (i.e., background) topic; [sent-39, score-0.543]
17 2) words in the same sentence share the same content topic; and 3) content topics in the adjacent sentences follow a topic transition that satisfies the first order Markov property. [sent-41, score-0.828]
18 To evaluate the usefulness of the identified topical structures by strTM, we applied strTM to the tasks of sentence annotation and sentence ordering, where correctly modeling the document structure is crucial. [sent-43, score-0.552]
19 On the corpus of 8,031 apartment advertisements from Craigslist (Grenager et al. [sent-44, score-0.139]
20 , 2006), strTM achieved encouraging improvement in both tasks compared with the baseline methods that don’t explicitly model the topical structure. [sent-46, score-0.217]
21 The results confirm the necessity of modeling the latent topical structures inside documents, and also demonstrate the advantages of the proposed strTM over existing topic models. [sent-47, score-0.659]
22 , 2007), document summarization (Lu and Zhai, 2008) and image annotation (Blei and Jordan, 2003). [sent-51, score-0.138]
23 However, in most existing work, the dependency among the topics is loosely governed by the prior topic distribution, e.g. [sent-52, score-0.494]
24 Correlated Topic Model (Blei and Lafferty, 2007) replaces Dirichlet prior with logistic Normal prior for topic distribution in each document in order to capture the correlation between the topics. [sent-56, score-0.475]
25 But in HMM-LDA, only the latent variables for the syntactic classes are treated as a locally dependent sequence, while latent topics are treated the same as in other topic models. [sent-59, score-0.641]
26 introduced the generalized Mallows model to constrain the latent topic assignments (Chen et al. [sent-61, score-0.372]
27 In their model, they assume there exists a canonical order among the topics in the collection of related documents and the same topics are forced not to appear in disconnected portions of the topic sequence in one document (sampling without replacement). [sent-63, score-0.761]
28 Our method relaxes this assumption by only postulating transitional dependency between topics in the adjacent sentences (sampling with replacement) and thus potentially allows a topic to appear multiple times in disconnected segments. [sent-64, score-0.638]
29 HTMM models the document structure by assuming words in the same sentence share the same topic assignment and successive sentences are more likely to share the same topic. [sent-67, score-0.615]
30 However, HTMM only loosely models the transition between topics as a binary relation: the same as the previous sentence’s assignment or draw a new one with a certain probability. [sent-68, score-0.303]
31 In contrast, our strTM model explicitly captures the regular topic transitions by postulating the first order Markov property over the topics. [sent-70, score-0.453]
32 A deficiency of the content models is that the identification of clusters of text spans is done separately from transition modeling. [sent-75, score-0.192]
33 Our strTM addresses this deficiency by defining a generative process to simultaneously capture the topics and the transitional relationship among topics: allowing topic modeling and transition modeling to reinforce each other in a principled framework. [sent-76, score-0.73]
34 3 Structural Topic Model In this section, we formally define the Structural Topic Model (strTM) and discuss how it captures the latent topics and topical structures within the documents simultaneously. [sent-77, score-0.551]
35 From the theory of linguistic analysis (Kamp, 1981), we know that a document exhibits internal structure, where structural segments encapsulate semantic units that are closely related. [sent-78, score-0.151]
36 In strTM, we treat a sentence as the basic structure unit, and assume all the words in a sentence share the same topical aspect. [sent-79, score-0.372]
37 Besides, two adjacent segments are assumed to be highly related (capturing cohesion in text); specifically, in strTM we pose a strong transitional dependency assumption among the topics: the choice of topic for each sentence directly depends on the previous sentence’s topic assignment, i. [sent-80, score-0.724]
38 Moreover, taking the insights from HMM-LDA that not all the words are content conveying (some of them may just be a result of syntactic requirements), we introduce a dummy functional topic zB for every sentence in the document. [sent-83, score-0.463]
39 We use this functional topic to capture the document-independent word distribution, i. [sent-84, score-0.36]
40 As a result, in strTM, every sentence is treated as a mixture of content and functional topics. [sent-88, score-0.214]
41 Formally, we assume a corpus consists of D documents with a vocabulary of size V, and there are k content topics embedded in the corpus. [sent-89, score-0.271]
42 In a given document d, there are m sentences and each sentence i has Ni words. [sent-90, score-0.182]
43 We assume the topic transition probability p(z|z′) is drawn from a Multinomial distribution Mul(αz′), and the word emission probability under each topic p(w|z) is drawn from a Multinomial distribution Mul(βz). [sent-91, score-0.88]
44 To get a unified description of the generation process, we add another dummy topic T-START in strTM, which is the initial topic with position “-1” for every document but does not emit any words. [sent-92, score-0.722]
45 In addition, since our functional topic is assumed to occur in all the sentences, we don’t need to model its transition with other content topics. [sent-93, score-0.53]
46 We use a Binomial variable π to control the proportion between content and functional topics in each sentence. [sent-94, score-0.261]
47 Therefore, there are k+1 topic transitions, one for T-START and others for k content topics; and k emission probabilities for the content topics, with an additional one for the functional topic zB (in total k+1 emission probability distributions). [sent-95, score-0.919]
48 p(S0, S1, ..., Sm, z | α, β, π) = ∏i=1..m p(zi | α, zi−1) p(Si | zi) (1), where the topic-to-sentence emission probability is defined as: p(Si | zi) = ∏j=0..Ni [π p(wij | β, zi) + (1 − π) p(wij | β, zB)] (2). This process is graphically illustrated in Figure 1. [sent-100, score-0.436]
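To make Eq(1) and Eq(2) concrete, the following is a minimal sketch (not the authors' code) of how the generation probability of a document could be computed for one fixed topic chain; the names trans, emit, emit_bg, pi_w and start_id are hypothetical.

```python
import numpy as np

def sentence_emission(sentence, z, emit, emit_bg, pi_w):
    """Eq(2): p(S_i | z_i), mixing the content topic z with the functional topic zB."""
    # sentence: list of word ids; emit[z, w] = p(w | z); emit_bg[w] = p(w | zB)
    prob = 1.0
    for w in sentence:
        prob *= pi_w * emit[z, w] + (1.0 - pi_w) * emit_bg[w]
    return prob

def chain_probability(sentences, topic_chain, trans, emit, emit_bg, pi_w, start_id):
    """Eq(1): p(S_0, ..., S_m, z | alpha, beta, pi) for one fixed topic assignment chain."""
    prob = 1.0
    prev = start_id  # dummy T-START topic that emits no words
    for sent, z in zip(sentences, topic_chain):
        prob *= trans[prev, z] * sentence_emission(sent, z, emit, emit_bg, pi_w)
        prev = z
    return prob
```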
49 From the definition of strTM, we can see that the document structure is characterized by a document-specific topic chain, and forcing the words in one sentence to share the same content topic ensures semantic cohesion of the mined topics. [sent-102, score-0.94]
50 Although we do not directly model the topic mixture for each document as the traditional topic models do, the word co-occurrence patterns within the same document are captured by topic propagation through the transitions. [sent-103, score-1.196]
51 This can be easily understood when we write down the posterior probability of the topic assignment for a particular sentence: [sent-104, score-0.388]
52 p(zi | S0, S1, ..., Sm) ∝ p(zi, S0, ..., Si) Σzi+1 p(Si+1, ..., Sm | zi+1) p(zi+1 | zi) (3). The first part of Eq(3) describes the recursive influence on the choice of topic for the ith sentence from its preceding sentences, while the second part captures how the succeeding sentences affect the current topic assignment. [sent-125, score-0.727]
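Since Eq(3) is the standard HMM forward-backward decomposition over the sentence-level topic chain, the sentence posteriors could be computed as in the sketch below (a simplified illustration without numerical scaling; sent_lik, trans and start_id are hypothetical names, with sent_lik[i, z] = p(S_i | z) from Eq(2)).

```python
import numpy as np

def sentence_topic_posteriors(sent_lik, trans, start_id):
    """Posterior p(z_i | S_0, ..., S_m) for every sentence, as in Eq(3).

    trans has shape (k+1, k): rows are previous topics (including T-START at
    index start_id), columns are current content topics.
    """
    m, k = sent_lik.shape
    fwd = np.zeros((m, k))   # fwd[i, z] ~ p(S_0..S_i, z_i = z): influence of preceding sentences
    bwd = np.ones((m, k))    # bwd[i, z] ~ p(S_{i+1}..S_m | z_i = z): influence of succeeding sentences
    fwd[0] = trans[start_id] * sent_lik[0]
    for i in range(1, m):
        fwd[i] = (fwd[i - 1] @ trans[:k]) * sent_lik[i]
    for i in range(m - 2, -1, -1):
        bwd[i] = trans[:k] @ (sent_lik[i + 1] * bwd[i + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)
```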
53 Intuitively, when we need to decide a sentence’s topic, we will look “backward” and “forward” over all the sentences in the document to determine a “suitable” one. [sent-126, score-0.135]
54 In addition, because of the first order Markov property, the local topical dependency gets more emphasis, i. [sent-127, score-0.217]
55 This result is reasonable, especially in a long document, since neighboring sentences are more likely to cover similar topics than two sentences far apart. [sent-131, score-0.216]
56 In the E-Step of EM algorithm, we need to collect the expected count of a sequential topic pair (z, z′) and a topic-word pair (z, w) to update the model parameters α and β in the M-Step. [sent-135, score-0.309]
57 In strTM, words in one sentence are independently drawn from either a specific content topic z or functional topic zB according to the mixture weight π. [sent-139, score-0.829]
58 Since we already observe topic z and sentence s co-occur with probability p(s, z|d, Θ), each word w in s should share the same probability of being observed with content topic z. [sent-142, score-0.765]
59 However, with the functional topic zB, the word w may also be drawn from zB. [sent-145, score-0.332]
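A sketch of how the expected counts in the E-Step could split each word between a content topic z and the functional topic zB, following the description above; p_sz, emit, emit_bg and pi_w are hypothetical names, with p_sz[z] = p(s, z | d, Θ).

```python
import numpy as np

def expected_counts_for_sentence(sentence, p_sz, emit, emit_bg, pi_w):
    """Accumulate expected (topic, word) counts for one sentence in the E-Step."""
    k, V = emit.shape
    counts = np.zeros((k, V))   # expected counts for the content topics
    counts_bg = np.zeros(V)     # expected counts for the functional topic zB
    for w in sentence:
        for z in range(k):
            top = pi_w * emit[z, w]
            bg = (1.0 - pi_w) * emit_bg[w]
            resp = top / (top + bg)                  # responsibility of the content topic
            counts[z, w] += p_sz[z] * resp           # word mass credited to content topic z
            counts_bg[w] += p_sz[z] * (1.0 - resp)   # remainder credited to zB
    return counts, counts_bg
```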
60 Thus we introduce prior distributions over the topic transition Mul(αz′) and emission probabilities Mul(βz), and use the Variational Bayesian (VB) (Jordan et al. [sent-149, score-0.512]
61 Since both the topic transition and emission probabilities are Multinomial distributions in strTM, the conjugate Dirichlet distribution is the natural choice for imposing a prior on them (Diaconis and Ylvisaker, 1979). [sent-151, score-0.534]
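As a generic illustration (not taken from the paper) of how a conjugate Dirichlet prior enters a Variational Bayesian update of a Multinomial parameter, one common form replaces the normalized expected counts with exp(E[log θ]) under the variational Dirichlet posterior; the exact update used by the authors may differ.

```python
import numpy as np
from scipy.special import digamma

def vb_multinomial_update(expected_counts, prior):
    """The variational posterior is Dirichlet(prior + expected_counts); the quantity
    plugged into the next E-Step is exp(E[log theta]) under that posterior."""
    post = prior + expected_counts
    return np.exp(digamma(post) - digamma(post.sum(axis=-1, keepdims=True)))
```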
62 The optimal setting of π for the proportion of content topics in the documents is empirically tuned by cross-validation over the training corpus to maximize the log-likelihood. [sent-156, score-0.25]
63 5 Experimental Results In this section, we demonstrate the effectiveness of strTM in identifying latent topical structures from documents, and quantitatively evaluate how the mined topic transitions can help the tasks of sentence annotation and sentence ordering. [sent-157, score-0.905]
64 1 Data Set We used two different data sets for evaluation: apartment advertisements (Ads) from (Grenager et al. [sent-159, score-0.139]
65 The Ads data consists of 8,767 advertisements for apartment rentals crawled from the Craigslist website. [sent-162, score-0.139]
66 The sentence-level annotations make it possible to quantitatively evaluate the discovered topic structures. [sent-171, score-0.338]
67 2 Topic Transition Modeling First, we qualitatively demonstrate the topical structure identified by strTM from Ads data1 . [sent-177, score-0.25]
68 Figure 2 shows the identified topics and the transitions among them. [sent-179, score-0.239]
69 From Figure 2, we can find some interesting topical structures. [sent-182, score-0.217]
70 (Footnote 1: Due to the page limit, we only show the result on the Ads data.) Figure 2: Estimated topics and topical transitions in Ads data set. [sent-184, score-0.456]
71 To further quantitatively evaluate the estimated topic transitions, we used the Kullback-Leibler (KL) divergence between the estimated transition matrix and the “ground-truth” transition matrix as the metric. [sent-189, score-0.777]
72 Each element of the “ground-truth” transition matrix was calculated by Eq(9), where c(z, z′) denotes how many sentences annotated by z′ immediately precede one annotated by z. [sent-190, score-0.176]
73 p̄(z|z′) = (c(z, z′) + δ) / (c(z′) + kδ) (9). The KL divergence between two transition matrices is defined in Eq(10). [sent-193, score-0.183]
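A sketch of Eq(9), together with one plausible reading of Eq(10), which is not reproduced in this excerpt (here taken as the average row-wise KL divergence between the two transition matrices); label_seqs, k and delta are hypothetical names.

```python
import numpy as np

def ground_truth_transitions(label_seqs, k, delta=1.0):
    """Eq(9): smoothed transition matrix from annotated sentence-label sequences."""
    c = np.zeros((k, k))                       # c[z_prev, z]: z_prev immediately precedes z
    for seq in label_seqs:
        for z_prev, z in zip(seq[:-1], seq[1:]):
            c[z_prev, z] += 1
    return (c + delta) / (c.sum(axis=1, keepdims=True) + k * delta)

def avg_kl(p, q):
    """Average row-wise KL divergence between two k-by-k transition matrices."""
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```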
74 Because none of these three methods can generate a topic transition matrix directly, we extended them a little bit to achieve this goal. [sent-199, score-0.454]
75 For pLSA, we used the document-level labels as priors for the topic distribution in each document, so that the estimated topics can be aligned with the predefined class labels. [sent-200, score-0.503]
76 After the topics were estimated, for each sentence we selected the topic that had the highest posterior probability to generate the sentence as its class label. [sent-201, score-0.601]
77 After the sentences were annotated with class labels, we estimated the topic transition matrices for all of these three methods by Eq(9). [sent-203, score-0.494]
78 The “ground-truth” transition matrix was estimated based on all the 302 annotated ads. [sent-206, score-0.185]
79 Table 2: Comparison of estimated topic transitions on the Ads data set. In Table 2, the p-value was calculated based on a t-test of the KL divergence between each topic’s transition probability against strTM. [sent-213, score-0.372]
80 This demonstrates that strTM captures the topical structure well, compared with other baseline methods. [sent-215, score-0.281]
81 3 Sentence Annotation In this section, we demonstrate how the identified topical structure can benefit the task of sentence annotation. [sent-217, score-0.297]
82 Sentence annotation is one step beyond the traditional document classification task: in sentence annotation, we want to predict the class label for each sentence in the document, and this will be helpful for other problems, including extractive summarization and passage retrieval. [sent-218, score-0.232]
83 One advantage of strTM is that it captures the topic transitions on the sentence level within documents, which provides a regularization over the adjacent predictions. [sent-221, score-0.505]
84 As for the Naive Bayes model, we used the EM algorithm with both labeled and unlabeled data for training (we used the same unigram features as in the topic models). [sent-224, score-0.154]
85 We set the number of topics in each topic model equal to the number of classes in each data set. [sent-227, score-0.463]
86 To tackle the situation where some sentences in the document are not strictly associated with any classes, we introduced an additional NULL content topic in all the topic models. [sent-228, score-0.809]
87 Compared with lPerm, which postulates a strong constraint over the topic assignment (sampling without replacement), strTM performed much better on both of these two data sets. [sent-267, score-0.344]
88 This result shows the advantage of explicitly modeling the topic transitions between neighbor sentences instead of using a binary relation to do so as in HTMM. [sent-270, score-0.449]
89 To further verify how the identified topical structure can help the sentence annotation task, we first randomly removed 100 annotated ads from the training corpus and used them as the testing set. [sent-271, score-0.482]
90 Then, we used the ground-truth topic transition matrix estimated from the training data to order those 100 ads according to their fitness scores under that matrix; the fitness score is defined in Eq(11). [sent-272, score-1.128]
91 fitness(d) = (1/|d|) Σi=0..|d| log p̄(ti | ti−1) (11), where ti is the class label for the ith sentence in document d, |d| is the number of sentences in document d, and p̄(ti | ti−1) is the transition probability estimated by Eq(9). [sent-274, score-0.266]
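A small sketch of the fitness score in Eq(11); the initial T-START transition is omitted here for simplicity, and trans is the smoothed label-transition matrix from Eq(9).

```python
import numpy as np

def fitness(labels, trans):
    """Eq(11): average log transition probability of a document's sentence-label sequence."""
    log_prob = sum(np.log(trans[t_prev, t]) for t_prev, t in zip(labels[:-1], labels[1:]))
    return log_prob / len(labels)
```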
92 This comparison confirms that when a testing document shares similar topic struc- ture as the training data, the topical transitions captured by strTM can help the sentence annotation task a lot. [sent-296, score-0.823]
93 4 Sentence Ordering In this experiment, we illustrate how the learned topical structure can help us better arrange sentences in a document. [sent-299, score-0.281]
94 In strTM, we evaluate all the possible orderings of the sentences in a given document and select the optimal one which gives the highest generation probability: σ̄(m) = argmaxσ(m) Σz p(Sσ[0], Sσ[1], ..., Sσ[m], z | α, β, π). [sent-302, score-0.158]
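The exhaustive evaluation of all orderings can be sketched as follows (factorial in the number of sentences, so only feasible for short documents); score_fn is a hypothetical callable returning Σz p(Sσ[0], ..., Sσ[m], z) for a candidate order, e.g. via the strTM forward pass.

```python
from itertools import permutations

def best_ordering(num_sentences, score_fn):
    """Evaluate every permutation of the sentence indices and keep the most probable one."""
    return max(permutations(range(num_sentences)), key=score_fn)
```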
95 To quantitatively evaluate the ordering result, we treated the original sentence order (OSO) as the perfect order and used Kendall’s τ(σ) (Lapata, 2006) as the evaluation metric to compute the divergence between the optimum ordering given by the model and OSO. [sent-306, score-0.275]
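For reference, Kendall's τ as used by Lapata (2006) can be computed from the number of inverted sentence pairs; a minimal sketch, where order[i] is the original index of the sentence placed at position i by the model:

```python
def kendall_tau(order):
    """tau = 1 - 2 * inversions / (n choose 2); tau = 1 means identical to the original order."""
    n = len(order)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if order[i] > order[j])
    return 1.0 - 4.0 * inversions / (n * (n - 1))
```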
96 Table 6: Sample results for document ordering by strTM. The experiment was performed on both data sets with 80% of the data for training and the other 20% for testing. [sent-335, score-0.156]
97 6 Conclusions In this paper, we proposed a new structural topic model (strTM) to identify the latent topical structure in documents. [sent-347, score-0.669]
98 Different from the traditional topic models, in which the exchangeability assumption precludes capturing the structure of a document, strTM captures the topical structure explicitly by introducing transitions among the topics. [sent-348, score-0.708]
99 We see that all methods performed better on the Ads data set than on the review data set, suggesting that the topical structures are more coherent in the Ads data than in the review data. [sent-349, score-0.359]
100 Topic segmentation with shared topic detection and alignment of multiple documents. [sent-541, score-0.333]
wordName wordTfidf (topN-words)
[('strtm', 0.678), ('topic', 0.309), ('zi', 0.226), ('topical', 0.217), ('htmm', 0.208), ('topics', 0.154), ('lperm', 0.152), ('ads', 0.151), ('transition', 0.114), ('plsa', 0.106), ('document', 0.104), ('eq', 0.092), ('zb', 0.085), ('transitions', 0.085), ('apartment', 0.083), ('transitional', 0.083), ('blei', 0.078), ('divergency', 0.069), ('mul', 0.067), ('latent', 0.063), ('emission', 0.058), ('advertisements', 0.056), ('content', 0.056), ('wij', 0.053), ('ordering', 0.052), ('functional', 0.051), ('si', 0.05), ('review', 0.048), ('structural', 0.047), ('sentence', 0.047), ('structures', 0.046), ('avgkl', 0.042), ('bedrooms', 0.042), ('laundry', 0.042), ('oso', 0.042), ('pets', 0.042), ('estimated', 0.04), ('documents', 0.04), ('zhai', 0.038), ('zhuang', 0.037), ('gruber', 0.037), ('em', 0.036), ('assignment', 0.035), ('annotation', 0.034), ('kl', 0.034), ('mixture', 0.034), ('structure', 0.033), ('adjacent', 0.033), ('movie', 0.033), ('fitness', 0.032), ('discourse', 0.032), ('prior', 0.031), ('matrix', 0.031), ('sentences', 0.031), ('please', 0.031), ('captures', 0.031), ('markov', 0.03), ('quantitatively', 0.029), ('sm', 0.028), ('barzilay', 0.028), ('mined', 0.028), ('share', 0.028), ('appointment', 0.028), ('bath', 0.028), ('famyly', 0.028), ('groundtruth', 0.028), ('lov', 0.028), ('postulating', 0.028), ('hmm', 0.027), ('captured', 0.027), ('treated', 0.026), ('cohesion', 0.026), ('dirichlet', 0.025), ('hofmann', 0.025), ('kendall', 0.025), ('jordan', 0.025), ('contact', 0.024), ('multinomial', 0.024), ('modeling', 0.024), ('diaconis', 0.024), ('precison', 0.024), ('segmentation', 0.024), ('grenager', 0.024), ('naive', 0.024), ('drawn', 0.023), ('bayes', 0.023), ('orderings', 0.023), ('nb', 0.023), ('conjugate', 0.022), ('deficiency', 0.022), ('zz', 0.022), ('sac', 0.022), ('ti', 0.022), ('skewed', 0.022), ('mei', 0.022), ('probability', 0.022), ('posterior', 0.022), ('discover', 0.021), ('reviews', 0.021), ('embedded', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
2 0.22125611 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
3 0.21508725 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
4 0.20298098 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
5 0.18975671 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
Author: Risa Kitajima ; Ichiro Kobayashi
Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.
6 0.17575611 178 acl-2011-Interactive Topic Modeling
7 0.16484419 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
8 0.15961318 14 acl-2011-A Hierarchical Model of Web Summaries
9 0.12623346 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
10 0.11377214 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
11 0.11085302 305 acl-2011-Topical Keyphrase Extraction from Twitter
12 0.098329388 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
13 0.097716726 204 acl-2011-Learning Word Vectors for Sentiment Analysis
14 0.094934143 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
15 0.091868959 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
16 0.090348646 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
17 0.084706917 101 acl-2011-Disentangling Chat with Local Coherence Models
18 0.076382644 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations
19 0.075865768 76 acl-2011-Comparative News Summarization Using Linear Programming
20 0.075313665 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
topicId topicWeight
[(0, 0.176), (1, 0.114), (2, -0.051), (3, 0.102), (4, -0.021), (5, -0.106), (6, -0.119), (7, 0.273), (8, -0.015), (9, 0.093), (10, -0.064), (11, 0.065), (12, 0.082), (13, -0.041), (14, 0.169), (15, 0.046), (16, -0.047), (17, -0.024), (18, -0.084), (19, 0.047), (20, -0.029), (21, 0.099), (22, -0.092), (23, 0.031), (24, 0.019), (25, 0.016), (26, -0.017), (27, 0.065), (28, -0.052), (29, 0.016), (30, 0.002), (31, 0.016), (32, 0.024), (33, 0.051), (34, -0.046), (35, -0.008), (36, 0.003), (37, 0.031), (38, -0.032), (39, 0.002), (40, 0.018), (41, 0.001), (42, -0.008), (43, -0.013), (44, -0.068), (45, -0.015), (46, -0.022), (47, -0.003), (48, -0.007), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97039711 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
2 0.91653156 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
3 0.8797735 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
4 0.83986455 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
5 0.82219058 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
6 0.80655503 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
7 0.80310762 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
8 0.79383558 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.76270652 305 acl-2011-Topical Keyphrase Extraction from Twitter
10 0.68636245 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
11 0.58701897 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
12 0.51210034 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
13 0.49263534 109 acl-2011-Effective Measures of Domain Similarity for Parsing
14 0.46798298 101 acl-2011-Disentangling Chat with Local Coherence Models
15 0.45372856 150 acl-2011-Hierarchical Text Classification with Latent Concepts
16 0.43051815 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
17 0.42088282 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues
18 0.41823441 342 acl-2011-full-for-print
19 0.4120914 76 acl-2011-Comparative News Summarization Using Linear Programming
20 0.40767887 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
topicId topicWeight
[(5, 0.025), (15, 0.267), (17, 0.048), (26, 0.025), (37, 0.096), (39, 0.065), (41, 0.075), (53, 0.01), (55, 0.035), (59, 0.035), (72, 0.03), (91, 0.033), (96, 0.152), (97, 0.012)]
simIndex simValue paperId paperTitle
1 0.94030786 138 acl-2011-French TimeBank: An ISO-TimeML Annotated Reference Corpus
Author: Andre Bittar ; Pascal Amsili ; Pascal Denis ; Laurence Danlos
Abstract: This article presents the main points in the creation of the French TimeBank (Bittar, 2010), a reference corpus annotated according to the ISO-TimeML standard for temporal annotation. A number of improvements were made to the markup language to deal with linguistic phenomena not yet covered by ISO-TimeML, including cross-language modifications and others specific to French. An automatic preannotation system was used to speed up the annotation process. A preliminary evaluation of the methodology adopted for this project yields positive results in terms of data quality and annotation time.
same-paper 2 0.77311206 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
3 0.75536132 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
Author: Lonneke van der Plas ; Paola Merlo ; James Henderson
Abstract: Broad-coverage semantic annotations for training statistical learners are only available for a handful of languages. Previous approaches to cross-lingual transfer of semantic annotations have addressed this problem with encouraging results on a small scale. In this paper, we scale up previous efforts by using an automatic approach to semantic annotation that does not rely on a semantic ontology for the target language. Moreover, we improve the quality of the transferred semantic annotations by using a joint syntacticsemantic parser that learns the correlations between syntax and semantics of the target language and smooths out the errors from automatic transfer. We reach a labelled F-measure for predicates and arguments of only 4% and 9% points, respectively, lower than the upper bound from manual annotations.
4 0.67709661 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
Author: Dong Wang ; Yang Liu
Abstract: This paper presents a pilot study of opinion summarization on conversations. We create a corpus containing extractive and abstractive summaries of speaker’s opinion towards a given topic using 88 telephone conversations. We adopt two methods to perform extractive summarization. The first one is a sentence-ranking method that linearly combines scores measured from different aspects including topic relevance, subjectivity, and sentence importance. The second one is a graph-based method, which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted based on dialogue structure. Our evaluation results show that both methods significantly outperform the baseline approach that extracts the longest utterances. In particular, we find that incorporating dialogue structure in the graph-based method contributes to the improved system performance.
5 0.62890899 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
Author: Nathan Bodenstab ; Aaron Dunlop ; Keith Hall ; Brian Roark
Abstract: Efficient decoding for syntactic parsing has become a necessary research area as statistical grammars grow in accuracy and size and as more NLP applications leverage syntactic analyses. We review prior methods for pruning and then present a new framework that unifies their strengths into a single approach. Using a log linear model, we learn the optimal beam-search pruning parameters for each CYK chart cell, effectively predicting the most promising areas of the model space to explore. We demonstrate that our method is faster than coarse-to-fine pruning, exemplified in both the Charniak and Berkeley parsers, by empirically comparing our parser to the Berkeley parser using the same grammar and under identical operating conditions.
6 0.62824762 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
7 0.62753522 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.62543589 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
9 0.62417543 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
10 0.62411535 28 acl-2011-A Statistical Tree Annotator and Its Applications
11 0.62269324 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
12 0.62261832 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
13 0.62238097 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
14 0.62183446 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
15 0.62129307 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
16 0.62080514 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
17 0.62061608 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
18 0.62035173 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
19 0.61977446 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
20 0.61975962 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding