
52 acl-2011-Automatic Labelling of Topic Models

Source: pdf

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a method for automatically labelling topics learned via LDA topic models. [sent-5, score-0.979]

2 We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. [sent-6, score-1.584]

3 We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. [sent-7, score-0.534]

4 One standard way of interpreting a topic is to use the marginal probabilities p(wi|tj) associated with each term wi in a given topic tj to extract the 10 terms with highest marginal probability. [sent-12, score-1.27]
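
A minimal sketch of this top-10 extraction, assuming the topic is available as a mapping from term to p(w|t). The function name and the probability values are hypothetical and are not taken from the paper.

```python
def top_terms(term_probs, n=10):
    """Return the n terms with the highest marginal probability p(w | t)."""
    # Sort by probability (descending); break ties alphabetically for determinism.
    ranked = sorted(term_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical p(w | t) values for a single topic.
topic = {"stock": 0.09, "market": 0.07, "fund": 0.04, "share": 0.03, "the": 0.001}
print(top_terms(topic, n=3))
```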

5 This results in term lists such as: stock market investor fund trading investment firm exchange companies share. (Footnote 1: Here and throughout the paper, we will represent a topic tj via its ranking of top-10 topic terms, based on p(wi|tj).) [sent-13, score-1.173]

6 The aim of this research is to automatically generate topic labels which explicitly identify the semantics of the topic, i. [sent-20, score-0.602]

7 We translate each topic label into features extracted from Wikipedia, lexical association with the topic terms in Wikipedia documents, and also lexical features for the component terms. [sent-24, score-1.247]

8 This is used as the basis of a support vector regression model, which ranks each topic label candidate. [sent-25, score-0.682]

9 2 Related Work Topics are conventionally interpreted via their top-N terms, ranked based on the marginal probability p(wi|tj) in that topic (Blei et al. [sent-27, score-0.563]

10 (2007), who proposed various unsupervised approaches for automatically labelling topics, based on: (1) generating label candidates by extracting either bigrams or noun chunks from the document collection; and (2) ranking the label candidates based on KL divergence with a given topic. [sent-36, score-1.186]

11 Their proposed methodology generates a generic list of label candidates for all topics using only the document collection. [sent-37, score-0.713]

12 (2009) proposed a method for labelling topics induced by a hierarchical topic model. [sent-42, score-0.979]

13 Their label candidate set is the Google Directory (gDir) hierarchy, and label selection takes the form of ontological alignment with gDir. [sent-43, score-0.65]

14 However, the method is only applicable to a hierarchical topic model and crucially relies on a pre-existing ontology and the class labels contained therein. [sent-45, score-0.607]

15 With topic modelling, however, the top-ranking topic terms tended to be associated and not lexically similar to one another. [sent-48, score-1.049]

16 It is thus highly questionable whether their method could be applied to topic models, but it would certainly be interesting to investigate whether our model could conversely be applied to the labelling of sets of near-synonyms. [sent-49, score-0.762]

17 (2010) proposed to approach topic labelling via best term selection, i. [sent-51, score-0.761]

18 selecting one of the top-10 topic terms to label the overall topic. [sent-53, score-0.763]

19 While it is often possible to label topics with topic terms (as is the case with the stock market topic above), there are also often cases where topic terms are not appropriate as labels. [sent-54, score-2.175]

20 While not directly related to topic labelling, Chang et al. [sent-57, score-0.484]

21 (2009) were one of the first to propose human labelling of topic models, in the form of synthetic intruder word and topic detection tasks. [sent-58, score-1.224]

22 In the intruder word task, they include a term w with low marginal probability p(w|t) for topic t into the top-N topic terms, and evaluate how well both humans and their model are able to detect the intruder. [sent-59, score-1.116]

23 The potential applications for automatic labelling of topics are many and varied. [sent-60, score-0.464]

24 , the topic model can be used as the basis for generating a two-dimensional representation of the document collection (Newman et al. [sent-63, score-0.609]

25 Regions where documents have a high marginal probability p(di|tj) of being associated with a given topic can be explicitly labelled with the learned label, rather than just presented as an unlabelled region, or presented with a dense “term cloud” from the original topic. [sent-65, score-0.559]

26 In topic model-based selectional preference learning (Ritter et al. [sent-66, score-0.515]

27 , 2010; Ó Séaghdha, 2010), the learned topics can be translated into semantic class labels (e. [sent-67, score-0.338]

28 In dynamic topic models tracking the diachronic evolution of topics in time-sequenced document collections (Blei and Lafferty, 2006), labels can greatly enhance the interpretation of what topics are “trending” at any given point in time. [sent-70, score-1.214]

29 3 Methodology The task of automatic labelling of topics is a natural progression from the best topic term selection task of Lau et al. [sent-71, score-1.031]

30 In that work, the authors use a reranking framework to produce a ranking of the top-10 topic terms based on how well each term in isolation represents a topic. [sent-73, score-0.71]

31 topic example, the term trading could be considered as a more representative term of the overall semantics of the topic than the top-ranked topic term stock. [sent-77, score-1.684]

32 In this paper, we propose a novel method for automatic topic labelling that first generates topic label candidates using English Wikipedia, and then ranks the candidates to select the best topic labels. [sent-82, score-2.217]

33 By making this assumption, the difficult task of generating potential topic labels is transposed to finding relevant Wikipedia articles, and using the title of each article as a topic label candidate. [sent-85, score-1.338]

34 We first use the top-10 topic terms (based on the marginal probabilities from the original topic model) to query Wikipedia, using: (a) Wikipedia’s native search API; and (b) a site-restricted Google search. [sent-86, score-1.1]

35 The combined set of top-8 article titles returned from the two search engines for each topic constitutes the initial set of primary candidates. [sent-87, score-0.676]
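
One way to picture the merge of the two result lists is sketched below. It assumes that "top-8" means the first eight titles from each engine and that duplicates are dropped while keeping first-seen order; both points are assumptions, and the function name and titles are hypothetical.

```python
def primary_candidates(results_a, results_b, k=8):
    """Merge the top-k titles of two ranked result lists, dropping duplicates."""
    seen = set()
    merged = []
    for title in results_a[:k] + results_b[:k]:
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged


# Hypothetical result lists from the two search back-ends.
wiki_hits = ["Stock market", "Stock exchange", "Investment"]
google_hits = ["Stock exchange", "Mutual fund"]
print(primary_candidates(wiki_hits, google_hits, k=2))
```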

36 In this way, an average of 30–40 secondary labels are produced for each topic based on noun chunk n-grams. [sent-91, score-0.717]

37 A good portion of these labels are commonly stopwords or unigrams that are only marginally related to the topic (an artifact of the n-gram generation process). [sent-92, score-0.602]

38 The final score for each secondary label candidate is calculated as the average RACO score with each of the primary label candidates. [sent-103, score-0.728]

39 1 and above are added to the label candidate set. [sent-105, score-0.394]
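
The filtering of secondary candidates can be sketched as follows. This excerpt does not define the RACO score, and the cut-off value is truncated in the sentence above, so both are left as parameters here. Dice's coefficient over hypothetical category sets is used purely as a stand-in pairwise score, and all names are illustrative.

```python
def dice(a, b):
    """Dice's coefficient between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def filter_secondary(secondary, primary, pair_score, threshold):
    """Keep secondary candidates whose mean score against all primary candidates reaches the threshold."""
    if not primary:
        return []
    kept = []
    for cand in secondary:
        avg = sum(pair_score(cand, p) for p in primary) / len(primary)
        if avg >= threshold:
            kept.append((cand, avg))
    return kept


# Hypothetical category sets; the real pairwise score is whatever RACO computes.
cats = {
    "Stock market": {"finance", "markets"},
    "Stock exchange": {"finance", "markets", "exchanges"},
    "market": {"finance", "markets"},
    "the": set(),
}
score = lambda a, b: dice(cats[a], cats[b])
print(filter_secondary(["market", "the"], ["Stock market", "Stock exchange"], score, threshold=0.5))
```

Passing the pairwise score in as a function keeps the averaging and thresholding logic separate from whichever overlap measure is actually used.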

40 Finally, we add the top-5 topic terms to the set of candidates, based on the marginals from the original topic model. [sent-106, score-1.049]

41 Doing this ensures that there are always label candidates for all topics (even if the Wikipedia searches fail), and also allows the possibility of labeling a topic using its own topic terms, which was demonstrated by Lau et al. [sent-107, score-1.571]

42 (2010) to be a baseline source of topic label candidates. [sent-108, score-0.682]

43 2 Candidate Ranking After obtaining the set of topic label candidates, the next step is to rank the candidates to find the best label for each topic. [sent-110, score-1.039]

44 1 Features A good label should be strongly associated with the topic terms. [sent-114, score-0.682]

45 To learn the association of a label candidate with the topic terms, we use several lexical association measures: pointwise mutual information (PMI), Student’s t-test, Dice’s coefficient, Pearson’s χ2 test, and the log likelihood ratio (Pecina, 2009). [sent-115, score-0.878]

46 To calculate the association measures, we parse the full collection of English Wikipedia articles using a sliding window of width 20, and obtain term frequencies for the label candidates and topic terms. [sent-118, score-0.992]

47 To measure the association between a label candidate and a list of topic terms, we average the scores of the top-10 topic terms. [sent-119, score-1.41]
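
As an illustration of one of these association measures, the sketch below estimates PMI from sliding-window co-occurrence counts and averages it over the topic terms. It assumes single-token labels, a toy token stream instead of Wikipedia, and that pairs which never co-occur are skipped in the average; these choices and all function names are assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def window_counts(tokens, width=20):
    """Count, per term and per unordered term pair, the sliding windows that contain them."""
    term_windows, pair_windows = Counter(), Counter()
    n_windows = max(len(tokens) - width + 1, 1)
    for start in range(n_windows):
        present = sorted(set(tokens[start:start + width]))
        term_windows.update(present)
        for i, a in enumerate(present):
            for b in present[i + 1:]:
                pair_windows[(a, b)] += 1
    return term_windows, pair_windows, n_windows


def pmi(a, b, term_windows, pair_windows, n_windows):
    """PMI(a, b) = log(p(a, b) / (p(a) p(b))); None when a and b never share a window."""
    joint = pair_windows[tuple(sorted((a, b)))]
    if joint == 0:
        return None
    return math.log(joint * n_windows / (term_windows[a] * term_windows[b]))


def mean_association(label, topic_terms, counts):
    """Average PMI between a label and the topic terms it co-occurs with."""
    scores = [pmi(label, t, *counts) for t in topic_terms if t != label]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

The other measures listed above (t-test, Dice, χ2, log likelihood ratio) could be derived from the same three counts.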

48 In addition to the association measures, we include two lexical properties of the candidate: the raw number of terms, and the relative number of terms in the label candidate that are top-10 topic terms. [sent-120, score-0.959]
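
The two lexical properties are simple to compute; a minimal sketch follows, assuming whitespace tokenisation and lower-casing (both assumptions) and a hypothetical function name.

```python
def lexical_features(label, top_topic_terms):
    """Return (number of terms in the label, fraction of those terms that are top topic terms)."""
    words = label.lower().split()
    topic_set = set(top_topic_terms)
    n_terms = len(words)
    overlap = sum(1 for w in words if w in topic_set)
    return n_terms, (overlap / n_terms if n_terms else 0.0)


print(lexical_features("Stock market crash", ["stock", "market", "fund"]))
```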

49 We also include a search engine score for each label candidate, which we generate by querying a local copy of English Wikipedia with the top-10 topic terms, using the Zettair search engine (based on BM25 term similarity). [sent-121, score-0.767]

50 2 Unsupervised and Supervised Ranking Each of the proposed features can be used as the basis for an unsupervised model for label candidate selection, by ranking the label candidates for a given topic and selecting the top-N. [sent-125, score-1.355]
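
The unsupervised variant amounts to sorting candidates by a single feature and keeping the top N. A minimal sketch, with hypothetical scores and function name:

```python
def rank_candidates(feature_scores, n):
    """Rank label candidates by one feature (higher is better) and return the top n."""
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(feature_scores, key=lambda cand: (-feature_scores[cand], cand))
    return ranked[:n]


# Hypothetical scores for one topic under a single association measure.
scores = {"stock market": 0.9, "stock": 0.4, "trading": 0.7}
print(rank_candidates(scores, n=2))
```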

51 Alternatively, they can be combined in a supervised model, by training over topics where we have gold-standard labelling of the label candidates. [sent-126, score-0.718]

52 The BOOKS topics, coming from public-domain out-of-copyright books (with publication dates spanning more than a century), relate to a wide range of topics including furniture, home decoration, religion and art, and have a more historic feel to them. [sent-135, score-0.397]

53 The NEWS topics reflect the types and range of subjects one might expect in news articles such as health, finance, entertainment, and politics. [sent-136, score-0.342]

54 The PUBMED topics frequently contain domain-specific terms and are sharply differentiated from the topics for the other corpora. [sent-137, score-0.573]

55 We are particularly interested in the performance of the method over PUBMED, as it is a highly specialised domain where we may expect lower coverage of appropriate topic labels within Wikipedia. [sent-138, score-0.636]

56 We took a standard approach to topic modelling each of the four document collections: we tokenised, lemmatised and stopped each document, and created a vocabulary of terms that occurred at least ten times. [sent-139, score-0.669]

57 From this processed data, we created a bag-of-words representation of each document, and learned topic models with T = 100 topics in each case. [sent-140, score-0.73]
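
The vocabulary and bag-of-words steps can be sketched as below, assuming documents are already tokenised, lemmatised and stopped, and that "occurred at least ten times" refers to total corpus frequency (an assumption). Function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, min_count=10):
    """Keep terms whose total corpus frequency is at least min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {term for term, c in counts.items() if c >= min_count}


def bag_of_words(doc, vocab):
    """Term-frequency representation of one document, restricted to the vocabulary."""
    return Counter(tok for tok in doc if tok in vocab)
```

The resulting per-document counts are the usual input to an LDA implementation.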

58 (2010b) to calculate the average PMI-score for each topic, and filtered all topics that had an average PMI-score lower than 0. [sent-142, score-0.368]

59 We additionally filtered any topics where less than 5 of the top-10 topic terms are default nominal in Wikipedia. [sent-144, score-0.837]

60 The filtering criteria resulted in 45 topics for BLOGS, 38 topics for BOOKS, 60 topics for NEWS, and 85 topics for PUBMED. [sent-145, score-0.984]

61 Manual inspection of the discarded topics indicated that they were predominantly hard-to-label junk topics or mixed topics, with limited utility for document/term clustering. [sent-146, score-0.492]

62 Applying our label candidate generation methodology to these 228 topics produced approximately 6000 labels, an average of 27 labels per topic. [sent-147, score-0.872]

63 Human Intelligence Task (HIT); it contains a topic followed by 10 suggested topic labels, which are to be rated. [sent-152, score-0.968]

64 In our annotation task, each topic was presented in the form of its top-10 terms, followed by 10 suggested labels for the topic. [sent-157, score-0.576]

65 1: Label is semantically related to the topic, but would not make a good topic label. [sent-162, score-0.484]

66 Table 1: A sample of topics and topic labels, along with the average rating for each label candidate. [...] applied a few heuristics to automatically detect these workers. [sent-168, score-1.309]

67 Additionally, we inserted a small number of stopwords as label candidates in each HIT and recorded workers who gave high ratings to these stopwords. [sent-169, score-0.445]

68 Each label candidate was rated in this way by at least 10 annotators, and ratings from annotators who passed the filter were combined by averaging them. [sent-171, score-0.492]
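
A rough sketch of the stopword-based worker filter and the rating average is given below. The cut-off for what counts as a "high" stopword rating is not stated in this excerpt, so `max_stopword_rating` is a hypothetical parameter, as are the function name and the data layout.

```python
from collections import defaultdict


def average_ratings(ratings, stopword_labels, max_stopword_rating=1):
    """Average ratings per label, ignoring workers who rated a stopword label too highly.

    ratings: iterable of (worker, label, rating) triples.
    """
    flagged = {
        worker
        for worker, label, rating in ratings
        if label in stopword_labels and rating > max_stopword_rating
    }
    per_label = defaultdict(list)
    for worker, label, rating in ratings:
        if worker not in flagged and label not in stopword_labels:
            per_label[label].append(rating)
    return {label: sum(rs) / len(rs) for label, rs in per_label.items()}
```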

69 A sample of topics, label candidates, and the average rating is presented in Table 1. [sent-172, score-0.383]

70 5 Experiments In this section we present our experimental results for the topic labelling task, based on both the unsupervised and supervised methods, and the methodology of Mei et al. [sent-174, score-0.79]

71 Top-1 average rating is the average annotator rating given to the top-ranked system label, and has a maximum value of 3 (where annotators unanimously rated all top-ranked system labels with a 3). [sent-178, score-0.528]

72 nDCG is a normalised score, and indicates how close the candidate label ranking is to the optimal ranking within the set of annotated candidates, noting that an nDCG-N score of 1 tells us nothing about absolute values of the candidates. [sent-189, score-0.594]
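
The two evaluation measures can be sketched as follows. This excerpt does not give the DCG formula, so the code uses one common formulation (rating divided by log2(rank + 1)); the paper's exact discounting may differ. Function names are hypothetical.

```python
import math


def dcg(ratings, n):
    """Discounted cumulative gain over the first n ratings, in ranked order."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings[:n], start=1))


def ndcg(system_ratings, n):
    """DCG of the system ranking divided by the DCG of the best possible ordering."""
    ideal_dcg = dcg(sorted(system_ratings, reverse=True), n)
    return dcg(system_ratings, n) / ideal_dcg if ideal_dcg > 0 else 0.0


def top1_average_rating(ranked_ratings_per_topic):
    """Mean human rating of the top-ranked label across topics."""
    return sum(r[0] for r in ranked_ratings_per_topic) / len(ranked_ratings_per_topic)
```

Because nDCG is normalised per topic, a perfectly ordered list of poorly rated candidates still scores 1, which is why it is reported alongside the top-1 average rating.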

73 Note that conventional precision- and recall-based evaluation is not appropriate for our task, as each label candidate has a real-valued rating. [sent-191, score-0.394]

74 As a baseline for the task, we use the unsupervised label candidate ranking method based on Pearson’s χ2 test, as it was overwhelmingly found to be the pick of the features for candidate ranking. [sent-192, score-0.741]

75 2 Results for the Supervised Method For the supervised model, we present both in-domain results based on 10-fold cross-validation, and cross-domain results where we learn a model from the ratings for the topic model from a given domain, and apply it to a second domain. [sent-194, score-0.598]

76 We present the results for the supervised method in Table 2, including the unsupervised baseline and an upper bound estimate for comparison purposes. [sent-200, score-0.338]

77 The upper bound for top-1 average rating is thus the highest average human rating of all label candidates for a given topic, while the upper bound for the nDCG measures will always be 1. [sent-342, score-1.196]

78 Overall though, the results are very encouraging, and point to the plausibility of using labelled topic models from independent domains to learn the best topic labels for a new domain. [sent-346, score-1.116]

79 Looking to the top-1 average score results over the different candidate sets, we observe first that the upper bound for the combined candidate set (“All”) is higher than the scores for the candidate subsets in all cases, underlining the complementarity of the different candidate sets. [sent-353, score-1.08]

80 We also observe that the top-5 topic term candidate set is the lowest performer out of the three subsets across all four corpora, in terms of both upper bound and the results for the supervised method. [sent-354, score-1.124]

81 This reinforces our comments about the inferiority of the topic word selection method of Lau et al. [sent-355, score-0.539]

82 For NEWS and PUBMED, there is a noticeable difference between the results of the supervised method over the full candidate set and each of the candidate subsets. [sent-357, score-0.479]

83 In contrast, for BOOKS and BLOGS, the results for the primary candidate subset are at times actually higher than those over the full candidate set in most cases (but not for the upper bound). [sent-358, score-0.547]

84 This is due to the larger search space in the full candidate set, and the higher median quality of candidates in the primary candidate set. [sent-359, score-0.592]

85 1 Candidate Ranking First, we compare the candidate ranking methodology of our method with that of MSZ, using the label candidates extracted by the MSZ method. [sent-366, score-0.702]

86 We then ranked the bigrams for each topic using the Student’s t-test. [sent-368, score-0.568]

87 We included the top-5 labels generated for each topic by the MSZ method in our Mechanical Turk annotation task, and use the annotations to directly compare the two methods. [sent-369, score-0.635]

88 To measure the performance of candidate ranking between our supervised method and MSZ’s, we re-rank the top-5 labels extracted by MSZ using our SVR methodology (in-domain) and compare the top-1 average rating and nDCG scores. [sent-370, score-0.678]

89 In our method, on the other hand, we (efficiently) access the much larger and variable n-gram length set of English Wikipedia article titles, in addition to the top-5 topic terms. [sent-381, score-0.564]

90 To better understand the differences in label candidate sets, and the relative coverage of the full label candidate set in each case, we conducted another survey where human users were asked to suggest one topic label for each topic presented. [sent-382, score-1.954]

91 With that said, the PUBMED topics are still a subject of interest, as these topics often contain biomedical terms which could be difficult for the general populace to annotate. [sent-429, score-0.573]

92 As the number of annotators per topic and the number of annotations per annotator vary, there is no immediate way to calculate the inter-annotator agreement. [sent-430, score-0.578]

93 Instead, we calculated the MAE score for each candidate, which is an average of the absolute difference between an annotator’s rating and the average rating of a candidate, summed across all candidates to get the MAE score for a given corpus. [sent-431, score-0.529]
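
A minimal sketch of the MAE computation described above. The per-candidate part follows the sentence directly; averaging the per-candidate values into one corpus score is an assumption based on the "Average MAE" wording of the Table 4 caption. Function names are hypothetical.

```python
def candidate_mae(ratings):
    """Mean absolute difference between each annotator's rating and the candidate's mean rating."""
    mean = sum(ratings) / len(ratings)
    return sum(abs(r - mean) for r in ratings) / len(ratings)


def corpus_mae(ratings_by_candidate):
    """Average of the per-candidate MAE values over all candidates in a corpus."""
    per_candidate = [candidate_mae(r) for r in ratings_by_candidate.values()]
    return sum(per_candidate) / len(per_candidate)
```

A lower value indicates that annotators' ratings cluster more tightly around the mean for that corpus.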

94 Table 4: Average MAE score for label candidate rating over each corpus. [...] agreement, almost certainly because of the greater immediacy of the topics, covering everyday areas such as lifestyle and politics. [sent-438, score-0.56]

95 BOOKS topics are occasionally difficult to label due to the breadth of the domain; e. [sent-439, score-0.444]

96 consider a topic containing terms extracted from Shakespeare sonnets. [sent-441, score-0.565]

97 7 Conclusion This paper has presented the task of topic labelling, that is the generation and scoring of labels for a given topic. [sent-442, score-0.576]

98 We generate a set of label candidates from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and also a filtered set of sub-phrases extracted from the Wikipedia article titles. [sent-443, score-1.573]

99 We rank the label candidates using a combination of association measures, lexical features and an Information Retrieval feature. [sent-444, score-0.357]

100 Visualizing document collections and search results using topic mapping. [sent-579, score-0.601]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('topic', 0.484), ('msz', 0.285), ('pubmed', 0.251), ('topics', 0.246), ('labelling', 0.218), ('label', 0.198), ('candidate', 0.196), ('wikipedia', 0.168), ('candidates', 0.159), ('lau', 0.139), ('rating', 0.137), ('books', 0.125), ('svr', 0.117), ('upper', 0.114), ('blogs', 0.104), ('bound', 0.103), ('raco', 0.095), ('ndcg', 0.093), ('labels', 0.092), ('ranking', 0.086), ('terms', 0.081), ('article', 0.08), ('document', 0.078), ('ccrro', 0.076), ('mae', 0.076), ('mmaai', 0.076), ('titles', 0.071), ('newman', 0.067), ('grieser', 0.067), ('tj', 0.06), ('stock', 0.059), ('term', 0.059), ('ratings', 0.058), ('market', 0.058), ('supervised', 0.056), ('bigrams', 0.056), ('trading', 0.055), ('news', 0.051), ('marginal', 0.051), ('mei', 0.05), ('average', 0.048), ('hit', 0.048), ('secondary', 0.047), ('collection', 0.047), ('constitution', 0.046), ('chunk', 0.046), ('articles', 0.045), ('blei', 0.045), ('primary', 0.041), ('annotators', 0.04), ('collections', 0.039), ('bbolookgss', 0.038), ('gothic', 0.038), ('intruder', 0.038), ('karimi', 0.038), ('magatti', 0.038), ('opennlp', 0.037), ('measures', 0.035), ('croft', 0.034), ('unsupervised', 0.034), ('dated', 0.034), ('jarvelin', 0.034), ('ontological', 0.034), ('museum', 0.034), ('outlinks', 0.034), ('suitability', 0.034), ('baldwin', 0.033), ('methodology', 0.032), ('domains', 0.032), ('method', 0.031), ('selectional', 0.031), ('subsets', 0.031), ('dice', 0.031), ('immune', 0.031), ('nicta', 0.031), ('workers', 0.03), ('interpretation', 0.029), ('certainly', 0.029), ('gauge', 0.029), ('minnen', 0.029), ('specialised', 0.029), ('blog', 0.029), ('annotations', 0.028), ('category', 0.028), ('ranked', 0.028), ('normalised', 0.028), ('brody', 0.028), ('turk', 0.027), ('annotator', 0.026), ('modelling', 0.026), ('dates', 0.026), ('australian', 0.026), ('marginally', 0.026), ('student', 0.026), ('generate', 0.026), ('filtered', 0.026), ('mechanical', 0.025), ('selection', 0.024), ('labelled', 0.024), ('united', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

2 0.32273474 117 acl-2011-Entity Set Expansion using Topic information

Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui

Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.

3 0.30599836 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens

Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported.

4 0.25331676 178 acl-2011-Interactive Topic Modeling

Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff

Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.

5 0.23722632 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

Author: Risa Kitajima ; Ichiro Kobayashi

Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.

6 0.22125611 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis

7 0.19921656 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

8 0.19038619 14 acl-2011-A Hierarchical Model of Web Summaries

9 0.16930312 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

10 0.14315933 305 acl-2011-Topical Keyphrase Extraction from Twitter

11 0.13602002 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

12 0.12798584 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

13 0.1275381 109 acl-2011-Effective Measures of Domain Similarity for Parsing

14 0.11966535 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

15 0.1110047 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

16 0.09990295 285 acl-2011-Simple supervised document geolocation with geodesic grids

17 0.095411763 204 acl-2011-Learning Word Vectors for Sentiment Analysis

18 0.094219409 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

19 0.093080372 169 acl-2011-Improving Question Recommendation by Exploiting Information Need

20 0.092555359 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.233), (1, 0.159), (2, -0.092), (3, 0.169), (4, -0.052), (5, -0.15), (6, -0.12), (7, 0.252), (8, -0.041), (9, 0.02), (10, -0.135), (11, 0.052), (12, 0.13), (13, -0.075), (14, 0.254), (15, 0.076), (16, 0.04), (17, -0.133), (18, -0.086), (19, -0.041), (20, -0.015), (21, 0.085), (22, -0.103), (23, -0.041), (24, 0.054), (25, 0.038), (26, -0.021), (27, 0.033), (28, -0.091), (29, 0.049), (30, 0.021), (31, -0.015), (32, 0.024), (33, 0.058), (34, -0.047), (35, 0.044), (36, -0.006), (37, 0.002), (38, -0.046), (39, -0.008), (40, 0.085), (41, 0.072), (42, -0.04), (43, 0.029), (44, 0.042), (45, 0.044), (46, 0.015), (47, 0.012), (48, 0.002), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98838151 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

2 0.89751178 178 acl-2011-Interactive Topic Modeling

Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff

3 0.88129425 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis

Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai

Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering.

4 0.8623358 117 acl-2011-Entity Set Expansion using Topic information

Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui

5 0.83252621 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens

6 0.79158145 305 acl-2011-Topical Keyphrase Extraction from Twitter

7 0.78209126 14 acl-2011-A Hierarchical Model of Web Summaries

8 0.76406914 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

9 0.72415495 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

10 0.70421201 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

11 0.61967456 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

12 0.53454399 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

13 0.52957815 109 acl-2011-Effective Measures of Domain Similarity for Parsing

14 0.51950651 285 acl-2011-Simple supervised document geolocation with geodesic grids

15 0.47051471 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

16 0.42593691 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

17 0.4228231 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

18 0.38107848 150 acl-2011-Hierarchical Text Classification with Latent Concepts

19 0.37943032 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

20 0.37686291 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.022), (17, 0.049), (26, 0.02), (37, 0.083), (39, 0.456), (41, 0.033), (55, 0.023), (59, 0.037), (72, 0.032), (91, 0.022), (96, 0.123), (97, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94692373 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Goldberg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

2 0.94455349 1 acl-2011-(11-06-spirl)

Author: (hal)

Abstract: unkown-abstract

same-paper 3 0.92649561 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

4 0.92377645 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

5 0.86837828 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.

6 0.82139075 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

7 0.80451083 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

8 0.76801646 192 acl-2011-Language-Independent Parsing with Empty Elements

9 0.68208659 182 acl-2011-Joint Annotation of Search Queries

10 0.634673 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

11 0.61375505 282 acl-2011-Shift-Reduce CCG Parsing

12 0.61004126 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

13 0.60682487 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

14 0.60092008 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

15 0.59813714 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

16 0.59547389 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

17 0.59491462 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

18 0.59367687 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

19 0.58432299 238 acl-2011-P11-2093 k2opt.pdf

20 0.58261198 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing