acl acl2011 acl2011-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. [sent-7, score-0.364]
2 However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. [sent-8, score-0.391]
3 In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. [sent-9, score-0.605]
4 1 Introduction Probabilistic topic models, as exemplified by probabilistic latent semantic indexing (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al. [sent-11, score-0.454]
5 For text, one of the few real-world applications of topic models is corpus exploration. [sent-16, score-0.364]
6 Unannotated, noisy, and ever-growing corpora are the norm rather than the exception, and topic models offer a way to quickly get the gist of a large corpus. [sent-17, score-0.364]
7 Contrary to the impression given by the tables shown in topic modeling papers, topics discovered by topic modeling don’t always make sense to ostensible end users. [sent-19, score-0.992]
8 Part of the problem is that the objective function of topic models doesn’t always correlate with human judgements (Chang et al. [sent-20, score-0.364]
9 Another issue is that topic models with their bagof-words vision of the world simply lack the necessary information to create the topics as end-users expect. [sent-22, score-0.521]
10 There has been a thriving cottage industry adding more and more information to topic models to correct these shortcomings; either by modeling perspective (Paul and Girju, 2010; Lin et al. [sent-23, score-0.419]
11 Similarly, there has been an effort to inject human knowledge into topic models (Boyd-Graber et al. [sent-28, score-0.364]
12 They don’t help a frustrated consumer of topic models staring at a collection of topics that don’t make sense. [sent-33, score-0.521]
13 In this paper, we propose interactive topic modeling (ITM), an in situ method for incorporating human knowledge into topic models. [sent-34, score-0.82]
14 2 Putting Knowledge in Topic Models At a high level, topic models such as LDA take as input a number of topics K and a corpus. [sent-45, score-0.521]
15 As output, a topic model discovers K distributions over words (the namesake topics) and associations between documents and topics. [sent-46, score-0.582]
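To make those inputs and outputs concrete, here is a minimal Python sketch using gensim on a toy corpus; the library and the tiny documents are illustrative stand-ins, not the paper's setup:

```python
# Toy illustration: K and a corpus go in; per-topic word distributions and
# per-document topic mixtures come out.
from gensim import corpora, models

docs = [["dog", "bark", "leash"], ["plant", "factory", "production"], ["dog", "leash", "tree"]]
dictionary = corpora.Dictionary(docs)                # word <-> integer id mapping
corpus = [dictionary.doc2bow(d) for d in docs]       # bag-of-words counts per document

K = 2                                                # number of topics, supplied by the user
lda = models.LdaModel(corpus, num_topics=K, id2word=dictionary, passes=10, random_state=0)

for k in range(K):
    print(k, lda.show_topic(k, topn=5))              # the "namesake topics": distributions over words
print(lda.get_document_topics(corpus[0]))            # association between a document and the topics
```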
16 When presented with poor topics learned from data, users can offer a number of complaints: these documents should have similar topics but don’t (Daumé III, 2009); this topic should have syntactic coherence (Gruber et al. [sent-49, score-0.819]
17 , 2007; Boyd-Graber and Blei, 2008); this topic doesn’t make any sense at all (Newman et al. [sent-50, score-0.364]
18 , 2010); this topic shouldn’t be associated with this document but is (Ramage et al. [sent-51, score-0.416]
19 , 2009); these words shouldn’t be in the same topic but are (Andrzejewski et al. [sent-52, score-0.364]
20 , 2009); or these words should be in the same topic but aren’t (Andrzejewski et al. [sent-53, score-0.364]
21 After this constraint is added, the probabilities of “plant” and “factory” in each topic are likely to both be high or both be low. [sent-58, score-0.571]
22 It’s unlikely for “plant” to have high probability in a topic and “factory” to have a low probability. [sent-59, score-0.364]
23 In the next section, we demonstrate how such constraints can be built into a model and how they can even be added while inference is underway. [sent-60, score-0.322]
24 In this paper, we view constraints as transitive; if “plant” is in a constraint with “factory” and “factory” is in a constraint with “production,” then “plant” is in a constraint with “production. [sent-61, score-0.835]
25 Figure 1: How adding constraints such as {plant, factory} (left) creates new topic priors over the vocabulary {bark, dog, leash, plant, factory, tree} (right). [sent-64, score-1.748]
26 After the {plant, factory} constraint is added, it is now highly unlikely for a topic drawn from the distribution to have a high probability for “plant” and a low probability for “factory” or vice versa. [sent-66, score-0.613]
27 Constraints can be added to vanilla LDA by replacing the multinomial distribution over words for each topic with a collection of tree-structured multinomial distributions drawn from a prior as depicted in Figure 1. [sent-78, score-0.613]
28 Each topic has a top-level distribution over words and constraints, and each constraint in each topic has a second-level distribution over the words in the constraint. [sent-82, score-1.019]
29 The top-level distribution encodes which constraints (and unconstrained words) to include; the lower-level distribution forces the probabilities to be correlated for each of the constraints. [sent-84, score-0.365]
30 In LDA, a document’s token is produced in the generative process by choosing a topic z and sampling a word from the multinomial distribution φz of topic z. [sent-85, score-0.882]
31 choose a topic assignment zd,n ∼ Mult(θd), and then ii. [sent-104, score-0.403]
32 otherwise if we chose a constraint index ld,n, emit a word wd,n from the constraint’s distribution over words in topic zd,n : wd,n ∼ Mult(πzd,n,ld,n). [sent-107, score-0.642]
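The two-level draw can be sketched directly in NumPy; the vocabulary, hyperparameter values, and helper names below are illustrative, and the actual model is estimated with collapsed Gibbs sampling rather than by explicit draws. Because η is set much larger than β, the mass given to a constraint is split almost evenly over its words, which is what forces “plant” and “factory” to rise or fall together:

```python
import numpy as np

vocab = ["bark", "dog", "leash", "plant", "factory", "tree"]
constraints = [("plant", "factory")]          # one user-added constraint
beta, eta = 0.01, 100.0                       # eta >> beta keeps constrained words correlated

def draw_topic_word_dist(rng):
    """Draw one topic's word distribution from the tree-structured prior (cf. Figure 1, right)."""
    free = [w for w in vocab if all(w not in c for c in constraints)]
    # Top level: unconstrained words plus one node per constraint.
    top = rng.dirichlet([beta] * (len(free) + len(constraints)))
    phi = dict(zip(free, top[:len(free)]))
    # Second level: each constraint's mass is split (nearly evenly, eta is large) over its words.
    for c, mass in zip(constraints, top[len(free):]):
        inner = rng.dirichlet([eta] * len(c))
        phi.update({w: mass * p for w, p in zip(c, inner)})
    return phi

def emit_token(rng, theta_d, phis):
    """Generate one token: z ~ Mult(theta_d), then a word from topic z's tree."""
    z = rng.choice(len(theta_d), p=theta_d)
    words = list(phis[z])
    probs = np.array([phis[z][w] for w in words])
    return z, rng.choice(words, p=probs / probs.sum())

rng = np.random.default_rng(0)
phis = [draw_topic_word_dist(rng) for _ in range(2)]
print(emit_token(rng, [0.7, 0.3], phis))
```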
33 1 Gibbs Sampling for Topic Models In topic modeling, collapsed Gibbs sampling (Griffiths and Steyvers, 2004) is a standard procedure for obtaining a Markov chain over the latent variables in the model. [sent-110, score-0.517]
34 Given M documents, the state of a Gibbs sampler for LDA consists of topic assignments for each token in the corpus and is represented as Z = {z1,1 . [sent-112, score-0.631]
35 In each iteration, every token’s topic assignment zd,n is resampled based on topic assignments for all the tokens except for zd,n. [sent-119, score-0.862]
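A minimal sketch of that resampling step for unconstrained LDA follows; the count-table names (n_dk, n_kw, n_k) are my own, and the constrained model's conditional additionally walks the constraint tree when computing the word term:

```python
import numpy as np

def resample_token(rng, d, n, Z, W, n_dk, n_kw, n_k, alpha, beta):
    """Resample z_{d,n} conditioned on every other assignment (collapsed Gibbs for plain LDA)."""
    V = n_kw.shape[1]
    old, w = Z[d][n], W[d][n]
    # Remove the token's current assignment from the sufficient statistics.
    n_dk[d, old] -= 1; n_kw[old, w] -= 1; n_k[old] -= 1
    # P(z = k | rest) is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    new = rng.choice(len(p), p=p / p.sum())
    # Add the token back under its new assignment.
    Z[d][n] = new
    n_dk[d, new] += 1; n_kw[new, w] += 1; n_k[new] += 1
    return new
```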
36 In order to make the constraints effective, we set the constraint word-distribution hyperparameter η to be much larger than the hyperparameter for the distribution over constraints and vocabulary β. [sent-131, score-0.677]
37 Normally, estimating hyperparameters is important for topic modeling (Wallach et al. [sent-133, score-0.439]
38 However, in ITM, sampling hyperparameters often (but not always) undoes the constraints (by making η comparable to β), so we keep the hyperparameters fixed. [sent-135, score-0.31]
39 4 Interactively adding constraints For a static model, inference in ITM is the same as in previous models (Andrzejewski et al. [sent-136, score-0.295]
40 In this section, we detail how interactively changing constraints can be accommodated in ITM, smoothly transitioning from unconstrained LDA (n. [sent-138, score-0.34]
41 A central tool that we will use is the strategic unassignment of states, which we call ablation (distinct from feature ablation in supervised learning). [sent-143, score-0.418]
42 As described in the previous section, a sampler stores the topic assignment of each token. [sent-144, score-0.462]
43 In the implementation of a Gibbs sampler, unassignment is done by setting a token’s topic assignment to an invalid topic (e. [sent-145, score-0.813]
44 The constraints created by users implicitly signal that words in constraints don’t belong in a given topic. [sent-148, score-0.536]
45 Instead, we change the underlying model, using the current topic assignments as a starting position for a new Markov chain with some states strategically unassigned. [sent-153, score-0.512]
46 How much of the existing topic assignments we use leads to four different options, which are illustrated in Figure 2. [sent-154, score-0.459]
47 The state is represented by showing the current topic assignment after each word (e. [sent-156, score-0.453]
48 “leash” in the first document has topic 3, while “forest” in the third document has topic 1). [sent-158, score-0.832]
49 Unassigned words are given the new topic assignment -1 and are highlighted in red. [sent-160, score-0.403]
50 Once the topic assignments of all states are revoked, the counts for T, P and W (as described in Section 3. [sent-163, score-0.459]
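A sketch of the unassignment operation itself (the paper's Algorithm 1): the token's counts are removed from the tables and -1 marks the invalid topic, so a later sweep must sample a fresh assignment for it. Table and variable names follow the Gibbs sketch above and are assumptions, not the paper's code:

```python
def unassign(d, n, Z, W, n_dk, n_kw, n_k):
    """Revoke one token's topic assignment and remove it from the count tables (ablation)."""
    z = Z[d][n]
    if z < 0:                          # already unassigned
        return
    w = W[d][n]
    n_dk[d, z] -= 1; n_kw[z, w] -= 1; n_k[z] -= 1
    Z[d][n] = -1                       # invalid topic: the next Gibbs sweep reassigns this token
```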
51 Doc Because topic models treat the document context as exchangeable, a document is a natural context for partial state ablation. [sent-165, score-0.518]
52 Thus if a user adds a set of words S to constraints, then we have reason to suspect that all documents containing any one of S may have incorrect topic assignments. [sent-166, score-0.44]
53 Term Another option is to perform ablation only on the topic assignments of tokens whose words have been added to a constraint. [sent-171, score-0.7]
54 This applies the unassignment operation (Algorithm 1) only to tokens whose corresponding word appears in added constraints (i. [sent-172, score-0.315]
55 This makes it less likely that other tokens in similar contexts will follow the words explicitly included in the constraints to new topic assignments. [sent-175, score-0.578]
56 None The final option is to move words into constraints but keep the topic assignments fixed. [sent-176, score-0.673]
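The four options then differ only in which tokens receive that unassignment; roughly, as in this sketch (strategy names from the paper, helper signature assumed from the sketch above):

```python
def ablate(strategy, docs, constraint_words, unassign_token):
    """Apply one ablation strategy; unassign_token(d, n) revokes a single token's assignment."""
    for d, doc in enumerate(docs):
        doc_contains_constraint = any(w in constraint_words for w in doc)
        for n, w in enumerate(doc):
            if strategy == "all":                                  # All: restart from scratch
                unassign_token(d, n)
            elif strategy == "doc" and doc_contains_constraint:    # Doc: every token of affected docs
                unassign_token(d, n)
            elif strategy == "term" and w in constraint_words:     # Term: only the constraint tokens
                unassign_token(d, n)
            # "none": keep every existing assignment unchanged
```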
57 In practice, however, this strategy is less interactive, as users don’t feel that their constraints are actually incorporated in the model, and inertia can keep the chain from reflecting the constraints. [sent-179, score-0.462]
58 How many additional iterations are needed is examined later (Figure 4). This assumes that there is only one possible path in the constraint tree that can generate a word; in other words, this assumes that constraints are transitive, as discussed at the end of Section 2. [sent-181, score-0.586]
59 In the more general case, when words lack a unique path in the constraint tree, an additional latent variable specifies which of the possible paths in the constraint tree produced the word; this would have to be sampled. [sent-182, score-0.511]
60 Topic 20 seems to be about the Soviet Union, with topic 1 about the post-Soviet years. [sent-189, score-0.364]
61 Running inference forward 100 iterations with the Doc ablation strategy yields the topics in Table 1 (right). [sent-191, score-0.55]
62 This combination also pulled in other relevant words that were not near the top of either topic before: “moscow” and “relations. [sent-193, score-0.364]
63 The categories grow more and more specific during the session as the simulated users add more constraint words. [sent-204, score-0.362]
64 To test the ability of ITM to discover relevant subdivisions in a corpus, we use a dataset with predefined, intrinsic labels and assess how well the discovered latent topic structure can reproduce the corpus’s inherent structure. [sent-205, score-0.462]
65 The smallest class had 21 words after removing duplicates. Our goal is to understand the phenomena of ITM, not classification, so these classification results are well below the state of the art. [sent-223, score-0.293]
66 However, adding interactively selected topics to the state-of-the-art features (tf-idf unigrams) gives a relative error reduction of 5.1%, while just adding topics from vanilla LDA gives a relative error reduction of between 1% and 2%. [sent-224, score-0.338] [sent-225, score-0.299]
68 We simulate a user’s ITM session by adding a word to each of the 20 constraints until each of the constraints has 21 words. [sent-248, score-0.456]
69 In each round a new constraint is added corresponding to the newsgroup labels. [sent-251, score-0.318]
70 Next, we perform one of the strategies for state ablation, add additional iterations of Gibbs sampling, use the newly obtained topic distribution of each document as the feature vector, and perform classification on the test / train split. [sent-252, score-0.658]
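One round of that simulated session can be outlined as below; the model-object methods are hypothetical placeholders for the steps named in the text, and logistic regression stands in for whatever classifier is applied to the topic-mixture features:

```python
from sklearn.linear_model import LogisticRegression

def simulated_round(model, constraints, words_per_label, strategy, extra_iters,
                    train_idx, test_idx, labels):
    # 1. Grow each constraint by one more label-specific word (one per newsgroup label).
    for label, words in words_per_label.items():
        if words:
            constraints[label].add(words.pop(0))
    model.set_constraints(constraints)      # hypothetical: rebuild the tree-structured prior
    model.ablate(strategy)                  # hypothetical: All / Doc / Term / None unassignment
    model.sample(extra_iters)               # hypothetical: additional Gibbs iterations
    theta = model.document_topics()         # hypothetical: per-document topic mixtures (D x K)
    # 2. Use the topic mixtures as features on the fixed train/test split.
    clf = LogisticRegression(max_iter=1000).fit(theta[train_idx], labels[train_idx])
    return 1.0 - clf.score(theta[test_idx], labels[test_idx])   # error rate for this round
```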
71 More iterations make it harder for the constraints to influence the topic assignment. [sent-261, score-0.691]
72 3 Investigating Ablation Strategies First, we investigate which ablation strategy best allows constraints to be incorporated. [sent-263, score-0.441]
73 Each is averaged over five different chains using 10 additional iterations of Gibbs sampling per round (other numbers of iterations are discussed in Section 6. [sent-265, score-0.337]
74 All Initial runs the model for only the initial number of iterations (100 iterations in this experiment), while All Full runs the model for the total number of iterations added for the interactive version. [sent-271, score-0.459]
75 (That is, if there were 21 rounds and each round of interactive modeling added 10 iterations, All Full would have 210 iterations more than All Initial). [sent-272, score-0.349]
76 All Full is a lower baseline for the error rate since it both sees the constraints at the beginning and also runs for the maximum number of total iterations. [sent-274, score-0.288]
77 All Initial sees the constraints before the other ablation techniques but it has fewer total iterations. [sent-275, score-0.43]
78 Figure 3: Error rate (y-axis, lower is better) using different ablation strategies as additional constraints are added (x-axis: words per constraint). [sent-285, score-0.699]
79 The results of None, Term, Doc are more stable (as denoted by the error bars), and the error rate is reduced gradually as more constraint words are added. [sent-288, score-0.295]
80 The error rate of each interactive ablation strategy is (as expected) between the lower and upper baselines. [sent-289, score-0.336]
81 Generally, the constraints will influence not only the topics of the constraint words, but also the topics of the constraint words’ context in the same document. [sent-290, score-0.942]
82 Doc ablation gives more freedom for the constraints to overcome the inertia of the old topic distribution and move towards a new one influenced by the constraints. [sent-291, score-0.852]
83 For all numbers of additional iterations, while the Null serves as the upper baseline on the error rate in all cases, the Doc ablation clearly outperforms the other ablation schemes, consistently yielding a lower error rate. [sent-298, score-0.46]
84 The luxury of having hundreds or thousands of additional iterations for each constraint would be impractical. Figure 4: Classification accuracy by strategy and number of additional iterations (10, 20, 30, 50, or 100). [sent-301, score-0.568]
85 The Doc ablation strategy performs best, suggesting that the document context is important for ablation constraints. [sent-302, score-0.465]
86 7 Getting Humans in the Loop To move beyond using simulated users adding the same words regardless of what topics were discovered by the model, we needed to expose the model to human users. [sent-307, score-0.393]
87 , 2009), and supplement traditional inference techniques for topic models (Chang, 2010). [sent-310, score-0.417]
88 Users create constraints by clicking on words from the topic word lists. [sent-315, score-0.611]
89 Users see the topics discovered by the model and select words (by clicking on them) to build constraints to be added to the model. [sent-321, score-0.512]
90 2 Untrained users and ITM Most of the large (10+ word) user-created constraints corresponded to the themes of the individual newsgroups, which users were able to infer from the discovered topics. [sent-329, score-0.518]
91 A topic did emerge with these words, but the rest of the words in that topic seemed random, as male first names are not likely to co-occur in the same document. [sent-347, score-0.728]
92 Not all sensible constraints led to successful topic changes. [sent-348, score-0.578]
93 In general, the more words in the constraint, the more likely it was to noticeably affect the topic distribution. [sent-352, score-0.364]
94 A constraint with more words will cause the topic assignments to be reset for more documents. [sent-354, score-0.666]
95 8 Discussion In this work, we introduced a means for end-users to refine and improve the topics discovered by topic models. [sent-355, score-0.574]
96 We demonstrated that even novice users are able to understand and build constraints using a simple interface and that their constraints can improve the model’s ability to capture the latent structure of a corpus. [sent-357, score-0.644]
97 As we learn their needs, we can add more avenues for interacting with topic models. [sent-362, score-0.364]
98 Incorporating domain knowledge into topic modeling via Dirichlet forest priors. [sent-369, score-0.391]
99 Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora. [sent-461, score-0.364]
100 Efficient methods for topic model inference on streaming document collections. [sent-493, score-0.469]
wordName wordTfidf (topN-words)
[('bark', 0.43), ('topic', 0.364), ('itm', 0.23), ('plant', 0.224), ('constraints', 0.214), ('constraint', 0.207), ('lda', 0.186), ('ablation', 0.186), ('leash', 0.184), ('dog', 0.182), ('topics', 0.157), ('andrzejewski', 0.123), ('factory', 0.122), ('iterations', 0.113), ('doc', 0.112), ('users', 0.108), ('assignments', 0.095), ('soviet', 0.07), ('vanilla', 0.07), ('unconstrained', 0.067), ('gibbs', 0.067), ('newsgroups', 0.067), ('interactive', 0.065), ('complaints', 0.061), ('emperical', 0.061), ('sampler', 0.059), ('interactively', 0.059), ('dirichlet', 0.057), ('round', 0.056), ('sampling', 0.055), ('added', 0.055), ('discovered', 0.053), ('chain', 0.053), ('inference', 0.053), ('document', 0.052), ('tree', 0.052), ('state', 0.05), ('hyperparameters', 0.048), ('simulated', 0.047), ('inertia', 0.046), ('unassignment', 0.046), ('latent', 0.045), ('error', 0.044), ('user', 0.043), ('constrained', 0.042), ('distribution', 0.042), ('jordan', 0.041), ('strategy', 0.041), ('russian', 0.041), ('mechanical', 0.041), ('gruber', 0.041), ('assignment', 0.039), ('unite', 0.037), ('turk', 0.037), ('strategies', 0.037), ('blei', 0.036), ('editorials', 0.035), ('themes', 0.035), ('military', 0.035), ('interface', 0.034), ('markov', 0.034), ('rounds', 0.033), ('mult', 0.033), ('clicking', 0.033), ('clinton', 0.033), ('documents', 0.033), ('null', 0.032), ('wallach', 0.032), ('war', 0.032), ('communist', 0.031), ('cuomo', 0.031), ('dietz', 0.031), ('dinkins', 0.031), ('giuliani', 0.031), ('legislature', 0.031), ('mayor', 0.031), ('missile', 0.031), ('moscow', 0.031), ('omld', 0.031), ('pataki', 0.031), ('puppy', 0.031), ('reagan', 0.031), ('rexa', 0.031), ('rudolph', 0.031), ('shringarpure', 0.031), ('stationary', 0.031), ('treaty', 0.031), ('ramage', 0.031), ('token', 0.03), ('sees', 0.03), ('understand', 0.029), ('emit', 0.029), ('windows', 0.029), ('nation', 0.028), ('distributions', 0.028), ('adding', 0.028), ('multinomial', 0.027), ('modeling', 0.027), ('admixture', 0.027), ('administration', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
2 0.25331676 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
3 0.24962522 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
4 0.24163929 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
5 0.20100036 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
Author: Risa Kitajima ; Ichiro Kobayashi
Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.
6 0.18751 14 acl-2011-A Hierarchical Model of Web Summaries
7 0.17575611 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
8 0.15950547 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
9 0.13444969 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
10 0.12263689 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
11 0.1110779 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
12 0.11074123 177 acl-2011-Interactive Group Suggesting for Twitter
13 0.10862881 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
14 0.1081811 204 acl-2011-Learning Word Vectors for Sentiment Analysis
15 0.097792529 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
16 0.093920015 305 acl-2011-Topical Keyphrase Extraction from Twitter
17 0.087453514 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
18 0.080112904 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
19 0.07043577 109 acl-2011-Effective Measures of Domain Similarity for Parsing
20 0.070252381 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
topicId topicWeight
[(0, 0.194), (1, 0.105), (2, -0.049), (3, 0.125), (4, -0.032), (5, -0.107), (6, -0.131), (7, 0.237), (8, -0.034), (9, 0.073), (10, -0.101), (11, 0.067), (12, 0.16), (13, 0.02), (14, 0.192), (15, 0.051), (16, -0.025), (17, -0.105), (18, -0.126), (19, 0.046), (20, 0.007), (21, 0.132), (22, -0.052), (23, 0.014), (24, 0.017), (25, 0.004), (26, -0.015), (27, 0.086), (28, -0.107), (29, -0.021), (30, 0.011), (31, 0.035), (32, 0.005), (33, 0.037), (34, -0.022), (35, 0.002), (36, 0.03), (37, -0.008), (38, -0.047), (39, -0.016), (40, 0.039), (41, 0.03), (42, -0.011), (43, -0.074), (44, 0.026), (45, 0.058), (46, -0.002), (47, -0.016), (48, -0.013), (49, -0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.98151934 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
2 0.89092219 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering.
3 0.8887369 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
4 0.82521993 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
5 0.82122219 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
6 0.80231094 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
7 0.75873131 305 acl-2011-Topical Keyphrase Extraction from Twitter
8 0.73667943 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.71422273 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
10 0.68651175 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
11 0.60385007 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
12 0.4543812 109 acl-2011-Effective Measures of Domain Similarity for Parsing
13 0.44700629 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
14 0.4338375 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues
15 0.43210384 150 acl-2011-Hierarchical Text Classification with Latent Concepts
16 0.39782244 285 acl-2011-Simple supervised document geolocation with geodesic grids
17 0.36982527 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
18 0.35849392 177 acl-2011-Interactive Group Suggesting for Twitter
19 0.3574084 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
20 0.34269956 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
topicId topicWeight
[(5, 0.03), (17, 0.051), (26, 0.035), (31, 0.012), (37, 0.074), (39, 0.079), (41, 0.068), (55, 0.034), (59, 0.059), (72, 0.036), (87, 0.219), (91, 0.036), (96, 0.128), (97, 0.024)]
simIndex simValue paperId paperTitle
1 0.92510498 239 acl-2011-P11-5002 k2opt.pdf
Author: empty-author
Abstract: unkown-abstract
2 0.8376565 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing
Author: Nathan Bodenstab ; Kristy Hollingshead ; Brian Roark
Abstract: We present a novel pruning method for context-free parsing that increases efficiency by disallowing phrase-level unary productions in CKY chart cells spanning a single word. Our work is orthogonal to recent work on “closing” chart cells, which has focused on multi-word constituents, leaving span-1 chart cells unpruned. We show that a simple discriminative classifier can learn with high accuracy which span-1 chart cells to close to phrase-level unary productions. Eliminating these unary productions from the search can have a large impact on downstream processing, depending on implementation details of the search. We apply our method to four parsing architectures and demonstrate how it is complementary to the cell-closing paradigm, as well as other pruning methods such as coarse-to-fine, agenda, and beam-search pruning.
same-paper 3 0.79939777 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
4 0.66275966 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
Author: Lonneke van der Plas ; Paola Merlo ; James Henderson
Abstract: Broad-coverage semantic annotations for training statistical learners are only available for a handful of languages. Previous approaches to cross-lingual transfer of semantic annotations have addressed this problem with encouraging results on a small scale. In this paper, we scale up previous efforts by using an automatic approach to semantic annotation that does not rely on a semantic ontology for the target language. Moreover, we improve the quality of the transferred semantic annotations by using a joint syntacticsemantic parser that learns the correlations between syntax and semantics of the target language and smooths out the errors from automatic transfer. We reach a labelled F-measure for predicates and arguments of only 4% and 9% points, respectively, lower than the upper bound from manual annotations.
5 0.66025078 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
Author: Joel Lang ; Mirella Lapata
Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.
6 0.65936065 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
7 0.65148419 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
8 0.64979041 182 acl-2011-Joint Annotation of Search Queries
9 0.64968717 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
10 0.64885873 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
11 0.64873737 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
12 0.64698076 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
13 0.64670944 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
14 0.64657259 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
15 0.64605576 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
16 0.64583409 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
17 0.64464927 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
18 0.64440203 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
19 0.64317 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
20 0.64308995 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing