emnlp emnlp2011 emnlp2011-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Mimno ; David Blei
Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
Reference: text
sentIndex sentText sentNum sentScore
1 mimno@cs.princeton.edu Abstract Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? [sent-3, score-0.776]
2 We present a Bayesian method for measuring how well a topic model fits a corpus. [sent-4, score-0.41]
3 Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved. [sent-6, score-0.41]
4 1 Introduction Probabilistic topic models are a suite of machine learning algorithms that decompose a corpus into a set of topics and represent each document with a subset of those topics. [sent-7, score-0.67]
5 The inferred topics often correspond with the underlying themes of the analyzed collection, and the topic modeling algorithm organizes the documents according to those themes. [sent-8, score-0.752]
6 Most topic models are evaluated by their predictive performance on held out data. [sent-9, score-0.517]
7 The idea is that topic models are fit to maximize the likelihood (or posterior probability) of a collection of documents, and so a good model is one that assigns high likelihood to a held out set (Blei et al. [sent-10, score-0.678]
8 But this evaluation is not in line with how topic models are frequently used. [sent-13, score-0.41]
9 In this paper, we develop and study new methods for evaluating topic models. [sent-19, score-0.41]
10 The key to a posterior predictive check is the discrepancy function. [sent-28, score-0.717]
11 While the model is often chosen for computational reasons, the discrepancy function might capture aspects of the data that are desirable but difficult to model. [sent-30, score-0.401]
12 In this work, we will design a discrepancy function to measure an independence assumption that is implicit in the modeling assumptions but is not enforced in the posterior. [sent-31, score-0.512]
13 We will embed this function in a posterior predictive check and use it to evaluate and visualize topic models in new ways. [sent-32, score-0.726]
14 LDA assumes that each observed word in a corpus is assigned to a topic, and that the words assigned to the same topic are drawn independently from the same multinomial distribution (Blei et al. [sent-36, score-0.705]
15 For each topic, we measure whether this assumption holds by computing the mutual information between the words assigned to that topic and which document each word appeared in. [sent-38, score-0.6]
16 We embed this discrepancy in a PPC and study it in several ways. [sent-40, score-0.401]
17 First, we focus on topics that model their observations well; this helps separate interpretable topics from noisy topics (and “boilerplate” topics, which exhibit too little noise). [sent-41, score-0.579]
18 Finally, we validate this strategy by simulating data from a topic model, and assessing whether the PPC “accepts” the resulting data. [sent-45, score-0.41]
19 2 Probabilistic Topic Modeling Probabilistic topic models are statistical models of text that assume that a small number of distributions over words, called “topics,” are used to generate the observed documents. [sent-46, score-0.457]
20 One of the simplest topic models is latent Dirichlet allocation (LDA) (Blei et al. [sent-47, score-0.443]
21 In LDA, a set of K topics describes a corpus; each document exhibits the topics with different proportions. [sent-49, score-0.453]
22 For each word: (a) choose topic assignment zd,n ∼ θd; (b) choose word wd,n ∼ φzd,n. [sent-58, score-0.41]
23 This process articulates the statistical assumptions behind LDA: Each document is endowed with its own set of topic proportions θd, but the same set of topics φ1:K governs the whole collection. [sent-60, score-0.777]
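The generative process described above can be sketched directly. The symmetric hyperparameters `alpha` and `beta` and the corpus dimensions below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_lda_corpus(n_docs=100, doc_len=50, K=5, V=1000, alpha=0.1, beta=0.01):
    """Sample a corpus from the LDA generative process sketched above."""
    # Topics phi_{1:K} are shared across the whole collection.
    phi = rng.dirichlet(beta * np.ones(V), size=K)
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha * np.ones(K))        # per-document proportions theta_d
        z = rng.choice(K, size=doc_len, p=theta)         # (a) topic assignments z_{d,n}
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # (b) words w_{d,n} ~ phi_{z_{d,n}}
        corpus.append((w, z))
    return phi, corpus

phi, corpus = generate_lda_corpus()
```

Because every topic's word distribution is shared across documents, data generated this way satisfies the independence assumption that the discrepancy below is designed to check.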
24 Notice that the probability of a word is independent of its document θd given its topic assignment zd,n (i.e., wd,n ⊥ θd | zd,n). [sent-61, score-0.477]
25 Two documents might have different overall probabilities of containing a word from the “vegetables” topic; however, all the words in the collection (regardless of their documents) drawn from that topic will be drawn from the same multinomial distribution. [sent-64, score-0.675]
26 Given a collection of documents, the problem is to compute the conditional distribution of the hidden variables—the topics φk, topic proportions θd, and topic assignments zd,n. [sent-66, score-1.098]
27 We focus on using them as an exploratory tool, where we assume that the topic model posterior provides a good decomposition of the corpus and that the topics provide good summaries of the corpus contents. [sent-73, score-0.78]
28 In complicated Bayesian models, such as topic models, Bayesian model checking can point to the parts of the posterior that better fit the observed data set and are more likely to suggest something meaningful about it. [sent-84, score-0.806]
29 In particular, we will develop posterior predictive checks (PPC) for topic models. [sent-85, score-0.733]
30 In a PPC, we specify a discrepancy function, which is a function of the data that measures an important property that we want the model to capture. [sent-86, score-0.401]
31 An innovation in PPCs is the realized discrepancy function (Gelman et al. [sent-89, score-0.467]
32 Realized discrepancies induce a traditional discrepancy by marginalizing out the hidden variables. [sent-91, score-0.464]
33 In topic models, as we will see below, we use a realized discrepancy to factor the observations and to check specific components of the model that are discovered by the posterior. [sent-93, score-0.543]
34 1 A realized discrepancy for LDA Returning to LDA, we design a discrepancy function that checks the independence assumption of words given their topic assignments. [sent-95, score-1.353]
35 As we mentioned above, given the topic assignment z the word w should be independent of its document θ. [sent-96, score-0.477]
36 Consider a decomposition of a corpus from LDA, which assigns every observed word wd,n to a topic zd,n. [sent-97, score-0.457]
37 Now restrict attention to all the words assigned to the kth topic and form two random variables: W are the words assigned to the topic and D are the document indices of the words assigned to that topic. [sent-98, score-0.977]
38 (1) where N(w, d, k) is the number of tokens of type w in topic k in document d, with N(w, k) = Σd N(w, d, k), N(d, k) = Σw N(w, d, k), and N(k) = Σw,d N(w, d, k). [sent-101, score-0.477]
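Given the word-by-document count matrix for a single topic, the mutual information of Eq. 1 can be computed directly. The function name and the dense-matrix representation are illustrative assumptions:

```python
import numpy as np

def topic_mutual_information(N_wd):
    """MI(W, D | k) of Eq. 1, given the word-by-document count matrix
    N(w, d, k) for the tokens assigned to a single topic k."""
    joint = N_wd / N_wd.sum()                  # p(w, d | k)
    p_w = joint.sum(axis=1, keepdims=True)     # p(w | k) = N(w, k) / N(k)
    p_d = joint.sum(axis=0, keepdims=True)     # p(d | k) = N(d, k) / N(k)
    mask = joint > 0                           # 0 log 0 = 0 by convention
    return float((joint[mask] * np.log(joint[mask] / (p_w @ p_d)[mask])).sum())
```

If words land in documents independently of document identity, the joint distribution factorizes and the score is zero; dependence between words and documents pushes it up.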
39 This distribution is estimated from counts of how many words assigned to topic k appeared in each document. [sent-110, score-0.44]
40 The first term of Eq. 2 is the entropy—some topics are evenly distributed across many documents (high entropy); others are concentrated in fewer documents (low entropy). [sent-112, score-0.436]
41 The second term conditions this distribution on a particular word type w by normalizing the per-document number of times w appeared in each document (in topic k). [sent-113, score-0.53]
42 Some words are used in a narrower set of documents than the overall distribution over documents for the topic, and IMI(w, D|k) will be high. [sent-118, score-0.562]
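The per-word score IMI(w, D|k) = H(D|k) − H(D|w, k) can be sketched under the same dense count-matrix representation (an illustrative assumption; the function names are hypothetical):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                                   # 0 log 0 = 0 by convention
    return float(-(p * np.log(p)).sum())

def imi_scores(N_wd):
    """IMI(w, D | k) = H(D | k) - H(D | w, k) for every word type in one
    topic, given the word-by-document count matrix for that topic."""
    H_dk = entropy(N_wd.sum(axis=0) / N_wd.sum())  # entropy of the topic over documents
    scores = np.full(N_wd.shape[0], np.nan)        # NaN for unused word types
    for w in range(N_wd.shape[0]):
        n_w = N_wd[w].sum()
        if n_w > 0:
            scores[w] = H_dk - entropy(N_wd[w] / n_w)
    return scores
```

Words concentrated in fewer documents than the topic overall get high IMI; words spread more evenly than the topic overall get low, even negative, IMI, matching the "boilerplate" behavior described above.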
43 We illustrate this discrepancy in Figure 1, which shows nine topics trained from the New York Times. [sent-119, score-0.594]
44 The discrepancy captures different kinds of structure in the topics. [sent-126, score-0.401]
45 The top left topic represents formulaic language, language that occurs verbatim in many documents. [sent-127, score-0.445]
46 Identifying repeated phrases is a common phenomenon in topic models. [sent-131, score-0.41]
47 Most words show lower than expected IMI, indicating that word use in this topic is less variable than data drawn from a multinomial distribution. [sent-132, score-0.579]
48 The middle-left topic is an example of a good topic, according to this discrepancy, which is related to Iraqi politics. [sent-133, score-0.41]
49 The bottom-left topic is an example of the opposite extreme from the top-left. [sent-134, score-0.41]
50 2 Posterior Predictive Checks for LDA Intuitively, the middle row of topics in Figure 1 are the sort of topics we look for in a model, while the top and bottom rows contain topics that are less useful. [sent-137, score-0.616]
51 For example, lower-ranked, less frequent words within a topic tend to have higher IMI scores than higher-ranked, more frequent words. [sent-148, score-0.41]
52 These words fit the multinomial assumption: any word assigned to this topic is equally likely to be Iraqi. [sent-160, score-0.635]
53 This topic combines many terms with only coincidental similarity, such as Mets pitcher Grant Roberts and the firm Kohlberg Kravis Roberts. [sent-163, score-0.41]
54 The formulaic Weekend topic has significantly lower than expected MI. [sent-167, score-0.479]
55 [Figure 2 panel titles: Topic850, Topic628, Topic87; x-axis: Deviance] Figure 2: News: Observed topic scores (vertical lines) relative to replicated scores, rescaled so that replications have zero mean and unit variance. [sent-168, score-0.77]
56 The Weekend topic (top) has lower than expected MI. [sent-169, score-0.444]
57 For most topics the actual discrepancy is outside the range of any replicated discrepancies. [sent-172, score-0.782]
58 In their original formulation, PPCs prescribe computing a tail probability of a replicated discrepancy being greater than (or less than) the observed discrepancy under the posterior predictive distribution. [sent-173, score-1.38]
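The tail probability can be estimated from a sample of replicated discrepancies; this helper is an illustrative sketch:

```python
import numpy as np

def ppc_p_value(observed, replicated):
    """Posterior predictive p-value: the tail probability that a replicated
    discrepancy is at least as large as the observed discrepancy."""
    return float((np.asarray(replicated) >= observed).mean())
```

A p-value near 0 or 1 means the observed discrepancy sits in the tail of the replicated distribution, i.e. the model misfits the data with respect to this discrepancy.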
59 We then compute how many standard deviations the observed discrepancy is from the mean of the replicated discrepancies. [sent-179, score-0.736]
60 The observed value is 31.8 standard deviations below the mean replicated value, and thus has deviance of −31.8. [sent-182, score-0.531]
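This deviance computation might be sketched as follows. The replication scheme here—redrawing each token's word type from the topic's empirical word distribution while holding the per-document token counts fixed—is a simplifying assumption standing in for draws from the posterior predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(N_wd):
    # MI(W, D | k) of Eq. 1 from a word-by-document count matrix for one topic.
    joint = N_wd / N_wd.sum()
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())

def deviance(N_wd, n_rep=100):
    """Standard deviations between the observed discrepancy and the mean
    replicated discrepancy, under the simplified replication scheme above."""
    observed = mi(N_wd)
    p_w = N_wd.sum(axis=1) / N_wd.sum()          # empirical word distribution of topic k
    n_d = N_wd.sum(axis=0).astype(int)           # tokens per document in topic k
    reps = np.array([
        mi(np.stack([rng.multinomial(n, p_w) for n in n_d], axis=1))
        for _ in range(n_rep)
    ])
    return (observed - reps.mean()) / reps.std()
```

Because finite counts make the replicated MI positive even under independence, comparing against the replication distribution—rather than against zero—is what makes the check calibrated.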
61 This matches our intuition that the former topic is more useful than the latter. [sent-187, score-0.41]
62 4 Searching for Systematic Deviations We demonstrated that the mutual information discrepancy function can detect violations of multinomial assumptions, in which instances of a term in a given topic are not independently distributed among documents. [sent-188, score-0.687]
63 LDA is the simplest generative topic model, and researchers have developed many variants of LDA that account for a variety of variables that can be found or measured with a corpus. [sent-192, score-0.41]
64 In this section, we show how we can use the mutual information discrepancy function of Equation 1 and PPCs to guide our choice of which topic model to fit. [sent-196, score-0.904]
65 The discrepancy functions are large when words appear more than expected in some groups and less than expected in others. [sent-198, score-0.469]
66 If we combine documents randomly in a meaningless grouping, such deviance should decrease, as differences between documents are “smoothed out.” [sent-200, score-0.441]
67 If a grouping of documents shows equal or greater deviation, we can assume that the grouping preserves the underlying structure of the systematic deviation from the multinomial assumption, and that further modeling or visualization using that grouping might be useful. [sent-201, score-0.719]
68 1 PPCs for systematic discrepancy The idea is that the words assigned to a topic should be independent of both document and any other variable that might be associated with the document. [sent-203, score-0.957]
69 Three ways of grouping words in a topic from the New York Times. [sent-205, score-0.532]
70 If the topic modeling assumptions hold, the words are independent of both these variables. [sent-209, score-0.485]
71 If we see a significant discrepancy relative to a grouping defined by a metadata feature, this systematic variability suggests that we might want to take that feature into account in the model. [sent-210, score-0.613]
72 Let N(w, g, k) = Σd N(w, d, k) I[γd = g], that is, the number of words of type w in topic k in documents in group g, and define the other count variables similarly. [sent-212, score-0.509]
73 We can now substitute these group-specific counts for the documentspecific counts in the discrepancy function in Eq. [sent-213, score-0.401]
74 Note that the previous discrepancy functions are equivalent to a trivial grouping, in which each document is the only member of its own group. [sent-215, score-0.468]
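Collapsing the document counts into group counts is a small operation; `group_counts` is a hypothetical helper name:

```python
import numpy as np

def group_counts(N_wd, labels):
    """Collapse N(w, d, k) into N(w, g, k) = sum_d N(w, d, k) * I[gamma_d = g],
    where labels[d] gives the metadata group gamma_d of document d."""
    labels = np.asarray(labels)
    return np.stack([N_wd[:, labels == g].sum(axis=1) for g in np.unique(labels)],
                    axis=1)
```

The result is a word-by-group matrix, so the same discrepancy function applies unchanged; labeling each document as its own group recovers the original per-document check.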
75 In the following experiments we explore groupings by published volume, blog, preferred political candidate, and newspaper desk, and evaluate the effect of those groupings on the deviation between mean replicated values and observed values of those functions. [sent-216, score-0.646]
76 Figure 3 shows three groupings of words for the middle-left topic in Figure 1: by document, by month of publication (e. [sent-228, score-0.587]
77 We summarize each grouping by plotting the distribution of deviance scores for all topics. [sent-236, score-0.418]
78 We calculate the number of standard deviations between the mean replicated discrepancy and the actual discrepancy for each topic under three groupings. [sent-242, score-1.5]
79 Words with the largest MI from a topic on Iraq’s government are shown, with individual scores grouped by month. [sent-248, score-0.41]
80 This corpus has previously been considered in the context of aspect-based topic models (Ahmed and Xing, 2010) that assign distinct word distributions to liberal and conservative bloggers. [sent-255, score-0.41]
81 Figure 6 shows the distribution of standard deviations from the mean replicated value for a set of 150 topics grouped by document, blog, and preferred candidate. [sent-258, score-0.564]
82 Grouping by blogs appears to show greater deviance from mean replicated values than grouping by candidates, indicating that there is further structure in word choice beyond a simple liberal/conservative split. [sent-262, score-0.69]
83 To determine whether this particular assignment of documents to blogs is responsible for the difference in discrepancy functions or whether any such split would have greater deviance, we compared random groupings to the real groupings and recalculated the PPC. [sent-265, score-0.944]
84 We generated 10 such groupings by permuting document blog labels and another 10 by permuting document candidate labels, each time holding the topics fixed. [sent-266, score-0.611]
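The permutation comparison can be sketched as follows, using the grouped form of the Eq. 1 discrepancy; the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(N):
    # MI of Eq. 1 from a count matrix (words by documents, or words by groups).
    joint = N / N.sum()
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())

def grouped_mi(N_wd, labels):
    # Collapse document columns into metadata-group columns, then score.
    groups = np.unique(labels)
    return mi(np.stack([N_wd[:, labels == g].sum(axis=1) for g in groups], axis=1))

def permutation_check(N_wd, labels, n_perm=10):
    """Grouped discrepancy under the real labels versus groupings obtained
    by permuting the document labels, with the topic held fixed."""
    real = grouped_mi(N_wd, labels)
    null = [grouped_mi(N_wd, rng.permutation(labels)) for _ in range(n_perm)]
    return real, null
```

If the real grouping's discrepancy clearly exceeds those of the permuted groupings, the metadata feature is tracking genuine structure rather than an arbitrary split of the documents.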
85 Grouping by prime minister shows greater average deviance than grouping by volumes, even though there are substantially fewer divisions. [sent-285, score-0.424]
86 Figure 8 shows the most “mismatching” words for a topic with the most probable words ships, vessels, admiralty, iron, ship, navy, consistent with changes in naval technology during the Victorian era (that is, wooden ships to “ironclads”). [sent-288, score-0.532]
87 Words that occur more prominently in the topic (ships, vessels) are also variable, but more consistent across time. [sent-289, score-0.41]
88 In the previous sections, we have considered PPCs that explore variability within a topic on a per-word basis, measure discrepancy at the topic level, and compare deviance over all topics between groupings of documents. [sent-304, score-1.834]
89 When documents are generated from a multinomial topic model, PPCs should not detect systematic deviation. [sent-307, score-0.662]
90 Figure 9: Replicating only documents with large allocation in the topic leads to more uniform p-values. [sent-335, score-0.58]
91 We then trained a topic model with the same hyperparameters and number of topics on each corpus, saving a Gibbs sampling state. [sent-343, score-0.603]
92 A histogram of p-values for 200 synthetic topics after 100 replications is shown in the left panel of Figure 9. [sent-348, score-0.418]
93 For some models, the posterior distribution is too close to the data, so all replicated values are close to the real value, leading to p-values clustered around 0.5. [sent-350, score-0.418]
94 6 Conclusions We have developed a Bayesian model checking method for probabilistic topic models. [sent-361, score-0.491]
95 Conditioned on their topic assignment, the words of the documents are independently and identically distributed according to a multinomial distribution. [sent-362, score-0.658]
96 We developed a realized discrepancy function—the mutual information between words and document indices, conditioned on a topic—that checks this assumption. [sent-363, score-0.666]
97 We demonstrated that we can use this posterior predictive check to identify particular topics that fit the data, and particular topics that misfit the data in different ways. [sent-365, score-0.793]
98 Moreover, our method provides a new way to visualize topic models. [sent-366, score-0.41]
99 Finally, on simulated data, we demonstrated that PPCs with the mutual information discrepancy function can identify model fit and model misfit. [sent-369, score-0.585]
100 A joint topic and perspective model for ideological discourse. [sent-439, score-0.44]
wordName wordTfidf (topN-words)
[('topic', 0.41), ('discrepancy', 0.401), ('deviance', 0.243), ('topics', 0.193), ('imi', 0.19), ('replicated', 0.188), ('posterior', 0.177), ('ppc', 0.165), ('ppcs', 0.156), ('groupings', 0.136), ('lda', 0.123), ('grouping', 0.122), ('replications', 0.12), ('predictive', 0.107), ('multinomial', 0.104), ('iraq', 0.104), ('deviations', 0.1), ('documents', 0.099), ('mutual', 0.093), ('fit', 0.091), ('ships', 0.087), ('checking', 0.081), ('blogs', 0.078), ('blog', 0.078), ('blei', 0.077), ('synthetic', 0.075), ('assumptions', 0.075), ('gelman', 0.07), ('iron', 0.07), ('weekend', 0.07), ('document', 0.067), ('mimno', 0.067), ('political', 0.067), ('realized', 0.066), ('discrepancies', 0.063), ('parliament', 0.06), ('vessels', 0.06), ('greater', 0.059), ('bayesian', 0.058), ('ks', 0.054), ('distribution', 0.053), ('desk', 0.052), ('desks', 0.052), ('kurdish', 0.052), ('modeler', 0.052), ('rescaled', 0.052), ('sadr', 0.052), ('roberts', 0.05), ('themes', 0.05), ('systematic', 0.049), ('observed', 0.047), ('princeton', 0.047), ('distributed', 0.045), ('deviation', 0.042), ('xing', 0.042), ('variability', 0.041), ('month', 0.041), ('dirichlet', 0.04), ('checks', 0.039), ('uniform', 0.038), ('middle', 0.037), ('independence', 0.036), ('varies', 0.036), ('ahmed', 0.035), ('volumes', 0.035), ('asuncion', 0.035), ('bayarri', 0.035), ('boilerplate', 0.035), ('clads', 0.035), ('crepancy', 0.035), ('draper', 0.035), ('formulaic', 0.035), ('instantaneous', 0.035), ('iraqi', 0.035), ('maliki', 0.035), ('ministers', 0.035), ('orc', 0.035), ('permuting', 0.035), ('recalculate', 0.035), ('shiite', 0.035), ('speeches', 0.035), ('sunni', 0.035), ('topdocs', 0.035), ('topwordsdocs', 0.035), ('wooden', 0.035), ('expected', 0.034), ('cmu', 0.033), ('mi', 0.033), ('allocation', 0.033), ('lack', 0.032), ('check', 0.032), ('proportions', 0.032), ('drawn', 0.031), ('preferred', 0.03), ('arthur', 0.03), ('doyle', 0.03), ('ideological', 0.03), ('panel', 0.03), ('princet', 0.03), ('assigned', 
0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 21 emnlp-2011-Bayesian Checking for Topic Models
Author: David Mimno ; David Blei
Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
2 0.35580391 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models
Author: David Mimno ; Hanna Wallach ; Edmund Talley ; Miriam Leenders ; Andrew McCallum
Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
3 0.29972231 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
Author: Weiwei Guo ; Mona Diab
Abstract: In this paper, we propose a novel topic model based on incorporating dictionary definitions. Traditional topic models treat words as surface strings without assuming predefined knowledge about word meaning. They infer topics only by observing surface word co-occurrence. However, the co-occurred words may not be semantically related in a manner that is relevant for topic coherence. Exploiting dictionary definitions explicitly in our model yields a better understanding of word semantics leading to better text modeling. We exploit WordNet as a lexical resource for sense definitions. We show that explicitly modeling word definitions helps improve performance significantly over the baseline for a text categorization task.
4 0.15601274 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Author: Matthias Hartung ; Anette Frank
Abstract: This paper introduces an attribute selection task as a way to characterize the inherent meaning of property-denoting adjectives in adjective-noun phrases, such as e.g. hot in hot summer denoting the attribute TEMPERATURE, rather than TASTE. We formulate this task in a vector space model that represents adjectives and nouns as vectors in a semantic space defined over possible attributes. The vectors incorporate latent semantic information obtained from two variants of LDA topic models. Our LDA models outperform previous approaches on a small set of 10 attributes with considerable gains on sparse representations, which highlights the strong smoothing power of LDA models. For the first time, we extend the attribute selection task to a new data set with more than 200 classes. We observe that large-scale attribute selection is a hard problem, but a subset of attributes performs robustly on the large scale as well. Again, the LDA models outperform the VSM baseline.
5 0.089411214 107 emnlp-2011-Probabilistic models of similarity in syntactic context
Author: Diarmuid O Seaghdha ; Anna Korhonen
Abstract: This paper investigates novel methods for incorporating syntactic information in probabilistic latent variable models of lexical choice and contextual similarity. The resulting models capture the effects of context on the interpretation of a word and in particular its effect on the appropriateness of replacing that word with a potentially related one. Evaluating our techniques on two datasets, we report performance above the prior state of the art for estimating sentence similarity and ranking lexical substitutes.
6 0.087509878 114 emnlp-2011-Relation Extraction with Relation Topics
7 0.077900603 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
8 0.077507943 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
9 0.071397461 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
10 0.063997947 130 emnlp-2011-Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization
11 0.063164778 128 emnlp-2011-Structured Relation Discovery using Generative Models
12 0.058333829 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
13 0.053873103 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
14 0.053211272 144 emnlp-2011-Unsupervised Learning of Selectional Restrictions and Detection of Argument Coercions
15 0.047813069 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
16 0.047670003 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
17 0.046840478 125 emnlp-2011-Statistical Machine Translation with Local Language Models
18 0.044059083 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts
19 0.043564782 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
20 0.043301456 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
topicId topicWeight
[(0, 0.164), (1, -0.162), (2, -0.179), (3, -0.281), (4, -0.084), (5, 0.431), (6, 0.112), (7, -0.032), (8, 0.029), (9, -0.101), (10, -0.019), (11, 0.075), (12, -0.015), (13, 0.013), (14, 0.063), (15, -0.104), (16, -0.005), (17, -0.14), (18, -0.011), (19, -0.074), (20, 0.047), (21, 0.087), (22, -0.009), (23, 0.089), (24, 0.012), (25, 0.044), (26, -0.151), (27, -0.022), (28, 0.005), (29, -0.014), (30, 0.006), (31, -0.063), (32, 0.003), (33, -0.021), (34, -0.039), (35, -0.037), (36, -0.052), (37, 0.097), (38, 0.011), (39, 0.014), (40, -0.07), (41, 0.007), (42, 0.003), (43, 0.052), (44, 0.087), (45, -0.033), (46, -0.036), (47, -0.048), (48, 0.01), (49, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.98591202 21 emnlp-2011-Bayesian Checking for Topic Models
Author: David Mimno ; David Blei
Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
2 0.97657597 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models
Author: David Mimno ; Hanna Wallach ; Edmund Talley ; Miriam Leenders ; Andrew McCallum
Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
3 0.87066305 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
Author: Weiwei Guo ; Mona Diab
Abstract: In this paper, we propose a novel topic model based on incorporating dictionary definitions. Traditional topic models treat words as surface strings without assuming predefined knowledge about word meaning. They infer topics only by observing surface word co-occurrence. However, the co-occurred words may not be semantically related in a manner that is relevant for topic coherence. Exploiting dictionary definitions explicitly in our model yields a better understanding of word semantics leading to better text modeling. We exploit WordNet as a lexical resource for sense definitions. We show that explicitly modeling word definitions helps improve performance significantly over the baseline for a text categorization task.
4 0.54722226 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Author: Matthias Hartung ; Anette Frank
Abstract: This paper introduces an attribute selection task as a way to characterize the inherent meaning of property-denoting adjectives in adjective-noun phrases, such as e.g. hot in hot summer denoting the attribute TEMPERATURE, rather than TASTE. We formulate this task in a vector space model that represents adjectives and nouns as vectors in a semantic space defined over possible attributes. The vectors incorporate latent semantic information obtained from two variants of LDA topic models. Our LDA models outperform previous approaches on a small set of 10 attributes with considerable gains on sparse representations, which highlights the strong smoothing power of LDA models. For the first time, we extend the attribute selection task to a new data set with more than 200 classes. We observe that large-scale attribute selection is a hard problem, but a subset of attributes performs robustly on the large scale as well. Again, the LDA models outperform the VSM baseline.
5 0.3906934 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: Context-dependent word similarity can be measured over multiple cross-cutting dimensions. For example, lung and breath are similar thematically, while authoritative and superficial occur in similar syntactic contexts, but share little semantic similarity. Both of these notions of similarity play a role in determining word meaning, and hence lexical semantic models must take them both into account. Towards this end, we develop a novel model, Multi-View Mixture (MVM), that represents words as multiple overlapping clusterings. MVM finds multiple data partitions based on different subsets of features, subject to the marginal constraint that feature subsets are distributed according to Latent Dirich- let Allocation. Intuitively, this constraint favors feature partitions that have coherent topical semantics. Furthermore, MVM uses soft feature assignment, hence the contribution of each data point to each clustering view is variable, isolating the impact of data only to views where they assign the most features. Through a series of experiments, we demonstrate the utility of MVM as an inductive bias for capturing relations between words that are intuitive to humans, outperforming related models such as Latent Dirichlet Allocation.
6 0.33403715 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
7 0.26965287 107 emnlp-2011-Probabilistic models of similarity in syntactic context
8 0.26791802 114 emnlp-2011-Relation Extraction with Relation Topics
9 0.2625291 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
10 0.25788927 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
11 0.24461941 130 emnlp-2011-Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization
12 0.24294722 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
13 0.23281482 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
14 0.22782198 144 emnlp-2011-Unsupervised Learning of Selectional Restrictions and Detection of Argument Coercions
15 0.21476024 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
16 0.19494957 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
17 0.18404761 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
18 0.18308936 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation
19 0.18086568 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
20 0.17753091 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
topicId topicWeight
[(23, 0.065), (36, 0.018), (37, 0.018), (45, 0.648), (53, 0.013), (54, 0.017), (57, 0.012), (62, 0.011), (64, 0.02), (66, 0.022), (79, 0.022), (82, 0.013), (96, 0.033), (98, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.98071527 21 emnlp-2011-Bayesian Checking for Topic Models
Author: David Mimno ; David Blei
Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
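The posterior predictive checking procedure this abstract describes can be sketched in a few lines: draw replicated datasets from the fitted model, compute a discrepancy statistic on each, and compare against the observed value. This is a minimal illustration, not the paper's topic-model discrepancies; the fair-coin model and the `sum` statistic below are hypothetical stand-ins.

```python
import random

def posterior_predictive_pvalue(observed, sample_replicate, discrepancy,
                                n_reps=1000, seed=0):
    """Posterior predictive check: the fraction of replicated datasets
    whose discrepancy is at least as extreme as the observed one.
    A p-value near 0 or 1 signals that the model fits the data poorly."""
    rng = random.Random(seed)
    obs_stat = discrepancy(observed)
    extreme = sum(
        discrepancy(sample_replicate(rng)) >= obs_stat
        for _ in range(n_reps)
    )
    return extreme / n_reps

# Toy check: does a fair-coin model fit a sequence with 70 heads in 100 flips?
observed = [1] * 70 + [0] * 30
fair_coin = lambda rng: [rng.randint(0, 1) for _ in range(100)]
p = posterior_predictive_pvalue(observed, fair_coin, sum)
# 70 heads is very unlikely under a fair coin, so p is close to 0.
```

In the paper's setting the replicates would come from the topic model's posterior predictive distribution and the discrepancy would be a user-defined, per-topic statistic; the structure of the check is the same.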
2 0.97078353 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman
Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
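The span-based idea in this abstract can be illustrated with a simple permutation version: within each document containing both words, shuffle the tokens and ask how often the shuffled minimal spans come out as small as the observed ones. This is a simplified stand-in for the paper's null model, and all names below are hypothetical.

```python
import random

def min_span(pos1, pos2):
    """Smallest distance between any occurrence of the two words."""
    return min(abs(a - b) for a in pos1 for b in pos2)

def mean_min_span(doc_list, w1, w2):
    spans = []
    for doc in doc_list:
        p1 = [i for i, tok in enumerate(doc) if tok == w1]
        p2 = [i for i, tok in enumerate(doc) if tok == w2]
        spans.append(min_span(p1, p2))
    return sum(spans) / len(spans)

def span_pvalue(docs, w1, w2, n_perm=1000, seed=0):
    """Permutation test on word-pair spans. Only documents containing
    both words are used; the null model shuffles each such document's
    tokens. Returns the fraction of shuffles whose mean minimal span is
    as small as the observed one (a value near 0 means the pair sits
    unusually close together, suggesting a direct association)."""
    rng = random.Random(seed)
    both = [d for d in docs if w1 in d and w2 in d]
    observed = mean_min_span(both, w1, w2)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for doc in both:
            s = list(doc)
            rng.shuffle(s)
            shuffled.append(s)
        if mean_min_span(shuffled, w1, w2) <= observed:
            hits += 1
    return hits / n_perm
```

For a pair like "hot"/"dog" that is always adjacent in its documents, `span_pvalue` returns a value near 0; for an unrelated pair it hovers around uniform.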
3 0.94606066 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP
Author: Amit Goyal ; Hal Daume III
Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory- and time-efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply the Count-Min sketch to approximate word-pair counts and exhibit its effectiveness on three important NLP tasks. Our experiments demonstrate that on all three tasks we obtain performance comparable to the exact word-pair counts setting and to a state-of-the-art system. Our method scales to 49 GB of unzipped web data using a bounded space of 2 billion counters (8 GB of memory).
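A Count-Min sketch of the kind this abstract applies can be written compactly. The hash construction below (BLAKE2 with a per-row salt) is one possible choice for illustration, not necessarily the authors'; the width and depth values are likewise arbitrary.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: estimates never undercount,
    and overcounting shrinks as the table width grows."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent hash per row, derived by salting BLAKE2.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=str(row).encode(),
                                digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Each row can only over-count (collisions add, never subtract),
        # so the minimum over rows is the tightest estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

# Counting word pairs without storing the pairs themselves:
cms = CountMinSketch()
for pair in [("new", "york")] * 5 + [("machine", "translation")] * 3:
    cms.add(" ".join(pair))
# cms.estimate("new york") is at least 5, and exact unless the two
# pairs collide in every row.
```

The memory footprint is fixed at `width * depth` counters regardless of how many distinct word pairs stream past, which is what lets the paper's setup scale to billions of counters.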
4 0.90248567 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion
Author: Swapna Gottipati ; Jing Jiang
Abstract: In this paper we present a novel approach to entity linking based on statistical language-model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.
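A stripped-down version of language-model retrieval with query expansion can look as follows: query likelihood with Dirichlet smoothing, with the query model interpolated against a context model. This weights all context terms uniformly, whereas the paper's positional language model weighs them by distance; all names and parameter values below are illustrative.

```python
import math
from collections import Counter

def query_likelihood(query_model, doc_tokens, collection, mu=2000):
    """Score a document by log P(q | d) under the query-likelihood
    retrieval model with Dirichlet smoothing."""
    doc = Counter(doc_tokens)
    dlen = len(doc_tokens)
    clen = sum(collection.values())
    score = 0.0
    for term, weight in query_model.items():
        p_c = collection.get(term, 0) / clen
        if p_c == 0:
            continue  # skip terms unseen in the collection
        p = (doc[term] + mu * p_c) / (dlen + mu)
        score += weight * math.log(p)
    return score

def expand_query(query_tokens, context_tokens, alpha=0.7):
    """Interpolate the query language model with a local-context model;
    context terms share the remaining (1 - alpha) probability mass."""
    q = Counter(query_tokens)
    c = Counter(context_tokens)
    model = Counter()
    for t in set(q) | set(c):
        model[t] = (alpha * q[t] / max(sum(q.values()), 1)
                    + (1 - alpha) * c[t] / max(sum(c.values()), 1))
    return model
```

On an ambiguous mention such as "jordan", expanding the query with nearby context terms ("basketball", "nba") shifts the ranking toward the matching candidate document, which is the disambiguation effect the abstract reports.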
5 0.86193687 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
Author: Emily M. Bender ; Dan Flickinger ; Stephan Oepen ; Yi Zhang
Abstract: In order to obtain a fine-grained evaluation of parser accuracy over naturally occurring text, we study 100 examples each of ten reasonably frequent linguistic phenomena, randomly selected from a parsed version of the English Wikipedia. We construct a corresponding set of gold-standard target dependencies for these 1000 sentences, operationalize mappings to these targets from seven state-of-the-art parsers, and evaluate the parsers against this data to measure their level of success in identifying these dependencies.
6 0.74479657 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
7 0.71997166 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models
8 0.69640023 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model
9 0.6785993 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
10 0.64281857 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
11 0.63877982 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
12 0.61894858 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
13 0.60988796 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
14 0.58065552 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
15 0.57746303 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
16 0.56471372 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context
17 0.56207484 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
18 0.55807936 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
19 0.55736244 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion
20 0.55663449 107 emnlp-2011-Probabilistic models of similarity in syntactic context