nips nips2009 nips2009-205 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. [sent-4, score-0.819]
2 In this paper, we explore several classes of structured priors for topic models. [sent-5, score-0.557]
3 We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. [sent-6, score-0.949]
4 The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. [sent-8, score-1.026]
5 Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. [sent-9, score-0.602]
6 In practice, users of topic models are typically faced with two immediate problems: First, extremely common words tend to dominate all topics. [sent-12, score-0.496]
7 Standard practice is to remove “stop words” before modeling using a manually constructed, corpus-specific stop word list and to optimize T by either analyzing probabilities of held-out documents or resorting to a more complicated nonparametric model. [sent-14, score-0.315]
8 Additionally, there has been relatively little work in the machine learning literature on the structure of the prior distributions used in LDA: most researchers simply use symmetric Dirichlet priors with heuristically set concentration parameters. [sent-15, score-0.529]
9 In this paper, we demonstrate that practical implementation issues (handling stop words, setting the number of topics) and theoretical issues involving the structure of Dirichlet priors are intimately related. [sent-18, score-0.291]
10 We start by exploring the effects of classes of hierarchically structured Dirichlet priors over the document–topic distributions and topic–word distributions in LDA. [sent-19, score-0.25]
11 We find that an asymmetric Dirichlet prior over the document–topic distributions combined with a symmetric Dirichlet prior over the topic–word distributions results in significantly better model performance, measured both in terms of the probability of held-out documents and in the quality of inferred topics. [sent-21, score-0.381]
12 Finally, we show that using optimized Dirichlet hyperparameters results in dramatically improved consistency in topic usage as T is increased. [sent-24, score-0.516]
13 Since the priors we advocate (an asymmetric Dirichlet over the document–topic distributions and a symmetric Dirichlet over the topic–word distributions) have significant modeling benefits and can be implemented using highly efficient algorithms, we recommend them as a new standard for LDA. [sent-26, score-0.677]
14 2 Latent Dirichlet Allocation LDA is a generative topic model for documents W = {w^{(1)}, w^{(2)}, . . . , w^{(D)}}. [sent-27, score-0.507]
15 Each document, indexed by d, has a document-specific distribution over topics θ^{(d)}. [sent-39, score-0.268]
16 The prior over the document–topic distributions Θ = {θ^{(1)}, . . . , θ^{(D)}} is also assumed to be a symmetric Dirichlet, this time with concentration parameter α. [sent-43, score-0.262]
17 The tokens in every document w^{(d)} = {w_n^{(d)}}_{n=1}^{N_d} are associated with corresponding topic assignments z^{(d)} = {z_n^{(d)}}_{n=1}^{N_d}, drawn i.i.d. from the document-specific topic distribution θ^{(d)}. [sent-44, score-0.756]
18 For real-world data, documents W are observed, while the corresponding topic assignments Z are unobserved. [sent-55, score-0.612]
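As a point of reference for the generative description above, here is a minimal sketch of sampling a corpus from LDA with symmetric Dirichlet priors. It is written under the paper's convention that a symmetric Dirichlet with concentration parameter α has a uniform base measure (so each component is α/T); the function name, variable names, and default parameter values are illustrative placeholders, not code or settings from the paper.

```python
import numpy as np

def generate_corpus(D, T, V, doc_lengths, alpha=50.0, beta=200.0, seed=0):
    """Sample documents from LDA with symmetric Dirichlet priors.

    alpha: concentration of the symmetric Dirichlet over theta (document-topic).
    beta:  concentration of the symmetric Dirichlet over phi (topic-word).
    """
    rng = np.random.default_rng(seed)
    # Dirichlet(beta * u) and Dirichlet(alpha * u), where u is the uniform base measure.
    phi = rng.dirichlet(np.full(V, beta / V), size=T)     # T x V topic-word distributions
    theta = rng.dirichlet(np.full(T, alpha / T), size=D)  # D x T document-topic distributions
    docs, assignments = [], []
    for d in range(D):
        z = rng.choice(T, size=doc_lengths[d], p=theta[d])    # topic assignments z^(d)
        w = np.array([rng.choice(V, p=phi[t]) for t in z])    # observed tokens w^(d)
        docs.append(w)
        assignments.append(z)
    return docs, assignments, theta, phi
```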
19 Variational methods [3, 16] and MCMC methods [7] are both effective at inferring the latent topic assignments Z. [sent-56, score-0.52]
20 We use MCMC methods throughout this paper—specifically Gibbs sampling [5]—since the internal structure of hierarchical Dirichlet priors is typically inferred using a Gibbs sampling algorithm, which can be easily interleaved with Gibbs updates for Z given W. [sent-59, score-0.348]
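For concreteness, one collapsed Gibbs update for a single topic assignment under symmetric priors over Θ and Φ looks roughly like the sketch below. This is a generic textbook-style update rather than the authors' implementation, and the count-array names (N_td, N_tw, N_t) are assumptions.

```python
import numpy as np

def resample_token(d, n, w, z, N_td, N_tw, N_t, alpha, beta, V, rng):
    """One collapsed Gibbs step for token n of document d (w is its word type).

    N_td[d, t]: count of tokens in document d assigned to topic t
    N_tw[t, w]: count of word type w assigned to topic t (corpus-wide)
    N_t[t]:     total number of tokens assigned to topic t
    alpha, beta are concentration parameters with uniform base measures (alpha/T, beta/V per component).
    """
    t_old = z[d][n]
    # Remove the current assignment from the counts.
    N_td[d, t_old] -= 1
    N_tw[t_old, w] -= 1
    N_t[t_old] -= 1
    T = N_td.shape[1]
    # P(z = t | rest) is proportional to (N_td + alpha/T) * (N_tw + beta/V) / (N_t + beta).
    p = (N_td[d] + alpha / T) * (N_tw[:, w] + beta / V) / (N_t + beta)
    t_new = rng.choice(T, p=p / p.sum())
    # Add the new assignment back into the counts.
    N_td[d, t_new] += 1
    N_tw[t_new, w] += 1
    N_t[t_new] += 1
    z[d][n] = t_new
    return t_new
```

Sweeping this update over every token, with hyperparameter updates interleaved between sweeps, gives the kind of inference loop discussed in the rest of the paper.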
21 3 Priors for LDA The previous section outlined LDA as it is most commonly used—namely with symmetric Dirichlet priors over Θ and Φ with fixed concentration parameters α and β, respectively. [sent-61, score-0.404]
22 The simplest way to vary this choice of prior for either Θ or Φ is to infer the relevant concentration parameter from data, either by computing a MAP estimate [1] or by using an MCMC algorithm such as slice sampling [13]. [sent-62, score-0.289]
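As an illustration of the second option, a single slice-sampling update for a positive concentration parameter can be written as below. This is a minimal sketch following Neal's stepping-out and shrinkage procedure; the function log_posterior is a placeholder for the log of the prior density times the likelihood of the current topic assignments given the parameter, and the window width is arbitrary.

```python
import math
import random

def slice_sample_positive(x0, log_posterior, width=1.0, max_steps=50):
    """One univariate slice-sampling update for a positive parameter.

    Sampling is done on u = log(x) so the parameter stays positive;
    the +u term below is the Jacobian of that change of variables.
    log_posterior(x) must return the unnormalized log density of x > 0.
    """
    def log_target(u):
        return log_posterior(math.exp(u)) + u

    u0 = math.log(x0)
    log_y = log_target(u0) + math.log(random.random())   # slice level under the current point
    left = u0 - width * random.random()                  # step out to bracket the slice
    right = left + width
    steps = max_steps
    while steps > 0 and log_target(left) > log_y:
        left -= width
        steps -= 1
    steps = max_steps
    while steps > 0 and log_target(right) > log_y:
        right += width
        steps -= 1
    while True:                                          # shrink until a point lands in the slice
        u1 = random.uniform(left, right)
        if log_target(u1) > log_y:
            return math.exp(u1)
        if u1 < u0:
            left = u1
        else:
            right = u1
```

In practice x0 would be the current value of a concentration parameter and log_posterior(x) would evaluate log P(Z | x) plus the log density of the broad Gamma prior mentioned later in the paper.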
23 (Figure 1 caption excerpt) (c) Generating four topic assignments {z_n^{(d)}}_{n=1}^{4} from the asymmetric predictive distribution for document d; (f) generating four topic assignments each for documents d and d' from the asymmetric, hierarchical predictive distributions. [sent-65, score-0.397]
24 Alternatively, the uniform base measures in the Dirichlet priors over Θ and Φ can be replaced with nonuniform base measures m and n, respectively. [sent-66, score-0.271]
25 In section 3.1, we describe the effects on the document-specific conditional posterior distributions, or predictive distributions, of replacing u with a fixed asymmetric (i.e., nonuniform) base measure m. [sent-69, score-0.361]
26 In section 3.2, we then treat m as unknown and take a fully Bayesian approach, giving m a Dirichlet prior (with a uniform base measure and concentration parameter α') and integrating it out. [sent-73, score-0.346]
27 The resulting predictive probability of topic t is P(z_{N_d+1}^{(d)} = t | z^{(d)}, αm) = (N_{t|d} + αm_t) / (N_d + α). (4) If topic t does not occur in z^{(d)}, then N_{t|d} will be zero, and the probability of generating z_{N_d+1}^{(d)} = t will be proportional to m_t. [sent-76, score-0.476]
28 In other words, under an asymmetric prior, N_{t|d} is smoothed with a topic-specific quantity αm_t. [sent-77, score-0.29]
29 Consequently, different topics can be a priori more or less probable in all documents. [sent-78, score-0.292]
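Computationally, the predictive distribution in (4) is just a vector operation; the sketch below (with illustrative names) returns it for all topics at once.

```python
import numpy as np

def asymmetric_predictive(N_td_row, alpha, m):
    """P(z_{N_d+1} = t | z^(d), alpha*m) for every topic t, as in equation (4).

    N_td_row: length-T vector of topic counts N_{t|d} for one document
    m:        length-T asymmetric base measure (sums to one)
    """
    N_d = N_td_row.sum()
    return (N_td_row + alpha * m) / (N_d + alpha)
```

When N_{t|d} = 0 the expression reduces to αm_t / (N_d + α), which is exactly the topic-specific smoothing discussed above.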
30 One way of describing the process of generating from (4) is to say that generating a topic assignment z_n^{(d)} is equivalent to setting the value of z_n^{(d)} to the value of some document-specific draw from m. [sent-79, score-0.932]
31 Next, z_2^{(d)} is drawn by either selecting γ_1, with probability proportional to the number of topic assignments that have been previously “matched” to γ_1, or a new draw from m, with probability proportional to α. [sent-84, score-0.673]
32 The next topic assignment is drawn in the same way: existing draws γ_1 and γ_2 are selected with probabilities proportional to the numbers of topic assignments to which they have previously been matched, while with probability proportional to α, z_3^{(d)} is matched to a new draw from m. [sent-86, score-1.258]
33 In general, the probability of a new topic assignment being assigned the value of an existing document-specific draw γ_i from m is proportional to N_d^{(i)}, the number of topic assignments previously matched to γ_i. [sent-88, score-1.191]
34 The predictive probability of topic t in document d is therefore P(z_{N_d+1}^{(d)} = t | Z, αm) = (Σ_{i=1}^{I} N_d^{(i)} δ(γ_i − t) + αm_t) / (N_d + α), (5) where I is the current number of draws from m for document d. [sent-89, score-0.767]
35 Since every topic assignment is matched to a draw from m, Σ_{i=1}^{I} N_d^{(i)} δ(γ_i − t) = N_{t|d}. [sent-90, score-0.595]
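The matching process can also be simulated directly, which makes the equivalence with (5) easy to check. The sketch below is one possible Pólya-urn-style implementation with illustrative names; it tracks the document-specific internal draws as a list of "tables" with counts and values.

```python
import numpy as np

def draw_next_assignment(table_counts, table_values, alpha, m, rng):
    """Generate one topic assignment for a document by the matching process.

    table_counts[i]: number of assignments already matched to internal draw gamma_i
    table_values[i]: the topic value of gamma_i (a previous draw from m)
    Updates the internal-draw bookkeeping in place and returns the sampled topic.
    """
    weights = np.append(np.asarray(table_counts, dtype=float), alpha)
    i = rng.choice(len(weights), p=weights / weights.sum())
    if i < len(table_counts):                 # match an existing internal draw gamma_i
        table_counts[i] += 1
        return table_values[i]
    t_new = rng.choice(len(m), p=m)           # otherwise make a new draw from the base measure m
    table_counts.append(1)
    table_values.append(t_new)
    return t_new
```

Summing table_counts over the internal draws whose value equals t recovers N_{t|d}, so repeatedly calling this function reproduces the predictive probabilities in (5).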
36 We take a fully Bayesian approach, and give m a symmetric Dirichlet prior with concentration parameter α' (as shown in figures 1d and 1e). [sent-94, score-0.379]
37 Giving m a symmetric Dirichlet prior and integrating it out has the effect of replacing m in (5) with a “global” Pólya conditional distribution, shared by the document-specific predictive distributions. [sent-97, score-0.286]
38 Figure 1f depicts the process of drawing eight topic assignments—four for document d and four for document d'. [sent-98, score-0.663]
39 At this level, γ_i is treated as if it were a topic assignment, and is assigned the value of an existing global draw γ_j with probability proportional to the number of document-level draws previously matched to γ_j, or the value of a new global draw from u with probability proportional to α'. [sent-101, score-0.789]
40 Since the internal draws at the document level are treated as topic assignments at the global level, there is a path from every topic assignment to u, via the internal draws. [sent-102, score-1.327]
41 The quantity N^{(j)} is the total number of document-level internal draws matched to global internal draw γ_j. [sent-104, score-0.33]
42 Since some topic assignments will be matched to existing document-level draws, Σ_d δ(N_{t|d} > 0) ≤ N̂_t ≤ N_t, where Σ_d δ(N_{t|d} > 0) is the number of unique documents in Z in which topic t occurs. [sent-105, score-1.088]
43 In other words, as α' grows large relative to the counts N̂_t, the hierarchical, asymmetric Dirichlet prior approaches a symmetric Dirichlet prior with concentration parameter α. [sent-107, score-0.432]
44 Only the value of each topic assignment is known, and hence only the counts N_{t|d} for each topic t and document d. [sent-109, score-1.002]
45 In order to compute the conditional posterior distribution for each topic assignment (needed to resample Z), it is necessary to infer N̂_t for each topic t. [sent-110, score-0.878]
46 Removing z_n^{(d)} = t from the model prior to resampling its value consists of decrementing N_{t|d} and removing its current path to u. [sent-113, score-0.297]
47 Similarly, adding a newly sampled value z_n^{(d)} = t' into the model consists of incrementing N_{t'|d} and sampling a new path from z_n^{(d)} to u. [sent-114, score-0.389]
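Putting these pieces together, the collapsed Gibbs conditional for the configuration the paper later advocates (an asymmetric prior over Θ with a symmetric prior over Φ) can be sketched as below. Nhat_t stands for the inferred counts N̂_t discussed above; how those counts are updated (by sampling paths to u, or by an approximation within the bounds given earlier) is deliberately omitted, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def as_conditional(N_td_row, N_tw_col, N_t, Nhat_t, alpha, alpha_prime, beta, V):
    """Normalized P(z = t | rest) for one token under asymmetric-Theta, symmetric-Phi priors.

    N_td_row: N_{t|d} for the current document (current token removed)
    N_tw_col: N_{w|t} for the current word type across topics (current token removed)
    N_t:      total tokens per topic (current token removed)
    Nhat_t:   counts of document-level internal draws matched to each global draw
    """
    T = len(N_td_row)
    # Global Polya weights obtained by integrating out m (uniform base measure u).
    m_hat = (Nhat_t + alpha_prime / T) / (Nhat_t.sum() + alpha_prime)
    doc_term = N_td_row + alpha * m_hat                 # hierarchical, asymmetric smoothing of N_{t|d}
    word_term = (N_tw_col + beta / V) / (N_t + beta)    # symmetric prior over Phi
    p = doc_term * word_term
    return p / p.sum()
```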
48 4 Comparing Priors for LDA To investigate the effects of the priors over Θ and Φ, we compared the four combinations of symmetric and asymmetric Dirichlets shown in figure 1: symmetric priors over both Θ and Φ (denoted SS), a symmetric prior over Θ and an asymmetric prior over Φ (denoted SA), an asymmetric prior over Θ and a symmetric prior over Φ (denoted AS), and asymmetric priors over both Θ and Φ (denoted AA). [sent-115, score-0.84]
49 (Figure 2 residue: Log Probability vs. Iteration panels for the patent abstracts with 50 topics, plus Frequency histograms of sampled values of α, α', β and β'.) [sent-131, score-0.401]
51 Each combination was used to model three collections of documents: patent abstracts about carbon nanotechnology, New York Times articles, and 20 Newsgroups postings. [sent-152, score-0.36]
52 In order to stress each combination of priors with respect to skewed distributions over word frequencies, stop words were not removed from the patent abstracts. [sent-154, score-0.667]
53 The concentration parameters for each model (denoted by Ω) were given broad Gamma priors and inferred using slice sampling [13]. [sent-157, score-0.391]
54 There are two distinct patterns: models with an asymmetric prior over Θ (AS and AA; red and black, respectively) perform very similarly, while models with a symmetric prior over Θ (SS and SA; blue and green, respectively) also perform similarly, with significantly worse performance than AS and AA. [sent-161, score-0.579]
55 The fully asymmetric model, AA, is inconsistent, matching AS on the patents and 20 Newsgroups but doing poorly on NYT. [sent-165, score-0.365]
56 These results are shown in figure 3a, and exhibit a similar pattern to the results in figure 2a—the best-performing models are those with an asymmetric prior over Θ. [sent-169, score-0.432]
57 As discussed in section 3.2, as α' or β' grows large relative to Σ_t N̂_t or Σ_w N̂_w, an asymmetric Dirichlet prior approaches a symmetric Dirichlet with concentration parameter α or β. [sent-172, score-0.637]
58 (Figure 3b residue: lists of the most probable words for selected topics under asymmetric priors over Θ and Φ, including stop words and carbon-nanotube vocabulary; held-out probabilities reported in nats/token.) [sent-184, score-0.478]
59 Figure 3: (a) Log probability of held-out documents (patent abstracts). [sent-199, score-0.248]
60 (b) αmt values and the most probable words for topics obtained with T = 50. [sent-202, score-0.349]
61 For each model, topics were ranked according to usage and the topics at ranks 1, 5, 10, 20 and 30 are shown. [sent-203, score-0.573]
62 AS and AA are robust to skewed word frequency distributions and tend to sequester stop words in their own topics. [sent-204, score-0.406]
63 In other words, given the inferred values of β', the prior over Φ is effectively a symmetric prior over Φ with concentration parameter β. [sent-206, score-0.432]
64 These results demonstrate that even when the model can use an asymmetric prior over Φ, a symmetric prior gives better performance. [sent-207, score-0.579]
65 The remaining topics are relatively unaffected by stop words. [sent-211, score-0.417]
66 Creating corpus-specific stop word lists is seen as an unpleasant but necessary chore in topic modeling. [sent-212, score-0.66]
67 Also, for many specialized corpora, once standard stop words have been removed, there are still other words, such as “model,” “data,” and “results” in the machine learning literature, that occur with very high probability but are not technically stop words. [sent-213, score-0.46]
68 If LDA cannot handle such words in an appropriate fashion then they must be treated as stop words and removed, despite the fact that they play meaningful semantic roles. [sent-214, score-0.333]
69 The robustness of AS to stop words has implications for HMM-LDA [8] which models stop words using a hidden Markov model and “content” words using LDA, at considerable computational cost. [sent-215, score-0.574]
70 AS achieves the same robustness to stop words much more efficiently. [sent-216, score-0.263]
71 We demonstrate that AS is capable of learning meaningful topics even with no stop word removal. [sent-218, score-0.491]
72 Wallach [19] compared several methods for jointly estimating the maximum-likelihood concentration parameter and asymmetric base measure for a Dirichlet–multinomial model. [sent-224, score-0.48]
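One common way to carry out this kind of estimation is a fixed-point iteration on the Dirichlet–multinomial (Pólya) likelihood, in the style of Minka's update. The sketch below is a generic version of that update for the parameters αm_t given per-document topic counts; it is an assumption-laden illustration, not necessarily the exact method evaluated in the paper, and the names are placeholders.

```python
import numpy as np
from scipy.special import digamma

def optimize_dirichlet(N_td, alpha_m, n_iter=100, tol=1e-6):
    """Fixed-point updates for asymmetric Dirichlet parameters alpha*m.

    N_td:    D x T matrix of topic counts per document
    alpha_m: length-T vector of current parameters (concentration times base measure)
    Iterates toward the maximum of the Dirichlet-multinomial likelihood of the counts.
    """
    N_d = N_td.sum(axis=1)
    for _ in range(n_iter):
        alpha_sum = alpha_m.sum()
        numer = (digamma(N_td + alpha_m) - digamma(alpha_m)).sum(axis=0)
        denom = (digamma(N_d + alpha_sum) - digamma(alpha_sum)).sum()
        new_alpha_m = alpha_m * numer / denom
        if np.abs(new_alpha_m - alpha_m).max() < tol:
            return new_alpha_m
        alpha_m = new_alpha_m
    return alpha_m
```

The concentration parameter and base measure are then recovered as α = Σ_t αm_t and m_t = αm_t / α; tying all components to a single value gives the corresponding update for a symmetric prior.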
73 Table 3: Average VI distances between multiple runs of each model with T = 50 on (left) patent abstracts and (right) 20 newsgroups. [sent-283, score-0.283]
74 Running 5000 Gibbs sampling iterations (including slice sampling of α' and β) for the patent abstracts using fully Bayesian AS with T = 25 took over four hours, while 5000 Gibbs sampling iterations (including hyperparameter optimization) took under 30 minutes. [sent-286, score-0.348]
75 In order to establish that optimizing m is a good approximation to integrating it out, we computed log P(W, Z | Ω) and the log probability of held-out documents for fully Bayesian AS, optimized AS (denoted ASO) and, as a baseline, SS (see table 2). [sent-287, score-0.261]
76 Any set of topic assignments can be thought of as a partition of the corresponding tokens into T topics. [sent-293, score-0.632]
77 In order to measure similarity between two sets of topic assignments Z and Z' for W, we can compute the distance between these partitions using variation of information (VI) [11, 6] (see supplementary material). [sent-294, score-0.58]
78 VI has several attractive properties: it is a proper distance metric, it is invariant to permutations of the topic labels, and it can be computed in O(N + TT') time, i.e., [sent-297, score-0.415]
79 time that is linear in the number of tokens and the product of the numbers of topics in Z and Z'. [sent-299, score-0.38]
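The VI computation itself only needs the T × T' contingency table of co-assignment counts, which is where the O(N + TT') cost comes from; a minimal sketch with illustrative names follows.

```python
import numpy as np

def variation_of_information(z1, z2, T1, T2):
    """VI distance between two token-level topic assignments of the same corpus.

    z1, z2: integer arrays of length N giving each token's topic under each model
    """
    z1, z2 = np.asarray(z1), np.asarray(z2)
    joint = np.zeros((T1, T2))
    np.add.at(joint, (z1, z2), 1.0)        # T1 x T2 contingency table, one pass over tokens
    p = joint / joint.sum()
    p1 = p.sum(axis=1)                     # marginal over topics in Z
    p2 = p.sum(axis=0)                     # marginal over topics in Z'
    nz = p > 0
    h_joint = -np.sum(p[nz] * np.log(p[nz]))
    h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
    h2 = -np.sum(p2[p2 > 0] * np.log(p2[p2 > 0]))
    mutual_info = h1 + h2 - h_joint
    return h1 + h2 - 2.0 * mutual_info     # VI = H(Z) + H(Z') - 2 I(Z; Z')
```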
80 For each model (AS, ASO and SS), we calculated the average VI distance between all 10 unique pairs of topic assignments from the 5 Gibbs runs for that model, giving a measure of within-model consistency. [sent-300, score-0.544]
81 We also calculated the between-model VI distance for each pair of models, averaged over all 25 unique pairs of topic assignments for that pair. [sent-301, score-0.52]
82 6 Effect on Selecting the Number of Topics Selecting the number of topics T is one of the most problematic modeling choices in finite topic modeling. [sent-305, score-0.683]
83 Ideally, if LDA has sufficient topics to model W well, the assignments of tokens to topics should be relatively invariant to an increase in T. [sent-309, score-0.753]
84 For example, if ten topics is sufficient to accurately model the data, then increasing the number of topics to twenty shouldn’t significantly affect inferred topic assignments. [sent-312, score-1.019]
85 Figure 4a shows the average VI distance between topic assignments (for the patent abstracts) inferred by models with T = 25 and models with T ∈ {50, 75, 100}. [sent-314, score-0.713]
86 Figure 4: (a) Topic consistency, measured by average VI (clustering) distance from models with T = 25, shown for 50, 75 and 100 topics. [sent-333, score-1.608]
87 (b) Assignments of tokens (patent abstracts) allocated to the largest topic in a 25 topic model, as T increases. [sent-335, score-0.942]
88 For AS, the topic is relatively intact, even at T = 100: 80% of tokens assigned to the topic at T = 25 are assigned to seven topics. [sent-336, score-1.012]
89 For SS, the topic has been subdivided across many more topics. [sent-337, score-0.415]
90 Models with an asymmetric prior over Θ are much more stable (smaller average VI distances) than SS and SA at 50 topics and remain so as T increases: even at 100 topics, AS has a smaller VI distance to a 25-topic model than SS at 50 topics. [sent-338, score-0.683]
91 Figure 4b provides intuition for this difference: for AS, the tokens assigned to the largest topic at T = 25 remain within a small number of topics as T is increased, while for SS, topic usage is more uniform and increasing T causes the tokens to be divided among many more topics. [sent-339, score-1.394]
92 These results suggest that for AS, new topics effectively “nibble away” at existing topics, rather than splitting them more uniformly. [sent-340, score-0.268]
93 We therefore argue that the risk of using too many topics is lower than the risk of using too few, and that practitioners should be comfortable using larger values of T . [sent-341, score-0.289]
94 The primary assumption underlying topic modeling is that a topic should capture semantically-related word co-occurrences. [sent-344, score-0.904]
95 An asymmetric prior over Φ is therefore a bad idea: the base measure will reflect corpus-wide word usage statistics, and a priori, all topics will exhibit those statistics too. [sent-347, score-0.801]
96 A symmetric prior over Φ only makes a prior statement (determined by the concentration parameter β) about whether topics will have more sparse or more uniform distributions over words, so the topics are free to be as distinct and specialized as is necessary. [sent-348, score-1.008]
97 These assumptions lead naturally to the combination of priors that we have empirically identified as superior: an asymmetric Dirichlet prior over Θ that serves to share commonalities across documents and a symmetric Dirichlet prior over Φ that serves to avoid conflicts between topics. [sent-352, score-0.813]
98 Since these priors can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend them as a new standard for LDA. [sent-353, score-0.244]
99 Analyzing entities and topics in news articles using statistical topic models. [sent-447, score-0.723]
100 Efficient methods for topic model inference on streaming document collections. [sent-491, score-0.569]
wordName wordTfidf (topN-words)
[('topic', 0.415), ('aso', 0.316), ('asymmetric', 0.29), ('topics', 0.268), ('ss', 0.245), ('dirichlet', 0.212), ('zn', 0.177), ('nt', 0.15), ('aa', 0.149), ('stop', 0.149), ('patent', 0.148), ('concentration', 0.143), ('priors', 0.142), ('sa', 0.128), ('document', 0.124), ('symmetric', 0.119), ('lda', 0.118), ('tokens', 0.112), ('abstracts', 0.111), ('assignments', 0.105), ('carbon', 0.101), ('documents', 0.092), ('prior', 0.085), ('znd', 0.082), ('words', 0.081), ('wn', 0.078), ('word', 0.074), ('nyt', 0.072), ('wallach', 0.072), ('draw', 0.071), ('draws', 0.061), ('matched', 0.061), ('partitions', 0.06), ('internal', 0.058), ('dirichlets', 0.057), ('gibbs', 0.057), ('vi', 0.052), ('advocate', 0.049), ('nanotubes', 0.049), ('assignment', 0.048), ('base', 0.047), ('inferred', 0.045), ('newsgroups', 0.045), ('predictive', 0.043), ('patents', 0.043), ('mimno', 0.043), ('proportional', 0.041), ('distributions', 0.04), ('articles', 0.04), ('mt', 0.039), ('gure', 0.039), ('integrating', 0.039), ('usage', 0.037), ('recommend', 0.037), ('resampling', 0.035), ('negligible', 0.035), ('sampling', 0.035), ('nonuniform', 0.035), ('assigned', 0.035), ('hyperparameters', 0.034), ('hierarchical', 0.033), ('skewed', 0.033), ('robustness', 0.033), ('catalyst', 0.033), ('libraries', 0.033), ('nanotube', 0.033), ('pachinko', 0.033), ('fully', 0.032), ('paths', 0.031), ('asuncion', 0.031), ('inference', 0.03), ('optimized', 0.03), ('frequency', 0.029), ('bayesian', 0.029), ('effects', 0.028), ('inconsistent', 0.028), ('mcmc', 0.027), ('advocated', 0.026), ('slice', 0.026), ('runs', 0.024), ('digital', 0.024), ('green', 0.024), ('optimizing', 0.024), ('priori', 0.024), ('twenty', 0.023), ('blei', 0.022), ('hyperparameter', 0.022), ('security', 0.022), ('emission', 0.022), ('material', 0.022), ('treated', 0.022), ('log', 0.022), ('generating', 0.022), ('lists', 0.022), ('consistently', 0.022), ('practitioners', 0.021), ('conjugacy', 0.021), ('allocation', 0.021), ('global', 0.021), ('yes', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
2 0.28336939 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
3 0.20984726 204 nips-2009-Replicated Softmax: an Undirected Topic Model
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
4 0.19628935 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei
Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1
5 0.16637768 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda
Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.
6 0.15527919 96 nips-2009-Filtering Abstract Senses From Image Search Results
7 0.14685234 226 nips-2009-Spatial Normalized Gamma Processes
8 0.12392703 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
9 0.12278223 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
10 0.11571376 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains
11 0.10916465 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
12 0.10584427 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
14 0.10208603 256 nips-2009-Which graphical models are difficult to learn?
15 0.081867747 190 nips-2009-Polynomial Semantic Indexing
16 0.078754708 114 nips-2009-Indian Buffet Processes with Power-law Behavior
17 0.073320508 90 nips-2009-Factor Modeling for Advertisement Targeting
18 0.070217095 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
19 0.068472393 115 nips-2009-Individuation, Identification and Object Discovery
20 0.068041049 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
topicId topicWeight
[(0, -0.182), (1, -0.116), (2, -0.105), (3, -0.23), (4, 0.141), (5, -0.261), (6, -0.035), (7, -0.017), (8, -0.052), (9, 0.24), (10, -0.074), (11, -0.033), (12, 0.066), (13, 0.116), (14, 0.119), (15, -0.024), (16, -0.224), (17, -0.072), (18, -0.033), (19, 0.07), (20, 0.029), (21, 0.058), (22, 0.045), (23, -0.112), (24, -0.067), (25, -0.012), (26, 0.034), (27, -0.135), (28, -0.019), (29, -0.032), (30, 0.042), (31, 0.027), (32, 0.063), (33, 0.003), (34, -0.014), (35, 0.009), (36, -0.013), (37, -0.034), (38, -0.038), (39, -0.023), (40, 0.064), (41, -0.101), (42, -0.037), (43, 0.026), (44, -0.076), (45, -0.11), (46, 0.02), (47, 0.071), (48, -0.047), (49, -0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.97924787 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
2 0.92360353 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
3 0.80056101 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda
Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.
4 0.79811698 204 nips-2009-Replicated Softmax: an Undirected Topic Model
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
5 0.6430977 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
Author: Feng Yan, Ningyi Xu, Yuan Qi
Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1
6 0.60286516 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
7 0.57202441 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
8 0.56951439 226 nips-2009-Spatial Normalized Gamma Processes
9 0.50480253 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
10 0.46901175 90 nips-2009-Factor Modeling for Advertisement Targeting
11 0.46286383 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
12 0.44463307 96 nips-2009-Filtering Abstract Senses From Image Search Results
13 0.43800023 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
14 0.40250552 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
15 0.37093788 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains
16 0.31628415 256 nips-2009-Which graphical models are difficult to learn?
17 0.30895907 190 nips-2009-Polynomial Semantic Indexing
18 0.3077189 114 nips-2009-Indian Buffet Processes with Power-law Behavior
19 0.28539938 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
20 0.27993724 152 nips-2009-Measuring model complexity with the prior predictive
topicId topicWeight
[(7, 0.011), (24, 0.033), (25, 0.066), (27, 0.264), (35, 0.045), (36, 0.074), (39, 0.045), (58, 0.042), (61, 0.028), (71, 0.183), (81, 0.019), (86, 0.077), (91, 0.017)]
simIndex simValue paperId paperTitle
1 0.84733582 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
Author: Cosmin Bejan, Matthew Titsworth, Andrew Hickl, Sanda Harabagiu
Abstract: We present a sequence of unsupervised, nonparametric Bayesian models for clustering complex linguistic objects. In this approach, we consider a potentially infinite number of features and categorical outcomes. We evaluated these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when compared against an existing baseline for this task.
same-paper 2 0.83126289 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
3 0.75419635 63 nips-2009-DUOL: A Double Updating Approach for Online Learning
Author: Peilin Zhao, Steven C. Hoi, Rong Jin
Abstract: In most online learning algorithms, the weights assigned to the misclassified examples (or support vectors) remain unchanged during the entire learning process. This is clearly insufficient since when a new misclassified example is added to the pool of support vectors, we generally expect it to affect the weights for the existing support vectors. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short. Instead of only assigning a fixed weight to the misclassified example received in current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be significantly improved by the proposed online learning method. Encouraging experimental results show that the proposed technique is in general considerably more effective than the state-of-the-art online learning algorithms. 1
4 0.65813786 56 nips-2009-Conditional Neural Fields
Author: Jian Peng, Liefeng Bo, Jinbo Xu
Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.
5 0.65137321 11 nips-2009-A General Projection Property for Distribution Families
Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári
Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1
6 0.65134066 53 nips-2009-Complexity of Decentralized Control: Special Cases
7 0.64873099 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
8 0.64806658 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
9 0.64424127 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
10 0.61815792 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
11 0.60546494 204 nips-2009-Replicated Softmax: an Undirected Topic Model
12 0.58758587 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
13 0.58725804 96 nips-2009-Filtering Abstract Senses From Image Search Results
14 0.58559817 226 nips-2009-Spatial Normalized Gamma Processes
15 0.58114111 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
16 0.57851768 260 nips-2009-Zero-shot Learning with Semantic Output Codes
17 0.5774554 154 nips-2009-Modeling the spacing effect in sequential category learning
18 0.57355791 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
19 0.56771797 206 nips-2009-Riffled Independence for Ranked Data
20 0.56677282 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition