nips nips2009 nips2009-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results show that sparseTMs give better predictive performance with simpler inferred models.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). [sent-6, score-0.272]
2 In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. [sent-9, score-0.717]
3 Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. [sent-10, score-0.645]
4 We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. [sent-11, score-0.109]
5 Each word is assigned to a topic, and is drawn from a distribution over terms associated with that topic. [sent-16, score-0.119]
6 The per-document distributions over topics represent systematic regularities of word use among the documents; the per-topic distributions over terms encode the randomness inherent in observations from the topics. [sent-17, score-0.37]
7 Given a corpus of documents, analysis proceeds by approximating the posterior of the topics and topic proportions. [sent-19, score-0.554]
8 , the distributions over topics associated with each document, and a representation of our uncertainty surrounding observations from each of those components, i. [sent-23, score-0.222]
9 In the HDP for document modeling, the topics are typically assumed drawn from an exchangeable Dirichlet, a Dirichlet for which the components of the vector parameter are equal to the same scalar parameter. [sent-27, score-0.269]
10 The single scalar Dirichlet parameter affects both the sparsity of the topics and smoothness of the word probabilities within them. [sent-34, score-0.416]
11 Globally, posterior inference will prefer more topics because more sparse topics are needed to account for the observed words of the collection. [sent-36, score-0.509]
12 Locally, the per-topic distribution over terms will be less smooth—the posterior distribution has more confidence in its assessment of the per-topic word probabilities—and this results in less smooth document-specific predictive distributions. [sent-37, score-0.232]
13 The goal of this work is to decouple sparsity and smoothness in the HDP. [sent-38, score-0.16]
14 With the sparse topic model (sparseTM), we can fit sparse topics with more smoothing. [sent-39, score-0.527]
15 Rather than placing a prior for the entire vocabulary, we introduce a Bernoulli variable for each term and each topic to determine whether or not the term appears in the topic. [sent-40, score-0.337]
16 Conditioned on these variables, each topic is represented by a multinomial distribution over its subset of the vocabulary, a sparse representation. [sent-41, score-0.315]
17 This prior smoothes only the relevant terms and thus the smoothness and sparsity are controlled through different hyper-parameters. [sent-42, score-0.211]
18 2 Sparse Topic Models Sparse topic models (sparseTMs) aim to separately control the number of terms in a topic, i. [sent-44, score-0.324]
19 Recall that a topic is a pattern of word use, represented as a distribution over the fixed vocabulary of the collection. [sent-49, score-0.434]
20 In order to decouple smoothness and sparsity, we define a topic on a random subset of the vocabulary (giving sparsity), and then model uncertainty of the probabilities on that subset (giving smoothness). [sent-50, score-0.448]
21 For each topic, we introduce a Bernoulli variable for each term in the vocabulary that decides whether the term appears in the topic. [sent-51, score-0.111]
22 A Dirichlet distribution over the topic is defined on a $(V-1)$-simplex, i. [sent-56, score-0.295]
23 In a sparseTM, the idea of imposing sparsity is to use Bernoulli variables to restrict the size of the simplex over which the Dirichlet distribution is defined. [sent-59, score-0.099]
24 In the Bayesian nonparametric setting, the number of topics is not specified in advance or found by model comparison. [sent-66, score-0.214]
25 (a) For each term $v$, $1 \le v \le V$, draw term selector $b_{kv} \sim \text{Bernoulli}(\pi_k)$. [sent-73, score-0.365]
26 (b) Let $b_{V+1} = \mathbb{1}[\sum_{v=1}^{V} b_{kv} = 0]$ and $b_k = [b_{kv}]_{v=1}^{V+1}$. [sent-74, score-0.684]
27 Draw stick lengths α ∼ GEM(λ), which are the global topic proportions. [sent-77, score-0.426]
28 For document d: (a) Draw per-document topic proportions θd ∼ DP(τ, α). [sent-79, score-0.352]
29 Draw word $w_{di} \sim \text{Mult}(\beta_{z_{di}})$. Figure 1 illustrates the sparseTM as a graphical model. [sent-83, score-0.274]
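As an illustration only, the generative process above can be sketched as a truncated simulation. In the sketch below, K is a finite truncation of the unbounded topic collection, the DP draw for θ_d is replaced by a finite Dirichlet with base measure α, and the b_{V+1} device is crudely approximated by forcing at least one selector on; the function and variable names are assumptions, not from the paper.

```python
import numpy as np

def simulate_sparsetm(n_docs, doc_len, V, K, lam=1.0, tau=1.0,
                      gamma=0.1, r=1.0, s=1.0, seed=0):
    """Truncated, illustrative sampler for the sparseTM generative process."""
    rng = np.random.default_rng(seed)

    # Step 1: for each topic, draw the selector proportion, the term selectors,
    # and a Dirichlet topic restricted to the selected terms.
    beta = np.zeros((K, V))
    for k in range(K):
        pi_k = rng.beta(r, s)                 # selector proportion pi_k ~ Beta(r, s)
        b_k = rng.random(V) < pi_k            # b_kv ~ Bernoulli(pi_k)
        if not b_k.any():                     # crude stand-in for the b_{V+1} device
            b_k[rng.integers(V)] = True
        on = np.flatnonzero(b_k)
        beta[k, on] = rng.dirichlet(gamma * np.ones(on.size))  # beta_k ~ Dirichlet(gamma * b_k)

    # Step 2: global topic proportions alpha ~ GEM(lam), truncated at K sticks.
    eta = rng.beta(1.0, lam, size=K)
    alpha = eta * np.concatenate(([1.0], np.cumprod(1.0 - eta[:-1])))
    alpha /= alpha.sum()

    # Steps 3-4: per-document proportions (finite stand-in for DP(tau, alpha)),
    # topic assignments z_di, and words w_di ~ Mult(beta_{z_di}).
    docs = []
    for _ in range(n_docs):
        theta_d = rng.dirichlet(tau * alpha)
        z_d = rng.choice(K, size=doc_len, p=theta_d)
        w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])
        docs.append(w_d)
    return docs, beta, alpha
```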
30 The distinguishing feature of the sparseTM is step 1, which generates the latent topics in such a way that decouples sparsity and smoothness. [sent-86, score-0.33]
31 For each topic $k$ there is a corresponding Beta random variable $\pi_k$ and a set of Bernoulli variables $b_{kv}$, one for each term in the vocabulary. [sent-87, score-0.549]
32 Define the sparsity of the topic as $\text{sparsity}_k \triangleq 1 - \sum_{v=1}^{V} b_{kv} / V$. [sent-88, score-0.637]
33 The conditional distribution of the topic $\beta_k$ given the vocabulary subset $b_k$ is $\text{Dirichlet}(\gamma b_k)$. [sent-91, score-0.84]
34 Thus, topic $k$ is represented by those terms with nonzero $b_{kv}$, and the smoothing is only enforced over these terms through the hyperparameter $\gamma$. [sent-92, score-0.653]
35 Sparsity, which is determined by the pattern of ones in bk , is controlled by the Bernoulli parameter. [sent-93, score-0.469]
36 One nuance is that we introduce $b_{V+1} = \mathbb{1}[\sum_{v=1}^{V} b_{kv} = 0]$. [sent-95, score-0.233]
37 The term $b_{V+1}$ extends the vocabulary to $V+1$ terms, where the $(V+1)$-th term never appears in the documents. [sent-97, score-0.111]
38 We next compute the marginal distribution of $\beta_k$, after integrating out the Bernoullis $b_k$ and their parameter $\pi_k$: $p(\beta_k \mid \gamma, r, s) = \int d\pi_k\, p(\beta_k \mid \gamma, \pi_k)\, p(\pi_k \mid r, s) = \int d\pi_k \sum_{b_k} p(\beta_k \mid \gamma, b_k)\, p(b_k \mid \pi_k)\, p(\pi_k \mid r, s)$. [sent-99, score-0.902]
39 We see that $p(\beta_k \mid \gamma, r, s)$ and $p(\beta_k \mid \gamma, \pi_k)$ are mixtures of Dirichlet distributions, where the mixture components are defined over simplices of different dimensions. [sent-100, score-0.451]
40 In total, there are $2^V$ components; each configuration of the Bernoulli variables $b_k$ specifies one particular component. [sent-101, score-0.451]
41 The stick lengths $\alpha$ come from a Griffiths, Engen, and McCloskey (GEM) distribution [8], which is drawn using the stick-breaking construction [9]: $\eta_k \sim \text{Beta}(1, \lambda)$, $\alpha_k = \eta_k \prod_{j=1}^{k-1} (1 - \eta_j)$, $k \in \{1, 2, \ldots\}$. [sent-107, score-0.131]
42 The stick lengths are used as a base measure in the Dirichlet process prior on the per-document topic proportions, θd ∼ DP(τ, α). [sent-112, score-0.426]
43 Finally, the generative process for the topic assignments z and observed words w is straightforward. [sent-113, score-0.38]
44 3 Approximate posterior inference using collapsed Gibbs sampling Since the posterior inference is intractable in sparseTMs, we turn to a collapsed Gibbs sampling algorithm for posterior inference. [sent-114, score-0.352]
45 In order to do so, we integrate out topic proportions θ, topic distributions β and term selectors b analytically. [sent-115, score-0.688]
46 The latent variables needed by the sampling algorithm are the stick lengths $\alpha$, the Bernoulli parameters $\pi$, and the topic assignments $z$. [sent-116, score-0.546]
47 To sample α and topic assignments z, we use the direct-assignment method, which is based on an analogy to the Chinese restaurant franchise (CRF) [1]. [sent-118, score-0.41]
48 The number of customers in restaurant d (document) eating dish k (topic) is denoted ndk , and nd· denotes the number of customers in restaurant d. [sent-121, score-0.193]
49 The number of tables in restaurant d serving dish k is denoted mdk , md· denotes the number of tables in restaurant d, m·k denotes the number of tables serving dish k, and m·· denotes the total number of tables occupied. [sent-122, score-0.332]
50 The function $n_k^{(v)}$ denotes the number of times that term $v$ has been assigned to topic $k$, while $n_k^{(\cdot)}$ denotes the number of times that all the terms have been assigned to topic $k$. [sent-125, score-0.098]
51 Index u is used to indicate the new topic in the sampling process. [sent-126, score-0.361]
52 Note that direct assignment sampling of α and z is conditioned on π. [sent-127, score-0.135]
53 The crux for sampling stick lengths $\alpha$ and topic assignments $z$ (conditioned on $\pi$) is to compute the conditional density of $w_{di}$ under topic component $k$ given all data items except $w_{di}$: $f_k^{-w_{di}}(w_{di} = v \mid \pi_k) \triangleq p(w_{di} = v \mid \{w_{d'i'}, z_{d'i'} : z_{d'i'} = k,\ d'i' \neq di\}, \pi_k)$. [sent-128, score-1.407]
54 Let $\mathcal{V} = \{1, \ldots, V\}$ be the set of vocabulary terms, $B_k \triangleq \{v \in \mathcal{V} : n_{k,-di}^{(v)} > 0\}$ be the set of terms that have word assignments in topic $k$ after excluding $w_{di}$, and $|B_k|$ be its cardinality. [sent-134, score-0.733]
55 We have the following: $f_k^{-w_{di}}(w_{di} = v \mid \pi_k) \propto (n_{k,-di}^{(v)} + \gamma)\, \mathbb{E}[g_{B_k}(X) \mid \pi_k]$ if $v \in B_k$, and $\propto \gamma \pi_k\, \mathbb{E}[\bar{g}_{B_k}(X) \mid \pi_k]$ otherwise. [sent-136, score-0.092]
56 Further note that $\Gamma(\cdot)$ is the Gamma function and $n_{k,-di}$ describes the corresponding count excluding word $w_{di}$. [sent-138, score-0.293]
57 The central difference between the algorithms for HDP-LDA and the sparseTM is the conditional probability in Equation 6, which depends on the selector variables and selector proportions. [sent-140, score-0.137]
58 We now describe how we sample stick lengths α and topic assignments z. [sent-141, score-0.495]
59 Although α is an infinite-length vector, the number of topics K is finite at every point in the sampling process. [sent-144, score-0.258]
60 If a new topic $k_{\text{new}}$ is sampled, then sample $\kappa \sim \text{Beta}(1, \lambda)$, and let $\alpha_{k_{\text{new}}} = \kappa \alpha_u$ and $\alpha_{u_{\text{new}}} = (1 - \kappa)\alpha_u$. [sent-156, score-0.338]
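A minimal sketch of this stick-splitting step; the function name and the list representation (with alpha[-1] playing the role of the unrepresented mass α_u) are illustrative assumptions.

```python
import numpy as np

def split_stick(alpha, lam, rng=None):
    """Split the unrepresented stick mass alpha_u = alpha[-1] when a new
    topic is instantiated: kappa ~ Beta(1, lam), alpha_new = kappa * alpha_u,
    and the remainder (1 - kappa) * alpha_u becomes the new alpha_u.
    alpha is a plain list of stick lengths."""
    rng = np.random.default_rng() if rng is None else rng
    kappa = rng.beta(1.0, lam)
    alpha_u = alpha[-1]
    return alpha[:-1] + [kappa * alpha_u, (1.0 - kappa) * alpha_u]
```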
61 Another sampling strategy is to sample b (by integrating out π) and the Gibbs sampler is much easier to derive. [sent-158, score-0.131]
62 However, conditioned on b, sampling z will be constrained to a smaller set of topics (specified by the values of b), which slows down convergence of the sampler. [sent-159, score-0.298]
63 To sample πk , we use bk as an auxiliary variable. [sent-162, score-0.473]
64 Recall Bk is the set of terms that have word assignments in topic k. [sent-164, score-0.441]
65 (This time, we don’t need to exclude certain words since we are sampling $\pi$.) [sent-165, score-0.104]
66 Using this joint conditional distribution, we iteratively sample $b_k$ conditioned on $\pi_k$ and $\pi_k$ conditioned on $b_k$ to ultimately obtain a sample from $\pi_k$. [sent-172, score-1.069]
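The π_k side of this alternation is a standard Beta-Bernoulli conjugate update; a sketch follows, assuming the Beta(r, s) prior on π_k named in the paper. The b_k side, which requires the Dirichlet-multinomial evidence term derived in the paper, is not reproduced here.

```python
import numpy as np

def sample_pi_given_b(b_k, r, s, rng=None):
    """Conjugate update pi_k | b_k ~ Beta(r + #on, s + #off) under a
    Beta(r, s) prior; b_k is a 0/1 selector vector for topic k."""
    rng = np.random.default_rng() if rng is None else rng
    on = int(np.sum(b_k))
    return rng.beta(r + on, s + len(b_k) - on)
```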
67 Finally, with any single sample we can estimate the topic distributions $\beta$ from the topic assignments $z$ and term selectors $b$ by $\hat{\beta}_{k,v} = \frac{n_k^{(v)} + b_{k,v}\gamma}{n_k^{(\cdot)} + \sum_v b_{k,v}\gamma}$ (11), where we can smooth only those terms that are chosen to be in the topics. [sent-178, score-1.246]
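A direct implementation of Equation 11 from a single Gibbs sample might look as follows; the array names are assumptions, with n_kv holding the K-by-V word-topic counts and b_kv the sampled 0/1 selectors.

```python
import numpy as np

def estimate_beta(n_kv, b_kv, gamma):
    """Equation 11: beta_hat[k, v] = (n_k^(v) + b_kv * gamma) /
    (n_k^(.) + gamma * sum_v b_kv); smoothing touches only selected terms."""
    num = n_kv + gamma * b_kv
    return num / num.sum(axis=1, keepdims=True)
```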
68 4 Experiments In this section, we studied the performance of the sparseTM on four datasets and demonstrated how sparseTM decouples the smoothness and sparsity in the HDP. [sent-180, score-0.168]
69 The sparsity proportion prior was a uniform Beta, i.e., $r = s = 1$. [sent-182, score-0.095]
70 It has 2873 unique terms, around 128K observed words and an average of 36 unique terms per document. [sent-192, score-0.157]
71 We note that an improved algorithm might be achieved by modeling the joint conditional distribution of $\pi_k$ and $\sum_v b_{kv}$ instead, i.e., [sent-194, score-0.258]
72 $p(\pi_k, \sum_v b_{kv} \mid \text{rest})$, since sampling $\pi_k$ only depends on $\sum_v b_{kv}$. [sent-196, score-0.532]
73 It has 2944 unique terms, around 179K observed words and an average of 52 unique terms per document. [sent-202, score-0.157]
74 We randomly sample 20% of the words for each paper and this leads to an average of 150 unique terms per document. [sent-209, score-0.144]
75 The Conf. abstracts data set contains abstracts (including papers and posters) from six international conferences: CIKM, ICML, KDD, NIPS, SIGIR and WWW (http://www. [sent-212, score-0.112]
76 It has 3733 unique terms, around 173K observed words and an average of 46 unique terms per document. [sent-217, score-0.157]
77 Conditioned on these samples, we run the Gibbs sampler for test documents to estimate the predictive quantities of interest. [sent-223, score-0.137]
78 First, we examine overall predictive power with the predictive perplexity of the test set given the training set. [sent-226, score-0.227]
79 The predictive perplexity is $\text{perplexity}_{\text{pw}} = \exp\left(-\frac{\sum_{d \in D_{\text{test}}} \log p(w_d \mid D_{\text{train}})}{\sum_{d \in D_{\text{test}}} N_d}\right)$. [sent-228, score-0.164]
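The per-word predictive perplexity above reduces to a few lines once the per-document held-out log-likelihoods are available (here assumed to be precomputed, e.g. by the test-document Gibbs runs mentioned earlier).

```python
import numpy as np

def predictive_perplexity(log_p_test_docs, test_doc_lengths):
    """exp(-sum_d log p(w_d | D_train) / sum_d N_d) over the test documents."""
    return float(np.exp(-np.sum(log_p_test_docs) / np.sum(test_doc_lengths)))
```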
80 Here we study its posterior distribution with the desideratum that between two equally good predictive distributions, a simpler model—or a posterior peaked at a simpler model—is preferred. [sent-235, score-0.209]
81 Recall that each Gibbs sample contains a topic assignment z for every observed word in the corpus (see Equation 9). [sent-237, score-0.435]
82 The topic complexity is the number of unique terms that have at least one word assigned to the topic. [sent-238, score-0.478]
83 This can be expressed as a sum of indicators, $\text{complexity}_k = \sum_v \mathbb{1}\left[\left(\sum_{d,n} \mathbb{1}[z_{d,n} = k,\, w_{d,n} = v]\right) > 0\right]$, where recall that $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. [sent-239, score-0.033]
84 Note a topic with no words assigned to it has complexity zero. [sent-240, score-0.382]
85 For a particular Gibbs sample, the model complexity is the sum of the topic complexities and the number of topics. [sent-241, score-0.343]
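Computing these complexity statistics from one Gibbs sample is straightforward; the sketch below assumes flat NumPy arrays z and w holding the topic assignment and term index of every word token, and counts a topic toward the model complexity only if it has at least one word assigned (a topic with no words has complexity zero, as noted above).

```python
import numpy as np

def model_complexity(z, w, n_topics):
    """Topic complexity = number of unique terms assigned to the topic;
    model complexity = sum of topic complexities + number of used topics."""
    topic_complexity = np.array(
        [np.unique(w[z == k]).size for k in range(n_topics)])
    n_used = int((topic_complexity > 0).sum())
    return topic_complexity, int(topic_complexity.sum()) + n_used
```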
86 We performed posterior inference with the sparseTM and HDP-LDA, computing predictive perplexity and average model complexity with 5-fold cross validation. [sent-243, score-0.26]
87 Figure 2 (first row) shows the model complexity versus predictive perplexity for each fold: red circles represent sparseTM, blue squares represent HDP-LDA, and the dashed line connecting a red circle and blue square indicates that the two are from the same fold. [sent-246, score-0.193]
88 These results show that the sparseTM achieves better perplexity than HDP-LDA, with simpler models. [sent-247, score-0.126]
89 [Figure 2, first row: scatter plots of predictive perplexity for HDP-LDA vs. STM on the arXiv, Nematode Biology, NIPS and Conf. abstracts datasets.] [sent-262, score-1.132]
90 Figure 2: Experimental results for sparseTM (shortened as STM in this figure) and HDP-LDA on four datasets. [sent-263, score-0.157]
91 The scatter plots of model complexity versus predictive perplexity for 5-fold cross validation: Red circles represent the results from sparseTM, blue squares represent the results from HDP-LDA and the dashed lines connect results from the same fold. [sent-265, score-0.193]
92 Hyperparameter γ, number of topics and number of terms per topic. [sent-272, score-0.241]
93 Figure 2 (from the second to fourth rows) shows the Dirichlet parameter γ and posterior number of topics for HDP-LDA and sparseTM. [sent-273, score-0.24]
94 The numbers of terms per topic for the two models don’t have a consistent trend, but they don’t differ too much either. [sent-276, score-0.02]
95 For the NIPS data set, we provide some example topics (with top 15 terms) discovered by HDP-LDA and sparseTM in Table 1. [sent-278, score-0.192]
96 In the Gibbs sampler, if the HDP-LDA posterior requires more topics to explain the data, it will reduce the value of γ to accommodate the increased (necessary) sparseness. [sent-283, score-0.24]
97 This smaller γ, however, leads to less smooth topics that are less robust to “noise”, i. [sent-284, score-0.214]
98 The process is circular: To explain the noisy words, the Gibbs sampler might invoke new topics still, thereby further reducing the hyperparameter. [sent-287, score-0.235]
99 As a result of this interplay, HDP-LDA settles on more topics and a smaller γ. [sent-288, score-0.192]
100 For the sparseTM, however, more topics can be used to explain the data by using the sparsity control gained from the “spike” component of the prior. [sent-290, score-0.268]
wordName wordTfidf (topN-words)
[('bk', 0.451), ('sparsetm', 0.432), ('topic', 0.295), ('stm', 0.283), ('bkv', 0.233), ('hdp', 0.224), ('wdi', 0.204), ('topics', 0.192), ('dirichlet', 0.155), ('zdi', 0.146), ('lda', 0.126), ('ak', 0.123), ('bernoulli', 0.102), ('perplexity', 0.101), ('nk', 0.098), ('gbk', 0.083), ('sparsetms', 0.083), ('sparsity', 0.076), ('stick', 0.07), ('word', 0.07), ('vocabulary', 0.069), ('sampling', 0.066), ('di', 0.066), ('predictive', 0.063), ('lengths', 0.061), ('selector', 0.056), ('abstracts', 0.056), ('smoothness', 0.055), ('gibbs', 0.053), ('posterior', 0.048), ('assignments', 0.047), ('restaurant', 0.046), ('slab', 0.044), ('sampler', 0.043), ('conditioned', 0.04), ('words', 0.038), ('hyperparameter', 0.038), ('decouples', 0.037), ('bv', 0.035), ('gem', 0.035), ('supplement', 0.035), ('unique', 0.035), ('beta', 0.034), ('dish', 0.034), ('draw', 0.034), ('complexityk', 0.033), ('nematode', 0.033), ('smoothes', 0.033), ('sparsityk', 0.033), ('document', 0.033), ('spike', 0.032), ('documents', 0.031), ('tables', 0.031), ('distributions', 0.03), ('blei', 0.03), ('terms', 0.029), ('assignment', 0.029), ('decouple', 0.029), ('dtest', 0.029), ('smoothing', 0.029), ('complexity', 0.029), ('ndk', 0.027), ('wd', 0.027), ('fk', 0.026), ('conditional', 0.025), ('latent', 0.025), ('simpler', 0.025), ('proportions', 0.024), ('mult', 0.024), ('serving', 0.024), ('zd', 0.024), ('simplex', 0.023), ('scalar', 0.023), ('box', 0.023), ('integrate', 0.023), ('nonparametric', 0.022), ('smooth', 0.022), ('bank', 0.022), ('sample', 0.022), ('term', 0.021), ('exchangeable', 0.021), ('knew', 0.021), ('gamma', 0.021), ('per', 0.02), ('customers', 0.02), ('zk', 0.02), ('sparse', 0.02), ('assigned', 0.02), ('collapsed', 0.019), ('bayesian', 0.019), ('inference', 0.019), ('regularities', 0.019), ('complexities', 0.019), ('excluding', 0.019), ('hierarchical', 0.019), ('corpus', 0.019), ('proportion', 0.019), ('controlled', 0.018), ('ultimately', 0.018), ('crf', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
2 0.28336939 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
3 0.18972979 204 nips-2009-Replicated Softmax: an Undirected Topic Model
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
4 0.14813545 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei
Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1
5 0.13510877 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
Author: Peter Carbonetto, Matthew King, Firas Hamze
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1
6 0.13278192 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
7 0.12463949 96 nips-2009-Filtering Abstract Senses From Image Search Results
8 0.12221934 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
9 0.11944861 226 nips-2009-Spatial Normalized Gamma Processes
10 0.10629134 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
11 0.10535379 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
12 0.10180919 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
13 0.085952088 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
14 0.078634895 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
15 0.072878875 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
16 0.06908454 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
17 0.067744963 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
18 0.06108579 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis
19 0.060043041 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
20 0.052859191 157 nips-2009-Multi-Label Prediction via Compressed Sensing
topicId topicWeight
[(0, -0.154), (1, -0.101), (2, -0.074), (3, -0.195), (4, 0.138), (5, -0.248), (6, -0.01), (7, 0.012), (8, -0.029), (9, 0.235), (10, -0.056), (11, -0.014), (12, 0.014), (13, 0.112), (14, 0.056), (15, -0.028), (16, -0.203), (17, -0.12), (18, -0.016), (19, 0.071), (20, 0.02), (21, 0.015), (22, 0.035), (23, -0.097), (24, -0.065), (25, -0.041), (26, 0.032), (27, -0.114), (28, -0.022), (29, -0.004), (30, 0.054), (31, 0.038), (32, -0.001), (33, 0.057), (34, 0.06), (35, 0.07), (36, -0.005), (37, 0.027), (38, 0.031), (39, 0.036), (40, 0.023), (41, -0.046), (42, -0.044), (43, -0.033), (44, -0.092), (45, -0.032), (46, 0.05), (47, 0.008), (48, -0.117), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.96664637 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
2 0.90997976 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
3 0.80237085 204 nips-2009-Replicated Softmax: an Undirected Topic Model
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
4 0.79145151 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda
Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.
5 0.6991275 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
Author: Feng Yan, Ningyi Xu, Yuan Qi
Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1
7 0.5288862 226 nips-2009-Spatial Normalized Gamma Processes
8 0.51643533 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
9 0.50800067 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
10 0.48824686 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
11 0.46130598 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
12 0.44077787 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
13 0.38632715 96 nips-2009-Filtering Abstract Senses From Image Search Results
14 0.30678535 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
15 0.29616565 114 nips-2009-Indian Buffet Processes with Power-law Behavior
16 0.29557794 90 nips-2009-Factor Modeling for Advertisement Targeting
17 0.28515026 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
18 0.26721224 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis
19 0.25721672 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
20 0.25145102 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
topicId topicWeight
[(24, 0.032), (25, 0.061), (31, 0.033), (35, 0.05), (36, 0.075), (39, 0.037), (46, 0.264), (58, 0.058), (66, 0.016), (71, 0.17), (81, 0.025), (86, 0.062), (91, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.82066727 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
2 0.74959069 96 nips-2009-Filtering Abstract Senses From Image Search Results
Author: Kate Saenko, Trevor Darrell
Abstract: We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to find images on the web that are associated with the object name and train a visual classifier from the search result. As words are generally polysemous, this approach can lead to relatively noisy models if many examples due to outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classifier to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difficult to distinguish whether an image cluster relates to a desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model does not require any human supervision, and takes as input only the name of an object category. We show results of retrieving concrete-sense images in two available multimodal, multi-sense databases, as well as experiment with object classifiers trained on concrete-sense images returned by our method for a set of ten common office objects. 1
3 0.74868524 48 nips-2009-Bootstrapping from Game Tree Search
Author: Joel Veness, David Silver, Alan Blair, William W. Cohen
Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play. 1
4 0.63465685 56 nips-2009-Conditional Neural Fields
Author: Jian Peng, Liefeng Bo, Jinbo Xu
Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.
5 0.63207996 11 nips-2009-A General Projection Property for Distribution Families
Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári
Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1
6 0.62490237 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
7 0.62373936 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
8 0.62210602 53 nips-2009-Complexity of Decentralized Control: Special Cases
9 0.62020952 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
10 0.60753828 205 nips-2009-Rethinking LDA: Why Priors Matter
11 0.58487201 204 nips-2009-Replicated Softmax: an Undirected Topic Model
12 0.57911384 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
13 0.57518727 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
14 0.57131004 226 nips-2009-Spatial Normalized Gamma Processes
15 0.5688796 260 nips-2009-Zero-shot Learning with Semantic Output Codes
16 0.56225389 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
17 0.56068069 154 nips-2009-Modeling the spacing effect in sequential category learning
18 0.55790836 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
19 0.55780017 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
20 0.55749953 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals