nips nips2007 nips2007-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. [sent-13, score-0.96]
2 We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. [sent-14, score-0.513]
3 1 Introduction Many applications of graphical models have traditionally dealt with discrete state spaces, where each variable is multinomial distributed given its parents [1]. [sent-16, score-0.091]
4 Without strong prior knowledge on the structure of dependencies between variables and their parents, the typical Bayesian prior over parameters has been the Dirichlet distribution. [sent-17, score-0.049]
5 This is because the Dirichlet prior is conjugate to the multinomial, leading to simple and efficient computations for both the posterior over parameters and the marginal likelihood of data. [sent-18, score-0.061]
6 When there are latent or unobserved variables, the variational Bayesian approach to posterior estimation, where the latent variables are assumed independent from the parameters, has proven successful [2]. [sent-19, score-0.54]
7 In recent years there has been a proliferation of graphical models composed of a multitude of multinomial and Dirichlet variables interacting in various inventive ways. [sent-20, score-0.11]
8 The major classes include the latent Dirichlet allocation (LDA) [3] and many other topic models inspired by LDA, and the hierarchical Dirichlet process (HDP) [4] and many other nonparametric models based on the Dirichlet process (DP). [sent-21, score-0.22]
9 LDA pioneered the use of Dirichlet distributed latent variables to represent shades of membership to different clusters or topics, while the HDP pioneered the use of nonparametric models to sidestep the need for model selection. [sent-22, score-0.226]
10 For these Dirichlet-multinomial models the inference method of choice is typically collapsed Gibbs sampling, due to its simplicity, speed, and good predictive performance on test sets. [sent-23, score-0.198]
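For concreteness, the collapsed Gibbs sampler referred to here resamples each topic assignment with the parameters θ and φ integrated out. The sketch below is a generic illustration for LDA with symmetric hyperparameters; the count arrays, argument names and the symmetric α, β are assumptions for the example, not code from the paper.

```python
import numpy as np

def collapsed_gibbs_sweep(z, docs, n_dk, n_kw, n_k, alpha, beta, W, rng):
    """One sweep of collapsed Gibbs sampling for LDA.

    z[d][i] is the topic of token i in document d; n_dk, n_kw, n_k are the usual
    document-topic, topic-word and topic count arrays with theta, phi integrated out.
    """
    K = n_k.shape[0]
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # remove this token's assignment from the counts
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # p(z_di = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + W * beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + W * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # restore the counts under the new assignment
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
```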
11 For LDA and its cousins, there are alternatives based on variational Bayesian (VB) approximations [3] and on expectation propagation (EP) [5]. [sent-26, score-0.394]
12 [7] addressed these issues by proposing an improved VB approximation based on the idea of collapsing, that is, integrating out the parameters while assuming that other latent variables are independent. [sent-28, score-0.159]
13 As for nonparametric models, a number of VB approximations have been proposed for DP mixture models [8, 9], while to our knowledge none has been proposed for the HDP thus far ([10] derived a VB inference for the HDP, but dealt only with point estimates for higher level parameters). [sent-29, score-0.137]
14 In this paper we investigate a new VB approach to inference for the class of Dirichlet-multinomial models. [sent-30, score-0.043]
15 To be concrete we focus our attention on an application of the HDP to topic modeling [4], though the approach is more generally applicable. [sent-31, score-0.099]
16 Our approach is an extension of the collapsed VB approximation for LDA (CV-LDA) presented in [7], and represents the first VB approximation to the HDP1 . [sent-32, score-0.205]
17 The advantage of CV-HDP over CV-LDA is that the optimal number of variational components is not finite. [sent-34, score-0.341]
18 Ours is also the first variational algorithm to treat full posterior distributions over the hyperparameters of Dirichlet variables, and we show experimentally that this results in significant improvements in both the variational bound and test-set likelihood. [sent-36, score-0.833]
19 2 A Nonparametric Hierarchical Bayesian Topic Model We consider a document model where each document in a corpus is modelled as a mixture over topics, and each topic is a distribution over words in the vocabulary. [sent-38, score-0.274]
20 For each topic k, let φk be a vector of probabilities for words in that topic. [sent-44, score-0.126]
21 Words in each document are drawn as follows: first choose a topic k with probability θdk , then choose a word w with probability φkw . [sent-45, score-0.187]
22 Let xid be the ith word token in document d, and zid its chosen topic. [sent-46, score-0.593]
23 If the number of topics K is finite and fixed, the above model is LDA. [sent-48, score-0.196]
24 As we usually do not know the number of topics a priori, and would like a model that can determine this automatically, we consider a nonparametric extension based upon the HDP [4]. [sent-49, score-0.234]
25 Specifically, we have a countably infinite number of topics (thus θd and π are infinite-dimensional vectors), and we use a stick-breaking representation [11] for π: πk = π̃k ∏_{l=1}^{k−1} (1 − π̃l), with π̃k | γ ∼ Beta(1, γ) for k = 1, 2, . . . [sent-50, score-0.196]
26 Finally, in addition to the prior over π, we place priors over the other hyperparameters α, β, γ and τ of the model as well, α ∼ Gamma(aα , bα ) β ∼ Gamma(aβ , bβ ) γ ∼ Gamma(aγ , bγ ) τ ∼ Dir(aτ ) (4) The full model is shown graphically in Figure 1(left). [sent-54, score-0.064]
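As a concrete illustration of the generative process just described (the stick-breaking prior on π, the Dirichlet draws for θd and φk, and the Gamma/Dirichlet hyperpriors), the sketch below forward-simulates the model. It is illustrative only: the truncation level K_max, the renormalisation of the truncated stick, the shape/rate reading of Gamma(a, b), and all variable names are assumptions, since the model itself is over a countably infinite set of topics.

```python
import numpy as np

def simulate_corpus(D, n_d, W, a_alpha=2., b_alpha=2., a_beta=2., b_beta=2.,
                    a_gamma=5., b_gamma=5., K_max=200, seed=0):
    """Forward-simulate the HDP topic model of Section 2, truncated at K_max topics."""
    rng = np.random.default_rng(seed)
    alpha = rng.gamma(a_alpha, 1. / b_alpha)     # Gamma(a, b) read as shape/rate (assumption)
    beta = rng.gamma(a_beta, 1. / b_beta)
    gamma = rng.gamma(a_gamma, 1. / b_gamma)
    tau = np.full(W, 1. / W)                     # symmetric base measure over words
    # stick-breaking: pi_k = pi~_k * prod_{l<k} (1 - pi~_l), pi~_k ~ Beta(1, gamma)
    v = rng.beta(1., gamma, size=K_max)
    pi = v * np.concatenate(([1.], np.cumprod(1. - v)[:-1]))
    pi /= pi.sum()                               # fold leftover stick mass into the truncation
    phi = rng.dirichlet(beta * tau, size=K_max)  # phi_k ~ Dir(beta * tau), topic-word probabilities
    docs, z = [], []
    for d in range(D):
        theta_d = rng.dirichlet(alpha * pi)      # theta_d ~ Dir(alpha * pi), document-topic probabilities
        z_d = rng.choice(K_max, size=n_d, p=theta_d)            # z_id: topic of each token
        x_d = np.array([rng.choice(W, p=phi[k]) for k in z_d])  # x_id: word of each token
        docs.append(x_d)
        z.append(z_d)
    return docs, z, (alpha, beta, gamma, pi, phi)
```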
27 1 In this paper, by HDP we shall mean the two level HDP topic model in Section 2. [sent-55, score-0.138]
28 We do not claim to have derived a VB inference for the general HDP in [4], which is significantly more difficult; see final discussions. [sent-56, score-0.043]
29 Right: Factor graph of the model with auxiliary variables. [sent-76, score-0.125]
30 3 Collapsed Variational Bayesian Inference for HDP There is substantial empirical evidence that marginalizing out variables is helpful for efficient inference. [sent-77, score-0.049]
31 For instance, in [12] it was observed that Gibbs sampling enjoys better mixing, while in [7] it was shown that variational inference is more accurate in this collapsed space. [sent-78, score-0.57]
32 In the following we will build on this experience and propose a collapsed variational inference algorithm for the HDP, based upon first replacing the parameters with auxiliary variables, then effectively collapsing out the auxiliary variables variationally. [sent-79, score-0.869]
33 The algorithm is fully Bayesian in the sense that all parameter posteriors are treated exactly and full posterior distributions are maintained for all hyperparameters. [sent-80, score-0.104]
34 The only assumptions made are independencies among the latent topic variables and hyperparameters, and that there is a finite upper bound on the number of topics used (which is found automatically). [sent-81, score-0.423]
35 1 Replacing parameters with auxiliary variables In order to obtain efficient variational updates, we shall replace the parameters θ = {θd } and φ = {φk } with auxiliary variables. [sent-87, score-0.679]
36 Thus we introduce four sets of auxiliary variables: ηd and ξk taking values in [0, 1], and sdk and tkw taking integral values. [sent-90, score-0.408]
37 The main insight is that conditioned on z and x the auxiliary variables are independent and have well-known distributions. [sent-93, score-0.174]
38 Specifically, ηd and ξk are Beta distributed, while sdk (respectively tkw ) is the random number of occupied tables in a Chinese restaurant process with ndk· (respectively n·kw ) customers and a strength parameter of απk (respectively βτw ) [13, 4]. [sent-94, score-0.328]
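As a quick illustrative check of the claim about sdk and tkw (not taken from the paper), the number of occupied tables in a Chinese restaurant process with strength a can be simulated directly: after i customers have been seated, the next customer opens a new table with probability a/(a + i).

```python
import numpy as np

def sample_num_tables(n, a, rng):
    """Number of occupied tables after seating n customers in a CRP with strength a."""
    tables = 0
    for i in range(n):                      # the (i+1)-th customer arrives, i already seated
        if rng.random() < a / (a + i):      # probability of opening a new table
            tables += 1
    return tables

# e.g. the distribution of s_dk when n_dk. = 50 customers and the strength alpha * pi_k = 0.7
rng = np.random.default_rng(1)
draws = [sample_num_tables(50, 0.7, rng) for _ in range(10000)]
```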
39 [7] showed that modelling exactly the dependence of a set of variables on another set is equivalent to integrating out the first set. [sent-97, score-0.07]
40 Thus we can interpret (7) as integrating out the auxiliary variables with respect to z. [sent-98, score-0.174]
41 Given the above factorization, q(π) further factorizes so that the π̃k's are independent, as does the posterior over the auxiliary variables. [sent-99, score-0.185]
42 Our truncation approximation is nested like that in [9], and unlike that in [8]. [sent-104, score-0.088]
43 We shall treat K as a parameter of the variational approximation, possibly optimized by iteratively splitting or merging topics (though we have not explored these in this paper; see discussion section). [sent-106, score-0.642]
44 As in [9], we reordered the topic labels such that E[n·1· ] > E[n·2· ] > · · · . [sent-107, score-0.099]
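A minimal sketch of such a relabelling step (the helper and its arguments are assumptions for illustration): sort the topics by their expected total counts and permute every per-topic quantity with the same ordering.

```python
import numpy as np

def reorder_topics(E_nk, per_topic_arrays, axis=0):
    """Relabel topics so that E[n_.k.] is non-increasing in k, as described above."""
    order = np.argsort(-E_nk)                                    # topics sorted by expected usage
    return E_nk[order], [np.take(a, order, axis=axis) for a in per_topic_arrays]
```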
45 An expression for the variational bound on the marginal log-likelihood is given in appendix A. [sent-108, score-0.386]
46 3 Variational Updates In this section we shall derive the complete set of variational updates for the system. [sent-110, score-0.419]
47 We shall also employ index summation shorthands: · sums out that index, while >l sums over i where i > l. [sent-113, score-0.113]
48 Updates for the hyperparameters are derived using the standard fully factorized variational approach, since they are assumed independent from each other and from other variables. [sent-115, score-0.405]
49 Note also that the geometric expectations factorize: G[απk] = G[α] G[πk], G[βτw] = G[β] G[τw], and G[πk] = G[π̃k] ∏_{l=1}^{k−1} G[1 − π̃l]. [sent-117, score-0.083]
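To make this concrete, here is a sketch of how the hyperparameter posteriors and the geometric expectations might be computed, assuming q(α) is a Gamma distribution in shape/rate form and q(π̃k) = Beta(ak, bk). The update for q(α) below mirrors the standard auxiliary-variable scheme for DP concentration parameters; the exact form used in the paper may differ in detail, so all names and signatures are illustrative.

```python
import numpy as np
from scipy.special import digamma

def update_q_alpha(a0, b0, E_s_dk, E_log_eta_d):
    """Variational Gamma(shape, rate) posterior for alpha from the expected table counts
    E[s_dk] and the expectations E[log eta_d] (auxiliary-variable scheme; assumption)."""
    return a0 + E_s_dk.sum(), b0 - E_log_eta_d.sum()

def geometric_expectations(a_alpha, b_alpha, a_pi, b_pi):
    """G[alpha], G[pi_k] and G[alpha*pi_k] under factorized Gamma/Beta posteriors.

    q(alpha) = Gamma(a_alpha, rate=b_alpha): E[log alpha] = digamma(a_alpha) - log(b_alpha).
    q(pi~_k) = Beta(a_pi[k], b_pi[k]):       E[log pi~_k] = digamma(a_pi[k]) - digamma(a_pi[k] + b_pi[k]).
    """
    G_alpha = np.exp(digamma(a_alpha) - np.log(b_alpha))
    E_log_v = digamma(a_pi) - digamma(a_pi + b_pi)          # E[log pi~_k]
    E_log_1mv = digamma(b_pi) - digamma(a_pi + b_pi)        # E[log (1 - pi~_k)]
    # G[pi_k] = G[pi~_k] * prod_{l<k} G[1 - pi~_l], the stick-breaking factorization above
    E_log_pi = E_log_v + np.concatenate(([0.], np.cumsum(E_log_1mv)[:-1]))
    G_pi = np.exp(E_log_pi)
    return G_alpha, G_pi, G_alpha * G_pi                     # G[alpha*pi_k] = G[alpha] * G[pi_k]
```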
50 The variational posteriors for the auxiliary variables depend on z through the counts ndkw . [sent-119, score-0.615]
51 If ndk· = 0 then q(sdk = 0) = 1 otherwise q(sdk ) > 0 only if 1 ≤ sdk ≤ ndk· . [sent-121, score-0.157]
52 The posteriors are: q(ηd | z) ∝ ηd^{E[α]−1} (1 − ηd)^{nd··−1},  q(ξk | z) ∝ ξk^{E[β]−1} (1 − ξk)^{n·k·−1},  q(sdk = m | z) ∝ [ndk·^m] (G[απk])^m,  q(tkw = m | z) ∝ [n·kw^m] (G[βτw])^m  (9)
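The bracketed factors [n^m] in (9) can be read as unsigned Stirling numbers of the first kind, which is how table counts in a Chinese restaurant process are distributed. The sketch below normalises q(sdk = m) that way; it is only illustrative (a practical implementation would work in log space, since these numbers overflow quickly).

```python
import numpy as np

def stirling_first_kind(n):
    """Table of unsigned Stirling numbers of the first kind, S[i, m] for 0 <= i, m <= n."""
    S = np.zeros((n + 1, n + 1))
    S[0, 0] = 1.0
    for i in range(1, n + 1):
        for m in range(1, i + 1):
            # recursion: S(i, m) = S(i-1, m-1) + (i-1) * S(i-1, m)
            S[i, m] = S[i - 1, m - 1] + (i - 1) * S[i - 1, m]
    return S

def table_count_posterior(n, a):
    """q(s = m) ∝ S(n, m) * a**m for m = 1..n, normalized; q(s = 0) = 1 when n = 0."""
    if n == 0:
        return np.array([1.0])
    S = stirling_first_kind(n)
    w = S[n, 1:n + 1] * a ** np.arange(1, n + 1)
    return w / w.sum()
```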
53 For ηd this is E[log ηd ] = Ψ(E[α]) − Ψ(E[α] + nd·· ) where nd·· is the (fixed) number of words in document d. [sent-124, score-0.088]
54 For the other auxiliary variables these expectations depend on counts which can take on many values and a naïve computation can be expensive. [sent-125, score-0.231]
55 We derive computationally tractable approximations based upon an improvement to the second-order approximation in [7]. [sent-126, score-0.08]
56 [7] tackled a similar problem with log instead of Ψ using a second order Taylor expansion to log. [sent-130, score-0.057]
57 Unfortunately such an approximation failed to work in our case as the digamma function Ψ(y) diverges much more quickly than log y at y = 0. [sent-131, score-0.12]
58 Our solution is to treat the case n·k· = 0 exactly, and apply the second-order approximation when n·k· > 0. [sent-132, score-0.048]
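A generic sketch of that scheme (not the paper's exact equations): treat the count as a sum of independent Bernoulli indicators under q(z), handle the zero-count case exactly, and apply a second-order Taylor expansion of the function around the conditional mean otherwise. The helper, its arguments, and the example usage are assumptions for illustration.

```python
import numpy as np
from scipy.special import polygamma

def expected_f_of_count(p, f, f2, f_at_zero=0.0):
    """Approximate E[f(n)] where n = sum_i b_i and b_i ~ Bernoulli(p_i) independently.

    The n = 0 case contributes f_at_zero exactly (with probability prod(1 - p_i));
    for n > 0 a second-order expansion of f around E[n | n > 0] is used.
    """
    p = np.asarray(p, dtype=float)
    P0 = np.prod(1.0 - p)                            # P(n = 0)
    mean, var = p.sum(), (p * (1.0 - p)).sum()       # unconditional moments of n
    if P0 >= 1.0:                                    # no mass on n > 0
        return f_at_zero
    m = mean / (1.0 - P0)                            # E[n | n > 0]
    v = (var + mean ** 2) / (1.0 - P0) - m ** 2      # Var[n | n > 0]
    return P0 * f_at_zero + (1.0 - P0) * (f(m) + 0.5 * v * f2(m))

# e.g. E[Psi(c + n)], keeping the digamma away from its divergence at zero
c = 0.5
approx = expected_f_of_count(
    p=np.array([0.2, 0.7, 0.1]),
    f=lambda n: polygamma(0, c + n),       # Psi(c + n)
    f2=lambda n: polygamma(2, c + n),      # second derivative of Psi at c + n
    f_at_zero=polygamma(0, c),
)
```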
59 [7] showed that if the dependence of a set of variables, say A, on another set of variables, say z, is modelled exactly, then in deriving the updates for z we may equivalently integrate out A. [sent-137, score-0.086]
60 The variables β, τ are not adapted in that code, so we fixed them at β = 100 and τw = 1/W for all algorithms (see below for discussion regarding adapting these in CV-HDP). [sent-155, score-0.049]
61 G-HDP was initialized with either 1 topic (G-HDP1 ) or with 100 topics (G-HDP100 ). [sent-156, score-0.321]
62 We set2 hyperparameters aα, bα, aβ, bβ in the range [2, 6], while aγ, bγ were chosen in the range [5, 10] and aτ in [30, 50]/W. [sent-158, score-0.064]
63 The number of topics used in CV-HDP was truncated at 40, 80, and 120 topics, corresponding to the number of topics used in the LDA algorithms. [sent-159, score-0.392]
64 Note that this scheme simply converts prior expectations about the number of topics and amount of sharing into hyperparameter values, and that they were never tweaked. [sent-163, score-0.253]
65 Since they always ended up in these compact ranges and since we do not expect a strong dependence on their values inside these ranges we choose to omit the details. [sent-164, score-0.046]
66 Performance was evaluated by comparing i) the in-sample (train) variational bound on the log-likelihood for all three variational methods and ii) the out-of-sample (test) log-likelihood for all five methods. [sent-165, score-0.706]
67 All inference algorithms were run on 90% of the words in each document while test-set performance was evaluated on the remaining 10% of the words. [sent-166, score-0.131]
68 Test-set log-likelihood was computed as follows for the variational methods: p(xtest) = ∏_ij ∑_k θ̄jk φ̄k,xtest_ij, where θ̄jk = (απk + Eq[njk·]) / (α + Eq[nj··]) and φ̄kw = (βτw + Eq[n·kw]) / (β + Eq[n·k·]) (16) Note that we used estimated mean values of θjk and φkw [14]. [sent-167, score-0.387]
69 For CV-HDP we replaced all hyperparameters by their expectations. [sent-168, score-0.064]
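A sketch of the predictive computation in (16); the array names for the expected counts and the loop structure are assumptions for illustration.

```python
import numpy as np

def test_log_likelihood(test_docs, E_njk, E_nkw, alpha, beta, pi, tau):
    """log p(x_test) from (16): p(x) = prod_ij sum_k theta_bar[j, k] * phi_bar[k, x_ij]."""
    # theta_bar[j, k] = (alpha * pi_k + E[n_jk.]) / (alpha + E[n_j..])
    theta_bar = (alpha * pi[None, :] + E_njk) / (alpha + E_njk.sum(axis=1, keepdims=True))
    # phi_bar[k, w] = (beta * tau_w + E[n_.kw]) / (beta + E[n_.k.])
    phi_bar = (beta * tau[None, :] + E_nkw) / (beta + E_nkw.sum(axis=1, keepdims=True))
    ll = 0.0
    for j, words in enumerate(test_docs):             # words: integer token ids of test doc j
        ll += np.sum(np.log(theta_bar[j] @ phi_bar[:, words]))
    return ll
```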
70 The results, shown in Figure 2, display a significant improvement in accuracy of CV-HDP over CV-LDA, both in terms of the bound on the training log-likelihood as well as for the test-set loglikelihood. [sent-170, score-0.046]
71 This is caused by the fact that CV-HDP is learning the variational distributions over the hyperparameters. [sent-171, score-0.341]
72 A second observation is that convergence of all variational methods is faster than for the sampling methods. [sent-174, score-0.372]
73 Thirdly, we see significant local optima effects in our simulations. [sent-175, score-0.07]
74 For example, G-HDP100 achieves the best results, better than G-HDP1 , indicating that pruning topics is a better way than adding topics to escape local optima in these models and leads to better posterior modes. [sent-176, score-0.502]
75 In further experiments we have also found that the variational methods benefit from better initializations due to local optima. [sent-177, score-0.341]
76 In Figure 3 we show results when the variational methods were initialized at the last state obtained by G-HDP100 . [sent-178, score-0.367]
77 We see that indeed the variational methods were able to find significantly better local optima in the vicinity of the one found by G-HDP100 , and that CV-HDP is still consistently better than the other variational methods. [sent-179, score-0.752]
78 5 Discussion In this paper we have explored collapsed variational inference for the HDP. [sent-180, score-0.56]
79 Our algorithm is the first to deal with the HDP and with posteriors over the parameters of Dirichlet distributions. [sent-181, score-0.089]
80 We found that the CV-HDP performs significantly better than the CV-LDA on both test-set likelihood and the variational bound. [sent-182, score-0.341]
81 A caveat is that CV-HDP gives slightly worse test-set likelihood than collapsed Gibbs sampling. [sent-183, score-0.186]
82 However, as discussed in the introduction, we believe there are advantages to variational approximations that are not available to sampling methods. [sent-184, score-0.405]
83 A second caveat is that our variational approximation works only for two layer HDPs—a layer of group-specific DPs, and a global DP tying the groups together. [sent-185, score-0.397]
84 It would be interesting to explore variational approximations for more general HDPs. [sent-186, score-0.374]
85 Firstly, we use a more sophisticated variational approximation that can infer posterior distributions over the higher level variables in the model. [sent-188, score-0.455]
86 Secondly, we use a more sophisticated HDP based model with an infinite number of topics, and allow the model to find an appropriate number of topics automatically. [sent-189, score-0.196]
87 These two advances are coupled, because we needed the more sophisticated variational approximation to deal with the HDP. [sent-190, score-0.391]
88 Firstly, we have a new truncation technique that guarantees nesting. [sent-192, score-0.063]
89 As a result we know that the variational bound on the marginal log-likelihood will reach its highest value (ignoring local optima issues) when K → ∞. [sent-193, score-0.456]
90 This fact should facilitate the search over number of topics or clusters, e. [sent-194, score-0.196]
91 by splitting and merging topics, an aspect that we have not yet fully explored, and from which we expect to gain significantly in the face of the observed local optima issues in the experiments.
92 Secondly, we have an improved second-order approximation that is able to handle the often-encountered digamma function accurately. [sent-197, score-0.063]
93 The standard evaluation criteria in this area of research are the variational bound ! [sent-199, score-0.365]
94 Top row: log p(xtest ) as a function of K, Middle row: log p(xtest ) as a function of number of steps (defined as number of iterations multiplied by K) and Bottom row: variational bounds as a function of K. [sent-283, score-0.455]
95 The distribution over the number of topics found by G-HDP1 are: KOS: K = 113. [sent-286, score-0.196]
96 Figure 3: G-HDP100-initialized variational methods (K = 130), compared against variational methods initialized in the usual manner with K = 130 as well. [sent-324, score-0.734]
97 An alternative is to compare the computed posteriors over latent variables on toy problems with known true values. [sent-328, score-0.168]
98 For some applications variational approximations may prove to be the most convenient tool for inference. [sent-331, score-0.374]
99 Acknowledgements We thank the reviewers for thoughtful and constructive comments. [sent-334, score-0.048]
100 A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. [sent-389, score-0.594]
wordName wordTfidf (topN-words)
[('ndk', 0.393), ('zid', 0.343), ('variational', 0.341), ('kw', 0.307), ('hdp', 0.298), ('topics', 0.196), ('dirichlet', 0.18), ('xid', 0.162), ('sdk', 0.157), ('collapsed', 0.155), ('id', 0.133), ('tkw', 0.126), ('auxiliary', 0.125), ('vb', 0.12), ('lda', 0.101), ('topic', 0.099), ('cvhdp', 0.09), ('cvlda', 0.09), ('vlda', 0.09), ('gibbs', 0.087), ('dk', 0.078), ('xtest', 0.072), ('optima', 0.07), ('kx', 0.067), ('posteriors', 0.064), ('hyperparameters', 0.064), ('dir', 0.063), ('truncation', 0.063), ('kos', 0.063), ('jk', 0.061), ('document', 0.061), ('reuters', 0.057), ('log', 0.057), ('expectations', 0.057), ('latent', 0.055), ('kurihara', 0.054), ('dp', 0.05), ('variables', 0.049), ('ns', 0.043), ('inference', 0.043), ('gamma', 0.041), ('posterior', 0.04), ('shall', 0.039), ('beta', 0.039), ('updates', 0.039), ('nonparametric', 0.038), ('digamma', 0.038), ('bayesian', 0.037), ('glda', 0.036), ('idid', 0.036), ('ndkw', 0.036), ('proliferation', 0.036), ('ee', 0.036), ('approximations', 0.033), ('caveat', 0.031), ('collapsing', 0.031), ('mult', 0.031), ('pioneered', 0.031), ('sampling', 0.031), ('ak', 0.03), ('issues', 0.03), ('teh', 0.029), ('welling', 0.029), ('gd', 0.029), ('kl', 0.028), ('hierarchical', 0.028), ('word', 0.027), ('sums', 0.027), ('words', 0.027), ('quantities', 0.026), ('geometric', 0.026), ('initialized', 0.026), ('modelled', 0.026), ('approximation', 0.025), ('ended', 0.025), ('constructive', 0.025), ('restaurant', 0.025), ('multinomial', 0.025), ('averages', 0.025), ('deal', 0.025), ('bound', 0.024), ('ez', 0.024), ('beal', 0.024), ('reviewers', 0.023), ('dealt', 0.023), ('treat', 0.023), ('ij', 0.023), ('distributed', 0.022), ('chinese', 0.022), ('merging', 0.022), ('improvement', 0.022), ('parents', 0.021), ('explored', 0.021), ('marginal', 0.021), ('dependence', 0.021), ('nm', 0.021), ('index', 0.02), ('alternatives', 0.02), ('factorizes', 0.02), ('tables', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
2 0.24094228 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
3 0.21511227 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
4 0.16861032 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. ¢ ¤ ¦¥£ ¢ ¢
5 0.13781855 183 nips-2007-Spatial Latent Dirichlet Allocation
Author: Xiaogang Wang, Eric Grimson
Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA. 1
6 0.12544791 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
7 0.11164635 213 nips-2007-Variational Inference for Diffusion Processes
8 0.10485528 129 nips-2007-Mining Internet-Scale Software Repositories
9 0.097712092 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
10 0.096013263 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
11 0.095233545 145 nips-2007-On Sparsity and Overcompleteness in Image Models
12 0.080089644 197 nips-2007-The Infinite Markov Model
13 0.065622352 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
14 0.063383669 84 nips-2007-Expectation Maximization and Posterior Constraints
15 0.060567159 99 nips-2007-Hierarchical Penalization
16 0.056485374 214 nips-2007-Variational inference for Markov jump processes
17 0.055510141 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
18 0.055260167 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
19 0.049011502 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
20 0.048671123 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
topicId topicWeight
[(0, -0.167), (1, 0.071), (2, -0.061), (3, -0.308), (4, 0.065), (5, -0.103), (6, 0.001), (7, -0.165), (8, -0.072), (9, 0.05), (10, 0.035), (11, 0.008), (12, -0.027), (13, 0.145), (14, 0.061), (15, 0.059), (16, 0.121), (17, -0.006), (18, -0.035), (19, -0.1), (20, 0.026), (21, -0.026), (22, -0.097), (23, 0.06), (24, 0.094), (25, 0.071), (26, -0.021), (27, 0.062), (28, 0.083), (29, -0.033), (30, -0.006), (31, -0.022), (32, 0.088), (33, -0.059), (34, 0.008), (35, 0.031), (36, 0.035), (37, 0.042), (38, 0.028), (39, -0.1), (40, 0.046), (41, -0.026), (42, 0.002), (43, 0.004), (44, -0.004), (45, -0.021), (46, 0.096), (47, -0.155), (48, 0.026), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.95815676 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
2 0.78410423 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
3 0.78251886 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. ¢ ¤ ¦¥£ ¢ ¢
4 0.71788907 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
5 0.53529954 183 nips-2007-Spatial Latent Dirichlet Allocation
Author: Xiaogang Wang, Eric Grimson
Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA. 1
6 0.51048958 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
7 0.49616554 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
8 0.46038139 129 nips-2007-Mining Internet-Scale Software Repositories
9 0.41838571 214 nips-2007-Variational inference for Markov jump processes
10 0.40519702 213 nips-2007-Variational Inference for Diffusion Processes
11 0.38122243 197 nips-2007-The Infinite Markov Model
12 0.36897644 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
13 0.3283079 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
14 0.31623244 145 nips-2007-On Sparsity and Overcompleteness in Image Models
15 0.31083155 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
16 0.3070356 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
17 0.2910845 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
18 0.27947244 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
19 0.27236289 99 nips-2007-Hierarchical Penalization
20 0.26765031 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
topicId topicWeight
[(5, 0.03), (13, 0.033), (16, 0.036), (18, 0.044), (21, 0.047), (31, 0.014), (34, 0.024), (35, 0.025), (47, 0.073), (49, 0.024), (83, 0.09), (85, 0.041), (87, 0.083), (90, 0.06), (92, 0.294)]
simIndex simValue paperId paperTitle
same-paper 1 0.72088176 47 nips-2007-Collapsed Variational Inference for HDP
Author: Yee W. Teh, Kenichi Kurihara, Max Welling
Abstract: A wide variety of Dirichlet-multinomial ‘topic’ models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy. 1
2 0.61947858 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1
3 0.50060856 59 nips-2007-Continuous Time Particle Filtering for fMRI
Author: Lawrence Murray, Amos J. Storkey
Abstract: We construct a biologically motivated stochastic differential model of the neural and hemodynamic activity underlying the observed Blood Oxygen Level Dependent (BOLD) signal in Functional Magnetic Resonance Imaging (fMRI). The model poses a difficult parameter estimation problem, both theoretically due to the nonlinearity and divergence of the differential system, and computationally due to its time and space complexity. We adapt a particle filter and smoother to the task, and discuss some of the practical approaches used to tackle the difficulties, including use of sparse matrices and parallelisation. Results demonstrate the tractability of the approach in its application to an effective connectivity study. 1
4 0.48584035 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
5 0.48533607 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
Author: Leonid Sigal, Alexandru Balan, Michael J. Black
Abstract: Estimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a conditional mixture of kernel regressors. This predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height, weight, etc.) in both calibrated and uncalibrated camera environments. 1
6 0.48202342 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
7 0.47031188 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
8 0.46753207 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
9 0.46597192 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
10 0.45639133 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
11 0.45458266 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
12 0.45396739 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression
13 0.45363694 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
14 0.45214742 129 nips-2007-Mining Internet-Scale Software Repositories
15 0.45167738 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.45113617 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
17 0.45028514 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
18 0.44922578 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
19 0.44822603 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
20 0.44644213 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images