nips nips2013 nips2013-312 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. [sent-9, score-0.206]
2 We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. [sent-10, score-0.326]
3 vectors lying in the probability simplex ∆K = {(π1, . . . , πK) : πk ≥ 0, Σk πk = 1} ⊂ R^K (1) [sent-13, score-0.111]
4 Important examples include topic models like latent Dirichlet allocation (LDA) [BNJ03], admixture models in genetics like Structure [PSD00], and discrete directed graphical models with a Bayesian prior over the conditional probability tables [Hec99]. [sent-16, score-0.261]
5 Standard approaches to inference over the probability simplex include variational inference [Bea03, WJ08] and Markov chain Monte Carlo methods (MCMC) like Gibbs sampling [GRS96]. [sent-17, score-0.333]
6 For LDA in particular, these include variational inference [BNJ03], collapsed variational inference [TNW07, AWST09] and collapsed Gibbs sampling [GS04]. [sent-20, score-0.403]
7 With the increasingly large scale document corpora to which LDA and other topic models are applied, there has also been development of specialised and highly scalable algorithms [NASW09]. [sent-21, score-0.212]
8 Most proposed algorithms are based on a batch learning framework, where the whole document corpus needs to be stored and accessed for every iteration. [sent-22, score-0.141]
9 In some scenarios, and particularly in LDA, MCMC algorithms have been shown to work extremely well, and in fact achieve better results faster than variational inference on small to medium corpora [GS04, TNW07, AWST09]. [sent-28, score-0.181]
10 We will make use of a recently developed MCMC method called stochastic gradient Langevin dynamics (SGLD) [WT11, ABW12], which operates in a similar online minibatch framework to OVB. [sent-30, score-0.38]
11 Unlike OVB and other stochastic gradient descent algorithms, SGLD is not a gradient descent algorithm. [sent-31, score-0.246]
12 Rather, it is a Hamiltonian MCMC [Nea10] algorithm which will asymptotically produce samples from the posterior distribution. [sent-32, score-0.09]
13 It achieves this by updating parameters according to both the stochastic gradients and additional noise which forces it to explore the full posterior instead of simply converging to a MAP configuration. [sent-33, score-0.121]
14 Firstly, the probability simplex (1) is compact and has boundaries that have to be accounted for when an update proposes a step that brings the vector outside the simplex. [sent-35, score-0.146]
15 Secondly, the typical Dirichlet priors over the probability simplex place most of their mass close to the boundaries and corners of the simplex. [sent-36, score-0.168]
16 This also causes a problem: depending on the parameterisation used, the gradient required for Langevin dynamics is inversely proportional to entries in π and hence can blow up when components of π are close to zero. [sent-40, score-0.482]
17 These considerations lead us to the first contribution of this paper in Section 3, which is an investigation into different ways to parameterise the probability simplex. [sent-42, score-0.078]
18 This section shows that the choice of a good parameterisation is not obvious, and that the use of the Riemannian geometry of the simplex [Ama95, GC11] is important in designing Langevin MCMC algorithms. [sent-43, score-0.363]
19 In particular, we show that an unnormalized parameterisation, using a mirroring trick to remove boundaries, coupled with a natural gradient update, achieves the best mixing performance. [sent-44, score-0.173]
20 In Section 4, we then show that the SGLD algorithm, using this parameterisation and natural gradient updates, performs significantly better than OVB algorithms [HBB10, MHB12]. [sent-45, score-0.338]
21 Review of Langevin dynamics: Suppose we model a data set x = x1, . . . , xN [sent-48, score-0.169]
22 with a generative model p(x | θ) = ∏_{i=1}^N p(xi | θ) parameterized by θ ∈ R^D, with prior p(θ), and that our aim is to compute the posterior p(θ | x). [sent-51, score-0.094]
23 Langevin dynamics [Ken90, Nea10] is an MCMC scheme which produces samples from the posterior by means of gradient updates plus Gaussian noise, resulting in a proposal distribution q(θ∗ | θ) as described by Equation 2. [sent-52, score-0.448]
24 θ∗ = θ + (ε/2) ( ∇θ log p(θ) + Σ_{i=1}^N ∇θ log p(xi | θ) ) + √ε ζ, ζ ∼ N(0, I) (2) The mean of the proposal distribution is in the direction of increasing log posterior due to the gradient, while the added noise will prevent the samples from collapsing to a single (local) maximum. [sent-53, score-0.152]
25 A Metropolis-Hastings correction step is required to correct for discretisation error, with proposals θ∗ accepted with probability min{1, [p(θ∗ | x) q(θ | θ∗)] / [p(θ | x) q(θ∗ | θ)]} [RS02]. [sent-54, score-0.099]
26 As ε tends to zero, the acceptance ratio tends to one, as the Markov chain tends to a stochastic differential equation which has p(θ | x) as its stationary distribution [Ken78]. [sent-55, score-0.16]
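For concreteness, a minimal sketch of one Langevin step with the Metropolis-Hastings correction described above, assuming hypothetical helpers `log_post` and `grad_log_post` that return the (unnormalised) log posterior and its gradient; this is an illustration of the generic scheme, not the paper's code.

```python
import numpy as np

def langevin_step(theta, log_post, grad_log_post, eps, rng):
    # Proposal mean moves in the direction of increasing log posterior (cf. Equation 2).
    def drift(a):
        return a + 0.5 * eps * grad_log_post(a)

    theta_prop = drift(theta) + np.sqrt(eps) * rng.standard_normal(theta.shape)

    def log_q(b, a):
        # Log density (up to a constant) of the Gaussian proposal N(drift(a), eps I) at b.
        return -np.sum((b - drift(a)) ** 2) / (2.0 * eps)

    # Metropolis-Hastings correction for the discretisation error.
    log_alpha = (log_post(theta_prop) + log_q(theta, theta_prop)
                 - log_post(theta) - log_q(theta_prop, theta))
    return theta_prop if np.log(rng.uniform()) < log_alpha else theta
```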
27 Riemannian Langevin dynamics: Langevin dynamics has an isotropic proposal distribution leading to slow mixing if the components of θ have very different scales or if they are highly correlated. [sent-57, score-0.424]
28 We will refer to their algorithm as Riemannian Langevin dynamics (RLD) in this paper. [sent-60, score-0.169]
29 As in Langevin dynamics, RLD consists of a Gaussian proposal q(θ∗ | θ), along with a Metropolis-Hastings correction step. [sent-64, score-0.114]
30 Whereas the standard gradient gives the direction of steepest ascent in Euclidean space, the natural gradient gives the direction of steepest ascent taking into account the geometry implied by G(θ). [sent-66, score-0.249]
31 Stochastic gradient Riemannian Langevin dynamics: In the Langevin dynamics and RLD algorithms, the proposal distribution requires calculation of the gradient of the log likelihood w.r.t. θ over the full data set. [sent-70, score-0.59]
32 The stochastic gradient Langevin dynamics (SGLD) algorithm [WT11] replaces the calculation of the gradient over the full data set with a stochastic approximation based on a subset of data. [sent-75, score-0.471]
33 Convergence to the posterior is still guaranteed as long as decaying step sizes satisfying Σ_{t=1}^∞ εt = ∞, Σ_{t=1}^∞ εt² < ∞ are used. [sent-77, score-0.09]
34 In this paper we combine the use of a preconditioning matrix G(θ) as in RLD with this stochastic gradient approximation, by replacing the exact gradient in Equation 4 with the approximation from Equation 5. [sent-78, score-0.272]
35 The resulting algorithm, stochastic gradient Riemannian Langevin dynamics (SGRLD), avoids the slow mixing problems of Langevin dynamics, while still being applicable in a large scale online setting due to its use of stochastic gradients and lack of Metropolis-Hastings correction steps. [sent-79, score-0.49]
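A minimal sketch of the resulting update, under the assumptions that the preconditioner G(θ)⁻¹ is diagonal (as in the expanded-mean case discussed later) and that `grad_log_prior`, `grad_log_lik` and `g_inv_diag` are hypothetical user-supplied functions; the curvature terms arising from the derivatives of G(θ) in the exact RLD drift are omitted for brevity.

```python
import numpy as np

def sgrld_step(theta, data, minibatch_idx, grad_log_prior, grad_log_lik,
               g_inv_diag, eps, rng):
    # Stochastic estimate of the log-posterior gradient (cf. Equation 5):
    # prior term plus minibatch likelihood terms rescaled by N / |minibatch|.
    N, n = len(data), len(minibatch_idx)
    grad_est = grad_log_prior(theta) + (N / n) * sum(
        grad_log_lik(theta, data[i]) for i in minibatch_idx)

    g_inv = g_inv_diag(theta)   # diagonal of G(theta)^{-1}, e.g. theta itself
                                # under the expanded-mean parameterisation
    noise = np.sqrt(eps * g_inv) * rng.standard_normal(theta.shape)

    # No Metropolis-Hastings correction is applied; the exact RLD drift also
    # contains terms from the derivatives of G(theta), omitted in this sketch.
    return theta + 0.5 * eps * g_inv * grad_est + noise
```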
36 3 Riemannian Langevin dynamics on the probability simplex: In this section, we investigate the issues which arise when applying Langevin Monte Carlo methods, specifically the Langevin dynamics and Riemannian Langevin dynamics algorithms, to models whose parameters lie on the probability simplex. [sent-80, score-0.618]
37 This results in a Dirichlet posterior p(π | x) ∝ ∏_k πk^{nk + αk − 1}, where nk = Σ_{i=1}^N δ(xi = k). [sent-86, score-0.094]
38 We consider both the mean and natural parameter spaces, and in each of these we try both a reduced (K − 1 dimensional) and expanded (K dimensional) parameterisation, with details as follows. [sent-92, score-0.078]
39 Running Langevin dynamics or RLD on the full K dimensional parameterisation will result in proposals that are off the simplex with probability one. [sent-95, score-0.545]
40 Note however that the proposals can still violate the boundary constraint 0 < πk < 1, and this is particularly problematic when the posterior has mass close to the boundaries. [sent-97, score-0.183]
41 Reduced-Natural: in the natural parameter space, the reduced parameterisation takes the form πk = e^{θk} / (1 + Σ_{j=1}^{K−1} e^{θj}) for k = 1, . . . , K − 1. [sent-104, score-0.243]
42 Expanded-Natural: finally, the expanded-natural parameterisation takes the form πk = e^{θk} / Σ_{j=1}^K e^{θj} for k = 1, . . . , K. [sent-110, score-0.218]
43 For all parameterisations, we run both Langevin dynamics and RLD. [sent-115, score-0.169]
44 For the reduced parameterisations, we can use the expected Fisher information matrix, but the redundancy in the full parameterisations means that this matrix has rank K−1 and hence is not invertible. [sent-117, score-0.125]
45 For these parameterisations we use the expected Fisher information matrix for a Gamma/Poisson model, which is equivalent to the Dirichlet/Multinomial apart from the fact that the total number of data items is considered to be random as well. [sent-118, score-0.161]
46 The details for each parameterisation are summarised in Table 1. [sent-119, score-0.218]
47 In all cases we are interested in sampling from the posterior distribution on π, while θ is the specific parameterisation being used. [sent-120, score-0.312]
48 For the mean parameterisations, the θ^{-1} term in the gradient of the log-posterior means that for components of θ which are close to zero, the proposal distribution for Langevin dynamics (Equation 2) has a large mean, resulting in unstable proposals with a small acceptance probability. [sent-121, score-0.404]
49 Due to the form of G(θ)−1 , the same argument holds for the RLD proposal distribution for the natural parameterisations. [sent-122, score-0.087]
50 This leaves us with three possible combinations: RLD on the expanded-mean parameterisation, and Langevin dynamics on each of the natural parameterisations. [sent-123, score-0.412]
51 To investigate their relative performances we run a small experiment, producing 110,000 samples from each of the three remaining parameterisations, discarding 10,000 burn-in samples and thinning the remaining samples by a factor of 100. [sent-126, score-0.102]
52 For the resulting 1000 thinned samples of θ, we calculate the corresponding samples of π, and compute the effective sample size for each component of π. [sent-127, score-0.079]
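A rough sketch of an effective sample size computation of the kind used here, based on the standard estimate ESS = N / (1 + 2 Σ_t ρ_t) with the autocorrelation sum truncated at the first non-positive lag; the paper does not specify its exact estimator, so this is an assumption for illustration.

```python
import numpy as np

def effective_sample_size(x):
    """ESS for a 1-D chain: N / (1 + 2 * sum of positive-lag autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (xc @ xc)
    tau = 0.0
    for rho in acf[1:]:
        if rho <= 0.0:          # truncate at the first non-positive autocorrelation
            break
        tau += rho
    return n / (1.0 + 2.0 * tau)
```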
53 The samples from Langevin dynamics on both natural parameterisations display higher auto-correlation than the RLD samples produced using the expanded-mean parameterisation, as would be expected from their lower effective sample sizes. [sent-131, score-0.396]
54 In addition to the increased effective sample size, the expanded-mean parameterisation RLD sampler has the advantage that it is computationally efficient as G(θ) is a diagonal matrix. [sent-132, score-0.218]
55 Hence it is this algorithm that we use when applying these techniques to latent Dirichlet allocation in Section 4. [sent-133, score-0.127]
56 4 Applying Riemannian Langevin dynamics to latent Dirichlet allocation: Latent Dirichlet Allocation (LDA) [BNJ03] is a hierarchical Bayesian model, most frequently used to model topics arising in collections of text documents. [sent-134, score-0.337]
57 A document d is then modelled by a mixture of topics, with mixing proportion ηd , drawn from a symmetric Dirichlet prior with hyper-parameter α. [sent-136, score-0.135]
58 The model corresponds to a generative process where documents are produced by drawing a topic assignment zdi i.i.d. [sent-137, score-0.311]
59 from ηd for each word wdi in document d, and then drawing the word wdi from the corresponding topic πzdi . [sent-140, score-0.451]
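The generative process just described can be sketched as follows; hyper-parameter names follow the text (α for the document proportions ηd, β for the topics πk), while the corpus layout and function signature are assumptions for illustration.

```python
import numpy as np

def generate_lda_corpus(D, K, W, alpha, beta, doc_lengths, rng):
    # Topics: each pi_k is a distribution over the W-word vocabulary, ~ Dir(beta).
    pi = rng.dirichlet(np.full(W, beta), size=K)
    docs = []
    for d in range(D):
        eta_d = rng.dirichlet(np.full(K, alpha))          # mixing proportions for document d
        z = rng.choice(K, size=doc_lengths[d], p=eta_d)   # topic assignment z_di per word
        words = np.array([rng.choice(W, p=pi[k]) for k in z])  # word w_di ~ pi_{z_di}
        docs.append(words)
    return pi, docs
```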
60 Integrating out ηd analytically, we can factorise Equation 6 as p(w, z, π | α, β) = p(π | β) ∏_{d=1}^D p(wd, zd | α, π) (7), where p(wd, zd | α, π) = [Γ(Kα) / Γ(Kα + nd··)] ∏_{k=1}^K [Γ(α + ndk·) / Γ(α)] ∏_{w=1}^W πkw^{ndkw} (8). [sent-145, score-0.313]
61 4.1 Stochastic gradient Riemannian Langevin dynamics for LDA: As we would like to apply these techniques to large document collections, we use the stochastic gradient version of the Riemannian Langevin dynamics algorithm, as detailed in Section 2. [sent-146, score-0.666]
62 The algorithm runs on mini-batches of documents: at time t it receives a mini-batch of documents indexed by Dt , drawn at random from the full corpus D. [sent-164, score-0.129]
63 The stochastic gradient of the log posterior of θ on Dt is shown in Equation 9. [sent-165, score-0.216]
64 Comparing Equation 9 to the gradient for the simple model from Section 3, the observed counts nk for the simple model have been replaced with the expectation of the latent topic assignment counts ndkw . [sent-169, score-0.368]
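Putting these pieces together, one SGRLD update of the topic parameters might look like the sketch below, assuming a mirrored expanded-mean update of the form θ*_kw = |θ_kw + (ε/2)(β − θ_kw + (|D|/|Dt|) Σ_{d∈Dt} E[n_dkw − π_kw n_dk·]) + √(ε θ_kw) ζ_kw| with ζ ∼ N(0, 1). The helper `expected_counts` (e.g. a few Gibbs sweeps over zd) and the exact constants are assumptions on our part, not a transcription of the paper's update equations.

```python
import numpy as np

def sgrld_lda_step(theta, minibatch_docs, D_total, beta, eps, expected_counts, rng):
    # theta: (K, W) unnormalised topic parameters; pi_kw = theta_kw / sum_w theta_kw.
    pi = theta / theta.sum(axis=1, keepdims=True)

    # Minibatch estimate of E[n_dkw - pi_kw * n_dk.] rescaled to the full corpus.
    grad = np.zeros_like(theta)
    for doc in minibatch_docs:
        e_ndkw, e_ndk = expected_counts(doc, theta)   # shapes (K, W) and (K,)
        grad += e_ndkw - pi * e_ndk[:, None]
    grad *= D_total / len(minibatch_docs)

    drift = 0.5 * eps * (beta - theta + grad)
    noise = np.sqrt(eps * theta) * rng.standard_normal(theta.shape)
    return np.abs(theta + drift + noise)              # mirroring keeps theta positive
```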
65 We compare it to two online variational Bayesian algorithms developed for latent Dirichlet allocation: online variational Bayes (OVB) [HBB10] and hybrid stochastic variational-Gibbs (HSVG) [MHB12]. [sent-172, score-0.382]
66 The difference between these two methods is the form of variational assumption made. [sent-173, score-0.101]
67 OVB assumes a mean-field variational posterior, q(η1:D, z1:D, π1:K) = ∏_d q(ηd) ∏_{d,i} q(zdi) ∏_k q(πk); in particular this means topic assignment variables within the same document are assumed to be independent, when in reality they will be strongly coupled. [sent-174, score-0.29]
68 In contrast HSVG collapses ηd analytically and uses a variational posterior of the form q(z1:D, π1:K) = ∏_d q(zd) ∏_k q(πk), which allows dependence within the components of zd. [sent-175, score-0.301]
69 This more complicated posterior requires Gibbs sampling in the variational update step for zd, and we combined the code for OVB [HBB10] with the Gibbs sampling routine from our SGRLD code to implement HSVG. [sent-176, score-0.336]
70 For a held-out document wd and a training set W, the perplexity is given by exp( −(1/nd··) Σ_i log p(wdi | W, α, β) ). [sent-180, score-0.29]
71 To calculate the perplexity for SGRLD, we integrate η analytically, so Equation 13 is replaced by p(wdi | wd^train, W, α, β) = E_{π|W,β} E_{zd^train|π,α} [ Σ_k η̂dk πk,wdi ] (14), where η̂dk := p(zdi^test = k | zd^train, α) = (n^train_dk· + α) / (n^train_d·· + Kα) (15). [sent-185, score-0.442]
72 We estimate these expectations using the samples we obtain for π from the Markov chain produced by SGRLD, and samples for zd^train produced by Gibbs sampling the topic assignments on wd^train. [sent-186, score-0.586]
73 For the variational algorithms, p(wdi | W, α, β) = E_{p(ηd,π|W,α,β)} [ Σ_k ηdk πk,wdi ] ≈ Σ_k E_{q(ηd)}[ηdk] E_{q(πk)}[πk,wdi] (16). We estimate the perplexity directly rather than use a variational bound [HBB10] so that we can compare results of the variational algorithms to those of SGRLD. [sent-188, score-0.269]
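A sketch of how the held-out perplexity could be estimated from posterior samples along the lines of Equations 13-15: average the predictive word probabilities Σ_k η̂dk πk,w over samples of π and of the training-half topic assignments, then exponentiate the negative mean log probability. The data structures `pi_samples`, `eta_hat_samples` and `test_docs` are assumptions for illustration.

```python
import numpy as np

def perplexity_from_samples(pi_samples, eta_hat_samples, test_docs):
    # pi_samples: list of (K, W) topic samples from the SGRLD chain.
    # eta_hat_samples: list of (D, K) arrays of eta_hat_dk estimates (Equation 15),
    #                  one per sample, from Gibbs sampling z_d on the training half.
    # test_docs: list of arrays of word indices (the test half of each document).
    total_log_prob, n_words = 0.0, 0
    for d, words in enumerate(test_docs):
        # Monte Carlo average of sum_k eta_hat_dk * pi_k,w over posterior samples.
        probs = np.mean(
            [eta_hat[d] @ pi[:, words]
             for pi, eta_hat in zip(pi_samples, eta_hat_samples)],
            axis=0)
        total_log_prob += np.log(probs).sum()
        n_words += len(words)
    return np.exp(-total_log_prob / n_words)
```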
74 This corpus contains 2483 documents, which is small enough to run all three algorithms in batch mode and compare their performance to that of collapsed Gibbs sampling on the full collection. [sent-191, score-0.141]
75 Each document was split 80/20 into training and test sets; the training portion of all 2483 documents was used in each update step, and the perplexity was calculated on the test portion of all documents. [sent-192, score-0.219]
76 Perplexities were estimated for a range of step size parameters, and for 1, 5 and 10 document updates per topic parameter update. [sent-196, score-0.197]
77 For OVB the document updates are fixed point iterations of q(zd ) while for HSVG and SGRLD they are Gibbs updates of zd , the first half of which were discarded as burn-in. [sent-197, score-0.258]
78 These numbers of document updates were chosen because a previous investigation of the performance of HSVG for varying numbers of Gibbs updates showed that 6-10 updates are sufficient to achieve good performance [MHB12]. [sent-198, score-0.203]
79 Figure 2(a) shows the lowest perplexities achieved along with the corresponding parameter settings. [sent-199, score-0.089]
80 As it uses a less restricted variational distribution it should perform at least as well. [sent-202, score-0.101]
81 3 Results on Wikipedia corpus: The algorithms' performance in an online scenario was assessed on a set of articles downloaded at random from Wikipedia, as in [HBB10]. [sent-205, score-0.128]
82 150,000 documents from Wikipedia were used in total, in minibatches of 50 documents each. [sent-208, score-0.14]
83 As the corpus size is large, collapsed Gibbs sampling was not run on this data set. [sent-212, score-0.141]
84 For each algorithm a grid-search was run on the hyper-parameters, step-size parameters, and number of Gibbs sampling sweeps / variational fixed point iterations per π update. [sent-213, score-0.13]
85 The lowest three perplexities attained for each algorithm are shown in Figure 2(b). [sent-214, score-0.089]
86 The performance of SGRLD is a substantial improvement on both the variational algorithms. [sent-217, score-0.101]
87 Using an expanded parameterisation with a reflection trick for negative proposals removed the need to deal with boundary constraints, and using the Riemannian geometry of the parameter space dealt with the problem of parameters with differing scales. [sent-219, score-0.183]
88 Due to the widespread use of models defined on the probability simplex, we believe the methods developed here for Langevin dynamics on the probability simplex will find further uses beyond latent Dirichlet allocation and stochastic gradient Monte Carlo methods. [sent-227, score-0.558]
89 References [ABW12] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring. [sent-230, score-0.089]
90 Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring, Proceedings of the International Conference on Machine Learning, 2012. [sent-235, score-0.245]
91 Teh, On smoothing and inference for topic models, Proceedings of the International Conference on Uncertainty in Artificial Intelligence, vol. [sent-244, score-0.116]
92 Spiegelhalter, Markov Chain Monte Carlo in Practice, Chapman and Hall, 1996. [sent-271, score-0.126]
93 Bach, Online learning for latent Dirichlet allocation, Advances in Neural Information Processing Systems, 2010. [sent-281, score-0.155]
94 Blei, Sparse stochastic inference for latent Dirichlet allocation, Proceedings of the International Conference on Machine Learning, 2012. [sent-295, score-0.137]
95 Welling, Distributed algorithms for topic models, Journal of Machine Learning Research (2009). [sent-300, score-0.083]
96 Sato, Online model selection based on the variational Bayes, Neural Computation 13 (2001), 1649–1681. [sent-325, score-0.101]
97 Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, Advances in Neural Information Processing Systems, vol. [sent-330, score-0.235]
98 Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning 1 (2008), no. [sent-337, score-0.101]
99 Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno, Evaluation methods for topic models, Proceedings of the 26th International Conference on Machine Learning (ICML) (Montreal) (Léon Bottou and Michael Littman, eds. [sent-340, score-0.083]
100 Teh, Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the International Conference on Machine Learning, 2011. [sent-346, score-0.151]
wordName wordTfidf (topN-words)
[('langevin', 0.536), ('kw', 0.233), ('ovb', 0.232), ('parameterisation', 0.218), ('rld', 0.214), ('sgrld', 0.196), ('hsvg', 0.178), ('dynamics', 0.169), ('riemannian', 0.165), ('sgld', 0.143), ('wdi', 0.143), ('wd', 0.141), ('parameterisations', 0.125), ('zd', 0.112), ('simplex', 0.111), ('dirichlet', 0.107), ('kwdi', 0.107), ('ndk', 0.107), ('zdi', 0.107), ('variational', 0.101), ('gradient', 0.095), ('ndkw', 0.089), ('perplexities', 0.089), ('topic', 0.083), ('document', 0.082), ('allocation', 0.079), ('lda', 0.077), ('wikipedia', 0.07), ('documents', 0.07), ('gibbs', 0.069), ('perplexity', 0.067), ('mcmc', 0.066), ('posterior', 0.065), ('proposal', 0.062), ('welling', 0.062), ('dt', 0.061), ('corpus', 0.059), ('stochastic', 0.056), ('ezd', 0.053), ('parameterise', 0.053), ('collapsed', 0.053), ('expanded', 0.053), ('correction', 0.052), ('carlo', 0.05), ('monte', 0.05), ('diag', 0.05), ('boundary', 0.049), ('dk', 0.049), ('latent', 0.048), ('ld', 0.048), ('corpora', 0.047), ('proposals', 0.047), ('equation', 0.047), ('manifold', 0.043), ('topics', 0.041), ('fisher', 0.039), ('jk', 0.038), ('online', 0.038), ('hamiltonian', 0.037), ('korattikara', 0.036), ('items', 0.036), ('boundaries', 0.035), ('geometry', 0.034), ('bayesian', 0.033), ('inference', 0.033), ('updates', 0.032), ('diffusions', 0.031), ('riemann', 0.031), ('teh', 0.031), ('acceptance', 0.031), ('articles', 0.031), ('train', 0.031), ('prior', 0.029), ('mirroring', 0.029), ('thinned', 0.029), ('ntrain', 0.029), ('ahn', 0.029), ('sampling', 0.029), ('gatsby', 0.029), ('nk', 0.029), ('thinning', 0.027), ('produced', 0.027), ('preconditioning', 0.026), ('quantum', 0.026), ('chain', 0.026), ('investigation', 0.025), ('samples', 0.025), ('natural', 0.025), ('sizes', 0.025), ('median', 0.025), ('assignment', 0.024), ('mimno', 0.024), ('mixing', 0.024), ('calculating', 0.023), ('analytically', 0.023), ('minibatch', 0.022), ('genetics', 0.022), ('gamma', 0.022), ('blei', 0.022), ('mass', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
2 0.11473642 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
Author: Dahua Lin
Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1
3 0.11183513 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
4 0.10519239 256 nips-2013-Probabilistic Principal Geodesic Analysis
Author: Miaomiao Zhang, P.T. Fletcher
Abstract: Principal geodesic analysis (PGA) is a generalization of principal component analysis (PCA) for dimensionality reduction of data on a Riemannian manifold. Currently PGA is defined as a geometric fit to the data, rather than as a probabilistic model. Inspired by probabilistic PCA, we present a latent variable model for PGA that provides a probabilistic framework for factor analysis on manifolds. To compute maximum likelihood estimates of the parameters in our model, we develop a Monte Carlo Expectation Maximization algorithm, where the expectation is approximated by Hamiltonian Monte Carlo sampling of the latent variables. We demonstrate the ability of our method to recover the ground truth parameters in simulated sphere data, as well as its effectiveness in analyzing shape variability of a corpus callosum data set from human brain images. 1
5 0.10276114 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
6 0.096181333 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
7 0.095763423 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
8 0.09225411 174 nips-2013-Lexical and Hierarchical Topic Regression
9 0.075905472 301 nips-2013-Sparse Additive Text Models with Low Rank Background
10 0.074088119 75 nips-2013-Convex Two-Layer Modeling
11 0.073631659 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
12 0.072127305 98 nips-2013-Documents as multiple overlapping windows into grids of counts
13 0.071130201 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
14 0.070366167 161 nips-2013-Learning Stochastic Inverses
15 0.069968626 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
16 0.064770333 143 nips-2013-Integrated Non-Factorized Variational Inference
17 0.062836714 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
18 0.06230133 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
19 0.060598813 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
20 0.060057022 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
topicId topicWeight
[(0, 0.15), (1, 0.043), (2, -0.04), (3, -0.002), (4, -0.003), (5, 0.089), (6, 0.113), (7, 0.052), (8, 0.175), (9, -0.073), (10, -0.012), (11, 0.104), (12, -0.002), (13, 0.03), (14, 0.004), (15, -0.06), (16, 0.075), (17, -0.042), (18, -0.023), (19, -0.022), (20, -0.005), (21, -0.083), (22, 0.004), (23, 0.022), (24, -0.009), (25, -0.055), (26, 0.109), (27, -0.06), (28, -0.032), (29, 0.032), (30, 0.01), (31, -0.021), (32, -0.037), (33, -0.034), (34, 0.01), (35, -0.047), (36, 0.075), (37, -0.019), (38, -0.003), (39, -0.056), (40, 0.04), (41, -0.037), (42, -0.064), (43, -0.004), (44, 0.011), (45, -0.026), (46, -0.051), (47, 0.013), (48, 0.026), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.92878246 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
2 0.84036088 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
3 0.75954044 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
4 0.7053017 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing
Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1
5 0.68599033 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
Author: Dahua Lin
Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1
6 0.68094403 301 nips-2013-Sparse Additive Text Models with Low Rank Background
7 0.66930038 174 nips-2013-Lexical and Hierarchical Topic Regression
8 0.5797981 86 nips-2013-Demixing odors - fast inference in olfaction
9 0.56827575 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
10 0.56797397 98 nips-2013-Documents as multiple overlapping windows into grids of counts
11 0.56575775 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
12 0.55960667 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
13 0.54691392 70 nips-2013-Contrastive Learning Using Spectral Methods
14 0.54686928 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
15 0.50764352 256 nips-2013-Probabilistic Principal Geodesic Analysis
16 0.50084239 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
17 0.49064246 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
18 0.49033037 161 nips-2013-Learning Stochastic Inverses
20 0.4799242 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
topicId topicWeight
[(4, 0.282), (16, 0.05), (33, 0.121), (34, 0.158), (36, 0.033), (41, 0.027), (49, 0.05), (56, 0.067), (70, 0.011), (85, 0.035), (89, 0.019), (93, 0.06), (95, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.77521193 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
2 0.60894024 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
3 0.60519123 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
4 0.60371315 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
5 0.60216004 148 nips-2013-Latent Maximum Margin Clustering
Author: Guang-Tong Zhou, Tian Lan, Arash Vahdat, Greg Mori
Abstract: We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches. 1
6 0.60169536 143 nips-2013-Integrated Non-Factorized Variational Inference
7 0.6007635 5 nips-2013-A Deep Architecture for Matching Short Texts
8 0.60051179 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
9 0.59933668 201 nips-2013-Multi-Task Bayesian Optimization
10 0.59791481 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
11 0.59740877 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
12 0.59726268 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
13 0.59688586 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
14 0.59648687 86 nips-2013-Demixing odors - fast inference in olfaction
15 0.5964033 173 nips-2013-Least Informative Dimensions
16 0.59326035 347 nips-2013-Variational Planning for Graph-based MDPs
17 0.59272307 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
18 0.59236354 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
19 0.59151161 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
20 0.59144372 183 nips-2013-Mapping paradigm ontologies to and from the brain