nips nips2008 nips2008-77 nips2008-77-reference knowledge-graph by maker-knowledge-mining

77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

Source: pdf

Author: Iain Murray, Ruslan Salakhutdinov

Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1

reference text

[1] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Proceedings of the International Conference on Machine Learning, volume 25, pages 872–879, 2008.

[2] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[3] Tom Minka. Divergence measures and message passing. TR-2005-173, Microsoft Research, 2005.

[4] Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B (Methodological), 56(1):3–48, 1994.

[5] Thomas L. Grifﬁths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems (NIPS*17). MIT Press, 2005.

[6] Hanna M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning, pages 977–984. ACM Press New York, NY, USA, 2006.

[7] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68(3):1–26, 2006.

[9] Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, December 1995.

[10] Christian Ritter and Martin A. Tanner. Facilitating the Gibbs sampler: the Gibbs stopper and the griddyGibbs sampler. Journal of the American Statistical Association, 87(419):861–868, 1992.

[11] Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453), 2001.

[12] Antonietta Mira and Geoff Nicholls. Bridge estimation of the probability density at a point. Statistica Sinica, 14:603–612, 2004.

[13] Francesco Bartolucci, Luisa Scaccia, and Antonietta Mira. Efﬁcient Bayes factor estimation from the reversible jump output. Biometrika, 93(1):41–52, 2006.

[14] Radford M. Neal. Erroneous results in “Marginal likelihood from the Gibbs output”, 1999. Available from http://www.cs.toronto.edu/∼radford/chib-letter.html.

[15] Simon Osindero and Geoffrey Hinton. Modeling image patches with a directed hierarchy of Markov random ﬁelds. In Advances in Neural Information Processing Systems (NIPS*20). MIT Press, 2008.

[16] Aapo Hyv¨ rinen. Fast and robust ﬁxed-point algorithms for independent component analysis. IEEE a Transactions on Neural Networks, 10(3):626–634, 1999.

[17] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1997.

[18] Vibhav Gogate, Bozhena Bidyuk, and Rina Dechter. Studies in lower bounding probability of evidence using the Markov inequality. In 23rd Conference on Uncertainty in Artiﬁcial Intelligence (UAI), 2007. A Real-valued latents and Metropolis–Hastings There are technical difﬁculties with the original Chib-style approach applied to Metropolis–Hastings and continuous latent variables. The continuous version of equation (9), S 1 (20) P (h∗ |v) = T (h∗ ← h)P (h|v) dh ≈ S s=1 T (h∗ ← h(s) ), h(s) ∼ P(H), doesn’t work if T is the Metropolis–Hastings operator. The Dirac-delta function at h = h∗ contains a signiﬁcant part of the integral, which is ignored by samples from P (h|v) with probability one. Following [11], the ﬁx is to instead integrate over the generalized detailed balance relationship (5). Chib and Jeliazkov implicitly took out the h∗ = h point from all of their integrals. We do the same: P (h∗ |v) = dh T (h∗ ← h)P (h|v) dh T (h ← h∗ ). (21) ∗ The numerator can be estimated as before. As both integrals omit h = h , the denominator is less than one when T contains a delta function. For Metropolis–Hastings: T (h ← h∗ ) = q(h; h∗ ) min 1, a(h; h∗ ) , where a(h; h∗ ) is an easy-to-compute acceptance ratio. Sampling from q(h; h∗ ) and averaging min(1, a(h; h∗ )) provides an estimate of the denominator. In our importance sampling approach there is no need to separately approximate an additional quantity. The algorithm in ﬁgure 1 still applies if the T ’s are interpreted as probability density functions. If, due to a rejection, h∗ is drawn in step 2. then the sum in step 7. will contain an inﬁnite term giving a trivial underestimate P (v) = 0. (Steps 3–6 need not be performed in this case.) On repeated runs, the average estimate is still unbiased, or an underestimate for chains that can’t mix. Alternatively, the variational approach (19) could be applied together with Metropolis–Hastings sampling.