nips nips2013 nips2013-36 nips2013-36-reference knowledge-graph by maker-knowledge-mining

36 nips-2013-Annealing between distributions by averaging moments


Source: pdf

Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov

Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averaging paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models.
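
As a concrete illustration of the annealing setup described in the abstract, the following is a minimal sketch (not from the paper) of AIS [4] between two 1-D Gaussians, comparing the standard geometric-averages path with a path that averages the moments of the endpoint distributions. The Gaussian example and all names (ais_log_z, geometric_path, moment_path) are illustrative assumptions; the paper's experiments target RBMs, not Gaussians.

# Minimal AIS sketch between two 1-D Gaussians (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
mu0, s0 = 0.0, 1.0      # initial distribution p_0 = N(mu0, s0^2)
mu1, s1 = 3.0, 0.5      # target  distribution p_K = N(mu1, s1^2)

def log_f_gauss(x, mu, s):
    """Unnormalized Gaussian log-density; the normalizer is sqrt(2*pi)*s."""
    return -0.5 * ((x - mu) / s) ** 2

def geometric_path(beta):
    """Geometric averages: log f_beta = (1-beta) log f_0 + beta log f_K."""
    return lambda x: (1 - beta) * log_f_gauss(x, mu0, s0) + beta * log_f_gauss(x, mu1, s1)

def moment_path(beta):
    """Average the moments E[x] and E[x^2] of p_0 and p_K, then use the
    Gaussian with those moments as the intermediate distribution."""
    m1 = (1 - beta) * mu0 + beta * mu1
    m2 = (1 - beta) * (mu0**2 + s0**2) + beta * (mu1**2 + s1**2)
    s = np.sqrt(m2 - m1**2)
    return lambda x: log_f_gauss(x, m1, s)

def ais_log_z(path, n_chains=500, n_steps=100, mh_steps=5, step=0.5):
    """Estimate log(Z_K / Z_0) via AIS with Metropolis transition operators."""
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = rng.normal(mu0, s0, size=n_chains)        # exact samples from p_0
    log_w = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += path(b)(x) - path(b_prev)(x)     # importance-weight increment
        for _ in range(mh_steps):                 # MH steps leaving p_b invariant
            prop = x + step * rng.normal(size=n_chains)
            accept = np.log(rng.uniform(size=n_chains)) < path(b)(prop) - path(b)(x)
            x = np.where(accept, prop, x)
    # log-mean-exp of the weights is an estimate of log(Z_K / Z_0)
    return np.logaddexp.reduce(log_w) - np.log(n_chains)

true_log_ratio = np.log(s1 / s0)                  # analytic answer for Gaussians
print("true        :", true_log_ratio)
print("geometric   :", ais_log_z(geometric_path))
print("moment avg. :", ais_log_z(moment_path))

Both paths share the same endpoint densities, so both AIS runs estimate the same log partition-function ratio; only the intermediate distributions, and hence the variance of the estimate, differ.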


reference text

[1] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

[2] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

[3] Amir Globerson and Tommi Jaakkola. Approximate inference using conditional entropy decompositions. In 11th International Workshop on AI and Statistics (AISTATS 2007), 2007.

[4] Radford Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.

[5] John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[7] Jascha Sohl-Dickstein and Benjamin J. Culpepper. Hamiltonian annealed importance sampling for partition function estimation. Technical report, Redwood Center, UC Berkeley, 2012.

[8] Lucas Theis, Sebastian Gerwinn, Fabian Sinz, and Matthias Bethge. In all likelihood, deep belief is not enough. Journal of Machine Learning Research, 12:3071–3096, 2011.

[9] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In International Conference on Machine Learning, 2008.

[10] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. On tracking the partition function. In NIPS 24. MIT Press, 2011.

[11] Graham Taylor and Geoffrey Hinton. Products of hidden Markov models: It takes N > 1 to tango. In Uncertainty in Artificial Intelligence, 2009.

[12] Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163–186, 1998.

[13] Christopher Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56:5018–5035, 1997.

[14] Daan Frenkel and Berend Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, 2nd edition, 2002.

[15] Radford Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996.

[16] Gundula Behrens, Nial Friel, and Merrilee Hurn. Tuning tempered transitions. Statistics and Computing, 22:65–78, 2012.

[17] Ben Calderhead and Mark Girolami. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028–4045, 2009.

[18] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[19] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[20] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning, 2008.

[21] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[23] Y. Iba. Extended ensemble Monte Carlo. International Journal of Modern Physics C, 12(5):623–656, 2001.