nips nips2011 nips2011-197 nips2011-197-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville
Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and as feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, and (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing against both exact values and estimates obtained via annealed importance sampling (AIS), we show on several datasets that our method accurately tracks the log partition function. In contrast to AIS, our method provides this estimate at each time step, at a computational cost similar to that of training alone.
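To make the abstract's fusion step concrete, here is a minimal sketch, not the authors' implementation, of how a scalar Kalman-filter-style update can combine (1) a predicted per-update change in log Z with (2) a noisy direct "measurement" of log Z such as a bridge-sampling estimate. The function name, the noise variances, and the simulated inputs are all illustrative assumptions.

```python
import numpy as np

def kalman_track_logZ(delta_logZ_preds, logZ_measurements,
                      q_var=0.1, r_var=1.0, init_logZ=0.0, init_var=10.0):
    """Fuse predicted increments and noisy measurements of log Z.

    delta_logZ_preds : predicted change in log Z after each gradient update
    logZ_measurements: noisy direct estimates of log Z (e.g. bridge sampling)
    q_var, r_var     : assumed process / measurement noise variances
    """
    logZ, var = init_logZ, init_var
    track = []
    for dz, y in zip(delta_logZ_preds, logZ_measurements):
        # Predict: apply the estimated increment and inflate the uncertainty.
        logZ += dz
        var += q_var
        # Update: the scalar Kalman gain blends prediction and measurement.
        gain = var / (var + r_var)
        logZ += gain * (y - logZ)
        var *= (1.0 - gain)
        track.append(logZ)
    return np.array(track)

# Toy usage: a drifting "true" log Z observed through heavy measurement noise.
rng = np.random.default_rng(0)
true_logZ = np.cumsum(rng.normal(0.05, 0.02, size=200))
preds = np.diff(true_logZ, prepend=0.0) + rng.normal(0, 0.05, size=200)
meas = true_logZ + rng.normal(0, 1.0, size=200)
est = kalman_track_logZ(preds, meas)
print("final error:", abs(est[-1] - true_logZ[-1]))
```

The design intuition matches the abstract: the increment predictions are cheap and available at every gradient step but drift over time, while the direct estimates are noisy but unbiased anchors; the filter weights each source by its assumed variance.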
[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book (Now Publishers, 2009).
[2] Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2), 245–268.
[3] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[5] Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 105–112, New York, NY, USA. ACM.
[6] Desjardins, G., Courville, A., and Bengio, Y. (2010a). Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.
[7] Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010b). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 145–152.
[8] Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, University of Toronto. Version 1.
[9] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[10] Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics C, 12, 623–656.
[11] Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS’2011), volume 15 of JMLR W&CP.
[12] Larochelle, H., Bengio, Y., and Turian, J. (2010). Tractable multivariate binary density estimation and the restricted Boltzmann forest. Neural Computation, 22(9), 2285–2307.
[13] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[14] Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. (2009). Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3), 80–84.
[15] Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. EPL (Europhysics Letters), 19(6), 451.
[16] Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2009). Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516.
[17] Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: Approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004).
[18] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
[19] Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling. Technical Report No. 0511, Department of Statistics, University of Toronto.
[20] Salakhutdinov, R. (2010a). Learning deep Boltzmann machines using adaptive MCMC. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), volume 1, pages 943–950. ACM.
[21] Salakhutdinov, R. (2010b). Learning in Markov random fields using tempered transitions. In NIPS’09.
[22] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, volume 5, pages 448–455.
[23] Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages 872–879. ACM.
[24] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman, editors, ICML 2009, pages 1025–1032. ACM.
[25] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071. ACM.
[26] Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, ICML 2009, pages 1033–1040. ACM.
[27] Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In NIPS’04, volume 17, Cambridge, MA. MIT Press.