nips nips2011 nips2011-217 nips2011-217-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However, the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way, it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
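As a rough illustration of the idea in the abstract, the sketch below trains a diagonal Gaussian posterior over the weights of a toy linear model by minimising a sampled MDL-style loss (negative log-likelihood plus the KL divergence from the posterior to a fixed Gaussian prior), and then prunes weights with a low posterior signal-to-noise ratio. The toy data, the prior scale PRIOR_SIGMA, the learning rate, and the pruning threshold are illustrative assumptions, not values or details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative assumption, not the paper's TIMIT task).
X = rng.normal(size=(64, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=64)

PRIOR_SIGMA = 1.0        # fixed zero-mean Gaussian prior on each weight (assumed)
mu = np.zeros(5)         # posterior means
rho = np.full(5, -3.0)   # posterior log standard deviations (sigma = exp(rho))
lr = 0.005               # learning rate (assumed)

def kl_to_prior(mu, sigma, prior_sigma):
    # KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over all weights.
    return np.sum(np.log(prior_sigma / sigma)
                  + (sigma ** 2 + mu ** 2) / (2 * prior_sigma ** 2) - 0.5)

for step in range(2000):
    sigma = np.exp(rho)
    eps = rng.normal(size=5)
    w = mu + sigma * eps                     # reparameterised weight sample
    err = X @ w - y
    nll = 0.5 * np.sum(err ** 2)             # Gaussian NLL up to an additive constant
    loss = nll + kl_to_prior(mu, sigma, PRIOR_SIGMA)  # sampled MDL-style objective

    # Hand-derived gradients of the sampled loss w.r.t. mu and rho.
    g_w = X.T @ err                                    # d nll / d w
    g_mu = g_w + mu / PRIOR_SIGMA ** 2                 # + d KL / d mu
    g_rho = (g_w * eps + sigma / PRIOR_SIGMA ** 2 - 1.0 / sigma) * sigma
    mu -= lr * g_mu
    rho -= lr * g_rho

# A signal-to-noise style pruning rule (threshold 1.0 is an arbitrary choice):
# zero out weights whose posterior mean is small relative to its standard deviation.
snr = np.abs(mu) / np.exp(rho)
pruned_w = np.where(snr > 1.0, mu, 0.0)
print("kept", int(np.sum(snr > 1.0)), "of", mu.size, "weights")

The paper applies an objective of this kind to a hierarchical multidimensional recurrent network on TIMIT; the sketch above is only meant to show the shape of the loss and of a pruning rule, not the paper's exact procedure.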
[1] D. Barber and C. M. Bishop. Ensemble learning in Bayesian neural networks, pages 215–237. Springer-Verlag, Berlin, 1998.
[2] D. Barber and B. Schottky. Radial basis functions: A Bayesian treatment. In NIPS, 1997.
[3] G. E. Dahl, M. Ranzato, A.-r. Mohamed, and G. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 469–477. 2010.
[4] DARPA-ISTO. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc cd1-1.1 edition, 1990.
[5] B. J. Frey. Graphical models for machine learning and digital communication. MIT Press, Cambridge, MA, USA, 1998.
[6] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.
[7] C. L. Giles and C. W. Omlin. Pruning recurrent neural networks for improved generalization performance. IEEE Transactions on Neural Networks, 5:848–851, 1994.
[8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2006, Pittsburgh, USA, 2006.
[9] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545–552, 2008.
[10] G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, pages 5–13, 1993.
[11] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[12] A. Honkela and H. Valpola. Variational learning and bits-back coding: An information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15:800–810, 2004.
[13] K.-C. Jim, C. Giles, and B. Horne. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, November 1996.
[14] N. D. Lawrence. Variational Inference in Probabilistic Models. PhD thesis, University of Cambridge, 2000.
[15] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 598–605. Morgan Kaufmann, San Mateo, CA, 1990.
[16] D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Neural Computation, 1995.
[17] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight sharing. Neural Computation, 4:173–193, 1992.
[18] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.
[19] D. Plaut, S. Nowlan, and G. E. Hinton. Experiments on learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1986.
[20] M. Riedmiller and T. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In International Symposium on Neural Networks, 1993.
[21] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
[23] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27, 1948.
[24] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory, pages 194– 281. MIT Press, Cambridge, MA, USA, 1986.
[25] C. S. Wallace. Classification by minimum-message-length inference. In Proceedings of the International Conference on Advances in Computing and Information, ICCI '90, pages 72–81, New York, NY, USA, 1990. Springer-Verlag New York, Inc.
[26] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30:520–540, June 1987.