jmlr jmlr2010 jmlr2010-117 jmlr2010-117-reference knowledge-graph by maker-knowledge-mining

117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?


Source: pdf

Author: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization
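The abstract refers to the two-phase procedure studied in the paper: greedy layer-wise unsupervised pre-training of each layer, followed by supervised fine-tuning of the whole network. The following is a minimal, illustrative sketch of that procedure (not the authors' code) using tied-weight denoising autoencoders and a softmax output layer; the layer sizes, learning rates, corruption level, and the synthetic data are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae_layer(X, n_hidden, corruption=0.3, lr=0.1, epochs=50):
    """Train one denoising autoencoder (tied weights); return encoder parameters."""
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction of its entries.
        X_tilde = X * (rng.random(X.shape) > corruption)
        H = sigmoid(X_tilde @ W + b_h)        # encode the corrupted input
        X_rec = sigmoid(H @ W.T + b_v)        # decode / reconstruct
        # Gradients of the squared reconstruction error (tied-weight autoencoder).
        d_rec = (X_rec - X) * X_rec * (1.0 - X_rec)
        d_hid = (d_rec @ W) * H * (1.0 - H)
        W -= lr * (X_tilde.T @ d_hid + d_rec.T @ H) / len(X)
        b_h -= lr * d_hid.mean(axis=0)
        b_v -= lr * d_rec.mean(axis=0)
    return W, b_h

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise phase: each new layer models the previous layer's code."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        params.append([W, b])
        H = sigmoid(H @ W + b)
    return params

def finetune(X, y, params, n_classes, lr=0.1, epochs=100):
    """Supervised phase: add a softmax output layer and backpropagate through the stack."""
    W_out = rng.normal(0.0, 0.1, (params[-1][0].shape[1], n_classes))
    b_out = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        # Forward pass through the pre-trained layers.
        acts = [X]
        for W, b in params:
            acts.append(sigmoid(acts[-1] @ W + b))
        logits = acts[-1] @ W_out + b_out
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        # Backward pass for the softmax cross-entropy loss.
        delta = (P - Y) / len(X)
        grad_W_out, grad_b_out = acts[-1].T @ delta, delta.sum(axis=0)
        delta = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
        W_out -= lr * grad_W_out
        b_out -= lr * grad_b_out
        for i in range(len(params) - 1, -1, -1):
            W, b = params[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:
                delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
            params[i] = [W - lr * grad_W, b - lr * grad_b]
    return params, W_out, b_out

# Toy usage on synthetic data, purely to show the two-phase call sequence.
X = rng.random((200, 20))
y = (X.mean(axis=1) > 0.5).astype(int)
stack = pretrain_stack(X, layer_sizes=[32, 16])            # unsupervised pre-training
stack, W_out, b_out = finetune(X, y, stack, n_classes=2)   # supervised fine-tuning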


reference text

Shun-ichi Amari, Noboru Murata, Klaus-Robert Müller, Michael Finke, and Howard Hua Yang. Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 8(5):985–996, 1997.
Lalit Bahl, Peter Brown, Peter deSouza, and Robert Mercer. Maximum mutual information estimation of hidden Markov parameters for speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 49–52, Tokyo, Japan, 1986.
Andrew E. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561–576. Kluwer Academic Publishers, 1991.
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), Cambridge, MA, 2002. MIT Press.
Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book by Now Publishers, 2009.
Yoshua Bengio and Olivier Delalleau. Justifying and generalizing contrastive divergence. Neural Computation, 21(6):1601–1621, June 2009.
Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 321–360. MIT Press, 2007.
Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. The curse of highly variable functions for local kernel machines. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS'05), pages 107–114. MIT Press, Cambridge, MA, 2006.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Bernhard Schölkopf, John Platt, and Thomas Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS'06), pages 153–160. MIT Press, 2007.
Marc H. Bornstein, editor. Sensitive Periods in Development: Interdisciplinary Perspectives. Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.
Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS'02), pages 585–592, Cambridge, MA, 2003. MIT Press.
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 160–167. ACM, 2008.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, Université de Montréal, 2009.
Patrick Gallinari, Yann LeCun, Sylvie Thiria, and Françoise Fogelman-Soulié. Mémoires associatives distribuées. In Proceedings of COGNITIVA 87, Paris, La Villette, 1987.
Ian Goodfellow, Quoc Le, Andrew Saxe, and Andrew Ng. Measuring invariances in deep networks. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS'09), pages 646–654, 2009.
Raia Hadsell, Ayse Erkan, Pierre Sermanet, Marco Scoffier, Urs Muller, and Yann LeCun. Deep belief net learning in a long-range vision system for autonomous off-road driving. In Proc. Intelligent Robots and Systems (IROS'08), pages 628–633, 2008.
Johan Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California, 1986. ACM Press.
Johan Håstad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1:113–129, 1991.
Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
Geoffrey E. Hinton. To recognize shapes, first learn to generate images. In Paul Cisek, Trevor Drew, and John Kalaska, editors, Computational Neuroscience: Theoretical Insights into Brain Function. Elsevier, 2007.
Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 536–543. ACM, 2008.
Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML'07), pages 473–480, 2007.
Hugo Larochelle, Yoshua Bengio, Jerome Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, January 2009.
Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR'06), pages 87–94, Washington, DC, USA, 2006. IEEE Computer Society.
Yann LeCun. Modèles connexionistes de l'apprentissage. PhD thesis, Université de Paris VI, 1987.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area V2. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 873–880. MIT Press, Cambridge, MA, 2008.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Léon Bottou and Michael Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML'09). ACM, Montreal (Qc), Canada, 2009.
Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.
Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Léon Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning (ICML'09), pages 737–744, Montreal, June 2009. Omnipress.
Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), pages 841–848, 2002.
Simon Osindero and Geoffrey E. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1121–1128, Cambridge, MA, 2008. MIT Press.
Dan Povey and Philip C. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'02), volume 1, pages 105–108, 2002.
Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS'06), pages 1137–1144. MIT Press, 2007.
Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1185–1192, Cambridge, MA, 2008. MIT Press.
Ruslan Salakhutdinov and Geoffrey E. Hinton. Using deep belief nets to learn covariance kernels for Gaussian processes. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1249–1256, Cambridge, MA, 2008. MIT Press.
Ruslan Salakhutdinov and Geoffrey E. Hinton. Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and Applications of Graphical Models (SIGIR 2007), Amsterdam, 2007. Elsevier.
Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Zoubin Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML'07), pages 791–798, New York, NY, USA, 2007. ACM.
Sebastian H. Seung. Learning continuous attractors in recurrent networks. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS'97), pages 654–660. MIT Press, 1998.
Jonas Sjöberg and Lennart Ljung. Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6):1391–1407, 1995.
Joshua M. Susskind, Geoffrey E. Hinton, Javier R. Movellan, and Adam K. Anderson. Generating facial expressions with deep belief nets. In V. Kordic, editor, Affective Computing, Emotion Modelling, Synthesis and Recognition, pages 421–440. ARS Publishers, 2008.
Joshua Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.
Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 1096–1103. Omnipress, 2008.
Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential family harmoniums with an application to information retrieval. In L.K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS'04), pages 1481–1488, Cambridge, MA, 2005. MIT Press.
Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1168–1175, New York, NY, USA, 2008. ACM.
Andrew Yao. Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10, 1985.
Long Zhu, Yuanhao Chen, and Alan Yuille. Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):114–128, 2009.