Greedy Layer-Wise Training of Deep Networks (NIPS 2006, paper 88)

Source: pdf

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
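
The greedy layer-wise strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary restricted Boltzmann machines trained with one step of contrastive divergence (CD-1), full-batch updates, and hidden probabilities (rather than samples) passed up as input to the next layer. The function names `train_rbm` and `greedy_pretrain`, and all hyperparameter values, are hypothetical and chosen only for illustration. Supervised fine-tuning of the stacked weights is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train one binary RBM with CD-1 (illustrative sketch).

    Returns (W, b_vis, b_hid). Uses full-batch updates for brevity.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visibles and back up.
        # Reconstruction probabilities are used instead of binary samples.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid


def greedy_pretrain(data, layer_sizes, **rbm_kwargs):
    """Train a stack of RBMs one layer at a time, bottom up."""
    params = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, _, b_hid = train_rbm(layer_input, n_hidden, **rbm_kwargs)
        params.append((W, b_hid))
        # The next layer is trained on this layer's hidden activations;
        # lower layers stay fixed while upper layers are trained.
        layer_input = sigmoid(layer_input @ W + b_hid)
    return params


# Usage on random binary data (synthetic, for shape checking only).
X = (np.random.default_rng(0).random((64, 20)) < 0.5).astype(float)
stack = greedy_pretrain(X, [16, 8], epochs=5)
```

The returned list of `(W, b_hid)` pairs could then initialize the hidden layers of a feed-forward network before gradient-based fine-tuning, which is the role the abstract attributes to the unsupervised phase.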

reference text

Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 1–18. Lecture Notes in Computer Science 1180.
Bengio, Y., Delalleau, O., & Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18, pp. 107–114. MIT Press, Cambridge, MA.
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., & Weston, J. (Eds.), Large Scale Kernel Machines. MIT Press.
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., & Marcotte, P. (2006). Convex neural networks. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18, pp. 123–130. MIT Press, Cambridge, MA.
Chen, H., & Murray, A. (2003). A continuous restricted Boltzmann machine with an implementable training algorithm. IEE Proceedings of Vision, Image and Signal Processing, 150(3), 153–158.
Fahlman, S., & Lebiere, C. (1990). The cascade-correlation learning architecture. In Touretzky, D. (Ed.), Advances in Neural Information Processing Systems 2, pp. 524–532, Denver, CO. Morgan Kaufmann, San Mateo.
Hastad, J. T. (1987). Computational Limitations for Small Depth Circuits. MIT Press, Cambridge, MA.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Lengellé, R., & Denoeux, T. (1996). Training MLPs layer by layer using an objective function for internal representations. Neural Networks, 9, 83–97.
Movellan, J., Mineiro, P., & Williams, R. (2002). A Monte-Carlo EM approach for partially observable diffusion processes: theory and applications to neural networks. Neural Computation, 14, 1501–1544.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–277.
Utgoff, P., & Stracuzzi, D. (2002). Many-layered learning. Neural Computation, 14, 2497–2539.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, Vol. 17. MIT Press, Cambridge, MA.