jmlr2011-48-reference knowledge-graph by maker-knowledge-mining

48 jmlr-2011-Kernel Analysis of Deep Networks


Source: pdf

Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller

Abstract: It is common knowledge that training a deep network forms an efficient and well-generalizing representation of the problem. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the network, and by measuring how well these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the network controls how fast the representation of the task is formed, layer after layer.

Keywords: deep networks, kernel principal component analysis, representations
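
The layer-wise procedure sketched in the abstract can be made concrete in a few lines of code. The following Python sketch is not the authors' implementation; it is a minimal illustration under assumed choices: a Gaussian kernel with bandwidth sigma, d leading kernel PCA components, and network layers given as plain callables. At each depth it maps the data through one more layer, builds the kernel on the mapped data, and measures how well the leading kernel principal components fit the labels.

import numpy as np

def rbf_kernel(X, sigma):
    # Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-dists / (2.0 * sigma ** 2))

def kpca_fit_error(K, Y, d):
    # Relative residual of projecting the labels Y onto the d leading
    # components of the centered kernel matrix (kernel PCA).
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    evals, evecs = np.linalg.eigh(H @ K @ H)   # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :d]                  # d leading components u_1, ..., u_d
    Y_hat = U @ (U.T @ Y)                      # projection of the labels
    return np.linalg.norm(Y - Y_hat) ** 2 / np.linalg.norm(Y) ** 2

def layerwise_kernel_analysis(layers, X, Y, sigma=1.0, d=10):
    # Track the kernel fit as more and more layers are subsumed.
    Z, errors = X, []
    for layer in [None] + list(layers):        # depth 0 is the raw input
        if layer is not None:
            Z = layer(Z)                       # map through one more layer
        errors.append(kpca_fit_error(rbf_kernel(Z, sigma), Y, d))
    return errors

A decreasing error curve over depth corresponds to the paper's observation that each layer makes the representation of the task easier to fit with a simple kernel.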


reference text

Gaston Baudat and Fatiha Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404, 2000.
Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160, 2006.
Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1996.
[Figure 8 from the original paper: Illustration of the concept of relevant dimensions in kernel feature spaces (Braun et al., 2008). Left: samples drawn from a toy distribution p(x, y). Right: label contributions |u_i^T Y| of each kPCA component u_1, ..., u_n. A small number of leading principal components containing relevant label information emerge from a flat noise bed; a hedged numerical sketch of this picture follows the reference list.]
Léon Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes, 1991.
Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176, pages 169–207. Springer, 2004.
Mikio L. Braun. Accurate bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7:2303–2328, 2006.
Mikio L. Braun, Joachim Buhmann, and Klaus-Robert Müller. On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research, 9:1875–1908, 2008.
Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350, 2009.
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
Ian Goodfellow, Quoc Le, Andrew Saxe, and Andrew Y. Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems 22, pages 646–654, 2009.
Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160:106–154, January 1962.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Kevin J. Lang, Alex H. Waibel, and Geoffrey E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23–43, 1990.
Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.
Yann LeCun. Generalization and network design strategies. In Connectionism in Perspective. Elsevier, 1989. An extended version was published as a technical report of the University of Toronto.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 41–48, 1999.
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller. Learning discriminative and invariant nonlinear features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623–628, 2003.
Grégoire Montavon, Mikio Braun, and Klaus-Robert Müller. Layer-wise analysis of deep networks with Gaussian kernels. In Advances in Neural Information Processing Systems 23, pages 1678–1686, 2010.
Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–202, 2001.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010.
Genevieve B. Orr and Klaus-Robert Müller, editors. Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science. Springer, 1998. This book is an outgrowth of a 1996 NIPS workshop.
Dario L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. The Journal of Neurophysiology, 88(1):455–463, 2002.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
Ruslan Salakhutdinov and Geoffrey Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
Bernhard Schölkopf, Sebastian Mika, Chris J. C. Burges, Philipp Knirsch, Klaus-Robert Müller, Gunnar Rätsch, and Alex J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.
Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 994–1000, 2005.
Steve Smale, Lorenzo Rosasco, Jake Bouvrie, Andrea Caponnetto, and Tomaso Poggio. Mathematics of the neural response. Foundations of Computational Mathematics, 10(1):67–91, 2010.
Alex J. Smola, Bernhard Schölkopf, and Klaus-Robert Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637–649, 1998.
Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, pages 1168–1175, 2008.
Andre Wibisono, Jake Bouvrie, Lorenzo Rosasco, and Tomaso Poggio. Learning and invariance in a family of hierarchical kernels. Technical report, Massachusetts Institute of Technology, 2010.
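
The bracketed Figure 8 caption above promises a numerical sketch of the relevant-dimensions picture of Braun et al. (2008). The toy distribution, sample size, and kernel bandwidth below are assumptions chosen purely for illustration; the point is only that the label contributions |u_i^T y| of the leading centered-kernel-PCA components stand out from a flat noise bed.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)   # smooth signal plus noise

# Gaussian kernel matrix (bandwidth 1), centered as in kernel PCA.
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 2.0)
H = np.eye(n) - np.ones((n, n)) / n
evals, evecs = np.linalg.eigh(H @ K @ H)   # eigenvalues in ascending order
U = evecs[:, ::-1]                         # components u_1, ..., u_n

contrib = np.abs(U.T @ y)                  # label contribution |u_i^T y|
print("leading components:", np.round(contrib[:5], 2))
print("noise bed (median of the rest):", np.round(np.median(contrib[5:]), 3))

On this toy problem a handful of leading contributions dominate while the remaining ones form an approximately flat bed, mirroring the right panel of the figure.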