
119 nips-2009-Kernel Methods for Deep Learning


Source: pdf

Author: Youngmin Cho, Lawrence K. Saul

Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels and deep belief nets.
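The abstract only names the kernel family; as a concrete illustration, the following is a minimal sketch of the paper's degree-1 arc-cosine kernel, k(x, y) = (1/pi) * ||x|| * ||y|| * (sin t + (pi - t) cos t) with t the angle between x and y, plugged into an SVM via scikit-learn's precomputed-kernel interface. The helper name arccos_kernel_deg1, the toy data, and the use of scikit-learn are illustrative assumptions, not part of the paper.

import numpy as np
from sklearn.svm import SVC

def arccos_kernel_deg1(X, Y):
    # Degree-1 arc-cosine kernel (Cho & Saul, 2009):
    # k(x, y) = (1/pi) * ||x|| * ||y|| * (sin t + (pi - t) * cos t),
    # where t is the angle between x and y.
    norms_X = np.linalg.norm(X, axis=1)
    norms_Y = np.linalg.norm(Y, axis=1)
    cos_t = (X @ Y.T) / np.outer(norms_X, norms_Y)
    cos_t = np.clip(cos_t, -1.0, 1.0)  # guard against round-off outside [-1, 1]
    t = np.arccos(cos_t)
    return (np.outer(norms_X, norms_Y) / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))

# Toy usage: train an SVM on a precomputed Gram matrix,
# then predict from the test-vs-train kernel matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 10))

svm = SVC(kernel="precomputed")
svm.fit(arccos_kernel_deg1(X_train, X_train), y_train)
preds = svm.predict(arccos_kernel_deg1(X_test, X_train))

Note that the paper's MKMs compose such kernels layer by layer, recomputing the angle from the previous layer's kernel values to mimic a multilayer net; this single-kernel snippet does not attempt that composition.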


reference text

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.

[2] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[3] G.E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.

[4] M.A. Ranzato, F.J. Huang, Y.L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-07), pages 1–8, 2007.

[5] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pages 160–167, 2008.

[6] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, to appear, 2009.

[7] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pages 1168–1175, 2008.

[8] C.K.I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.

[9] R.H.R. Hahnloser, H.S. Seung, and J.J. Slotine. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15(3):621–638, 2003.

[10] R.M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.

[11] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML-07), pages 473–480, 2007.

[12] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[13] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[14] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[15] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

[16] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik, 1996.

[17] J. Goldberger, S. Roweis, G.E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L.K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, 2005.

[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.

[19] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), pages 539–546, 2005.

[20] Y. LeCun and C. Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[21] M. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[22] A.J. Smola, O.L. Mangasarian, and B. Schölkopf. Sparse kernel feature analysis. Technical Report 99-04, University of Wisconsin, Data Mining Institute, Madison, 1999.

[23] G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

[24] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[25] G.F. Carrier, M. Krook, and C.E. Pearson. Functions of a Complex Variable: Theory and Technique. Society for Industrial and Applied Mathematics, 2005.