jmlr jmlr2008 jmlr2008-70 jmlr2008-70-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but they also perform this transformation in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
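The abstract describes the proposed procedure only at a high level. The following minimal NumPy sketch is not the authors' exact estimator; it merely illustrates the underlying idea that the label information concentrates in a few leading kernel PCA components, using a held-out error criterion as a proxy. The RBF width, fold count, and all function names are illustrative assumptions.

# Hypothetical illustration, NOT the paper's algorithm: pick the number of
# leading kernel PCA components by cross-validated 0/1 error of a simple
# least-squares readout on the truncated components. Labels are assumed +/-1.
import numpy as np

def rbf_kernel(X, width):
    # Gaussian (RBF) kernel matrix k(x, x') = exp(-||x - x'||^2 / (2 width^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def kernel_pca_coordinates(K):
    # Center the kernel matrix and return the samples' coordinates along the
    # kernel PCA directions, sorted by decreasing eigenvalue.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    evals, evecs = np.linalg.eigh(H @ K @ H)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))

def estimate_relevant_dimension(X, y, width=1.0, max_dim=50, n_splits=5, seed=0):
    # Proxy criterion: held-out 0/1 error of a least-squares readout restricted
    # to the d leading kernel PCA components, for d = 1 .. max_dim.
    Z = kernel_pca_coordinates(rbf_kernel(X, width))
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_splits)
    errors = []
    for d in range(1, min(max_dim, n) + 1):
        fold_errors = []
        for k in range(n_splits):
            test = folds[k]
            train = np.concatenate(folds[:k] + folds[k + 1:])
            w, *_ = np.linalg.lstsq(Z[train, :d], y[train], rcond=None)
            pred = np.where(Z[test, :d] @ w >= 0.0, 1.0, -1.0)
            fold_errors.append(np.mean(pred != y[test]))
        errors.append(np.mean(fold_errors))
    errors = np.array(errors)
    return int(np.argmin(errors)) + 1, errors

For a two-class data set X with labels y in {-1, +1}, calling estimate_relevant_dimension(X, y, width=2.0) would return the dimension with the lowest cross-validated error together with the full error curve; plotting that curve shows how quickly the relevant information saturates, which is the kind of geometric analysis, model-selection aid, and feature-space denoising the abstract refers to in points (1)-(3).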
Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2–3):259–294, 2007.
Mikio L. Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7:2303–2328, November 2006.
Mikio L. Braun. Spectral Properties of the Kernel Matrix and Their Application to Kernel Methods in Machine Learning. PhD thesis, University of Bonn, 2005. Available electronically at http://hss.ulb.uni-bonn.de/diss_online/math_nat_fak/2005/braun_mikio.
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
Chandler Davis and William M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
Theodoros Evgeniou and Massimiliano Pontil. On the Vγ dimension for regression in reproducing kernel Hilbert spaces. In Proceedings of Algorithmic Learning Theory, 1999.
Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
Vladimir Koltchinskii and Evarist Giné. Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113–167, 2000.
Vladimir I. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998.
Sebastian Mika. Kernel Fisher Discriminants. PhD thesis, Technische Universität Berlin, December 2002.
Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, May 2001.
Gunnar Rätsch, Takashi Onoda, and Klaus-Robert Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, March 2001.
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.
Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
Bernhard Schölkopf, Sebastian Mika, Christopher J. C. Burges, Philipp Knirsch, Klaus-Robert Müller, Gunnar Rätsch, and Alex J. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
John Shawe-Taylor, Christopher K. I. Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510–2522, July 2005.
Alex J. Smola, Bernhard Schölkopf, and Klaus-Robert Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637–649, 1998.
Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf. Large scale genomic sequence SVM classifiers. In Proceedings of the 22nd International Machine Learning Conference, pages 848–855. ACM Press, 2005.
Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
Régis Vert, Laurent Zwald, Gilles Blanchard, and Pascal Massart. Kernel projection machine: a new tool for pattern recognition. In Advances in Neural Information Processing Systems (NIPS 2004), pages 1649–1656, 2005.
Ulrike von Luxburg. Statistical Learning with Similarity and Dissimilarity Functions. PhD thesis, Technische Universität Berlin, 2004.
Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
Robert C. Williamson, Alex J. Smola, and Bernhard Schölkopf. Generalization bounds for regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.
Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems (NIPS 2005), volume 18, 2006.