nips nips2006 nips2006-65 nips2006-65-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and (3) to de-noise in feature space in order to yield better classification results.
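The analysis described in the abstract can be pictured with a short numerical sketch. The following is a minimal illustration, not the authors' implementation: it diagonalizes the scaled kernel matrix, projects the label vector onto the kernel PCA directions $u_i$, and reads off how many leading components carry most of the label energy. The rbf kernel, its width, the 95% energy threshold, and all function names are assumptions made for this sketch.

import numpy as np

def rbf_kernel(X, width=1.0):
    # [K]_ij = exp(-||x_i - x_j||^2 / (2 width^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def relevant_dimension(K, y, threshold=0.95):
    # Diagonalize K/n, project the label vector onto the eigenvectors u_i,
    # and return the number of leading kernel PCA components that carry
    # `threshold` of the projected label energy.
    n = K.shape[0]
    eigvals, U = np.linalg.eigh(K / n)
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalues
    eigvals, U = eigvals[order], U[:, order]
    coeff = U.T @ y / np.sqrt(n)                 # u_i^T y / sqrt(n)
    energy = np.cumsum(coeff ** 2) / np.sum(coeff ** 2)
    dim = int(np.searchsorted(energy, threshold)) + 1
    return dim, eigvals, coeff

# Toy usage: two Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
dim, eigvals, coeff = relevant_dimension(rbf_kernel(X, width=2.0), y)
print("estimated relevant dimension:", dim)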
[1] ML Braun. Spectral Properties of the Kernel Matrix and Their Application to Kernel Methods in Machine Learning. PhD thesis, University of Bonn, 2005. Available electronically at http://hss.ulb.uni-bonn.de/diss_online/math_nat_fak/2005/braun_mikio.
[2] ML Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 2006. To appear.
[3] V Koltchinskii and E Giné. Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113–167, 2000.
[4] VI Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998.
[5] S Mika, B Schölkopf, A Smola, K-R Müller, M Scholz, and G Rätsch. Kernel PCA and de-noising in feature space. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.
[6] K-R Müller, S Mika, G Rätsch, K Tsuda, and B Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, May 2001.
[7] G Rätsch, T Onoda, and K-R Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, March 2001.
[8] B Schölkopf, S Mika, CJC Burges, P Knirsch, K-R Müller, G Rätsch, and AJ Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.
[9] B Schölkopf and AJ Smola. Learning with Kernels. MIT Press, 2001.
[10] B Schölkopf, AJ Smola, and K-R Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[11] V Vapnik. Statistical Learning Theory. Wiley, 1998.
[12] G Wahba. Spline Models For Observational Data. Society for Industrial and Applied Mathematics, 1990.
[13] T Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.
[14] L Zwald and G Blanchard. On the convergence of eigenspaces in kernel principal components analysis. In Advances in Neural Information Processing Systems 18 (NIPS 2005), 2006.

A Proof of Theorem 1

First, let us collect the definitions concerning kernel functions. Let $k$ be a Mercer kernel with $k(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x)\psi_i(x')$ and $k(x, x) \le K < \infty$. The kernel matrix of $k$ for an $n$-sample $X_1, \dots, X_n$ is $[\mathbf{K}]_{ij} = k(X_i, X_j)/n$; its eigenvalues are $l_i$ and its eigenvectors are $u_i$. The kernel $k$ is approximated by the truncated kernel $k_r(x, x') = \sum_{i=1}^{r} \lambda_i \psi_i(x)\psi_i(x')$, whose kernel matrix is denoted by $\mathbf{K}_r$ and whose eigenvalues by $m_i$. The approximation error is measured by $E_r = \|\mathbf{K}_r - \mathbf{K}\|$. We measure the amount of clustering $c_i$ of the eigenvalues by the number of eigenvalues of $\mathbf{K}_r$ between $l_i/2$ and $2l_i$. The matrix containing the sample vectors of the first $r$ eigenfunctions $\psi_i$ of $k$ is given by $[\Psi_r]_{i\ell} = \psi_\ell(X_i)/\sqrt{n}$, $1 \le i \le n$, $1 \le \ell \le r$. Since the eigenfunctions are asymptotically orthogonal, we can expect that the scalar products between the sample vectors of the eigenfunctions converge to either 0 or 1. The deviation is measured by the matrix $\mathbf{C}_r = \Psi_r^\top \Psi_r - I$. Finally, let $\Lambda_r = \sum_{i=r+1}^{\infty} \lambda_i$.

Next, we collect the definitions concerning some function $f$. Let $f = \sum_{i=1}^{\infty} \lambda_i \alpha_i \psi_i$ with $(\alpha_i) \in \ell^2$ and $|f| \le A < \infty$. The size of the contribution of the first $r$ terms is measured by $a_r = \sum_{i=1}^{r} |\alpha_i|$. Define the error of truncating $f$ to the first $r$ elements of its series expansion by $T_r = \big(\sum_{i=r+1}^{\infty} \lambda_i^2 \alpha_i^2\big)^{1/2}$.

The proof of Theorem 1 is based on performing rough estimates of the bound from Theorem 4.92 in [1]. The bound is
$$\frac{1}{\sqrt{n}}\,\big|u_i^\top f(X)\big| < \min_{1 \le r \le n} \Big( l_i\, D(r, n) + E(r, n) + T(r, n) \Big),$$
where the three terms are given by
$$D(r, n) = 2 a_r \|\Psi_r^+\|\, c_i, \qquad E(r, n) = 2 r a_r \|\Psi_r^+\|\, E_r, \qquad T(r, n) = T_r + \sqrt{\frac{4 A T_r}{n\delta}}.$$

It holds that $\|\Psi_r^+\| \le (1 - \|\Psi_r^\top \Psi_r - I\|)^{-1/2} = (1 - \|\mathbf{C}_r\|)^{-1/2}$ ([1], Lemma 4.44). Furthermore, $\|\mathbf{C}_r\| \to 0$ as $n \to \infty$ for fixed $r$. For kernels with bounded diagonal, it holds with probability larger than $1 - \delta$ ([1], Lemma 3.135) that
$$\|\mathbf{C}_r\| \le r \sqrt{\frac{r(r+1)K}{\lambda_r n \delta}} = r^2\, O(n^{-1/2}),$$
with a rather large constant, especially if $\lambda_r$ is small. Consequently, $\|\Psi_r^+\| \le (1 - \|\mathbf{C}_r\|)^{-1/2} = 1 + O(r n^{-1/4})$.

Now, Lemma 3.135 in [1] bounds $E_r$, from which we can derive the asymptotic
$$E_r \le \lambda_r + \Lambda_r + \sqrt{\frac{2 K \Lambda_r}{n\delta}} = \Lambda_r + \sqrt{\Lambda_r}\, O(n^{-1/2}),$$
assuming that $K$ is reasonably small (for example, for rbf-kernels, $K = 1$). Combining this with our rate for $\|\Psi_r^+\|$, we obtain
$$E(r, n) = 2 r a_r \Big( \Lambda_r + \sqrt{\Lambda_r}\, O(n^{-1/2}) \Big) \big( 1 + O(r n^{-1/4}) \big) = 2 r a_r \Lambda_r \big( 1 + O(r n^{-1/4}) \big) + r a_r \sqrt{\Lambda_r}\, O(n^{-1/2}).$$

Finally, we obtain
$$\frac{1}{\sqrt{n}}\,\big|u_i^\top f(X)\big| \le 2 l_i a_r c_i \big( 1 + O(r n^{-1/4}) \big) + 2 r a_r \Lambda_r \big( 1 + O(r n^{-1/4}) \big) + r a_r \sqrt{\Lambda_r}\, O(n^{-1/2}) + T_r + \sqrt{A T_r}\, O(n^{-1/2}).$$
If we assume that $\Lambda_r$ is rather small, we replace $1 + O(r n^{-1/4})$ by $O(1)$ in the second term and obtain the claimed rate.
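To make the quantity controlled by this bound concrete, here is a minimal numerical sketch, assuming an rbf kernel of width 1 and the smooth test function f(x) = sin(x) (both illustrative choices, not taken from the paper). It simply tabulates the coefficients $|u_i^\top f(X)|/\sqrt{n}$ of the sample vector of $f$ in the eigenbasis of the kernel matrix, which decay rapidly across the leading kernel PCA components, the behaviour formalized by Theorem 1.

import numpy as np

n = 200
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=n)

# Kernel matrix [K]_ij = k(X_i, X_j) / n for an rbf kernel of width 1.
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2.0) / n

eigvals, U = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]                # l_1 >= l_2 >= ...
eigvals, U = eigvals[order], U[:, order]

f = np.sin(X)                                    # a smooth function of the data
coeff = np.abs(U.T @ f) / np.sqrt(n)             # |u_i^T f(X)| / sqrt(n)

for i in range(10):
    print(f"component {i + 1}: eigenvalue {eigvals[i]:.2e}, coefficient {coeff[i]:.2e}")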