nips nips2005 nips2005-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
Reference: text
sentIndex sentText sentNum sentScore
1 Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. [sent-9, score-0.497]
2 We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. [sent-10, score-0.237]
3 As a consequence this allows to infer stability results for these estimated spaces. [sent-11, score-0.053]
4 Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. [sent-13, score-0.048]
5 It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. [sent-14, score-0.156]
6 If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. [sent-15, score-0.385]
7 Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. [sent-16, score-0.458]
8 The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. [sent-17, score-0.065]
9 The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. [sent-18, score-0.144]
10 For PCA as well as KPCA, the true covariance matrix (resp. [sent-19, score-0.079]
11 covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. [sent-20, score-0.152]
12 The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. [sent-21, score-0.208]
13 This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. [sent-23, score-0.155]
14 What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. [sent-24, score-0.444]
15 In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. [sent-26, score-0.363]
16 First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. [sent-28, score-0.187]
17 In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e. [sent-30, score-0.391]
18 In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. [sent-33, score-0.221]
19 In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). [sent-34, score-0.16]
20 The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. [sent-35, score-0.406]
21 Similar methods have been used to prove consistency of spectral clustering [12, 11]. [sent-36, score-0.062]
22 An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. [sent-37, score-0.41]
23 The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. [sent-38, score-0.08]
24 This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. [sent-39, score-0.343]
25 Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. [sent-40, score-0.343]
26 • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. [sent-41, score-0.307]
27 In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. [sent-42, score-0.075]
28 In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. [sent-43, score-0.601]
29 We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). [sent-49, score-0.186]
30 The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. [sent-50, score-0.264]
31 All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. [sent-51, score-0.372]
32 Let C denote the covariance operator of variable ϕ(X). [sent-53, score-0.269]
33 To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . [sent-54, score-0.182]
34 It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . [sent-63, score-0.127]
35 The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. [sent-67, score-0.316]
36 We denote PSD , PSD the orthogonal projectors on b these spaces. [sent-68, score-0.107]
37 A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. [sent-71, score-0.247]
38 An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. [sent-72, score-0.633]
39 2 n As for the second step, [7] provides the following perturbation bound (see also e. [sent-75, score-0.228]
40 2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . [sent-78, score-0.387]
41 Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. [sent-82, score-0.456]
42 Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. [sent-84, score-0.366]
43 Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . [sent-86, score-0.064]
44 Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. [sent-87, score-0.523]
45 Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. [sent-88, score-0.275]
46 Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . [sent-90, score-0.332]
47 The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). [sent-92, score-0.172]
48 In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. [sent-93, score-0.519]
49 We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. [sent-95, score-0.567]
50 The proof uses the same kind of techniques as in [7]. [sent-96, score-0.055]
51 Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . [sent-97, score-0.431]
52 Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. [sent-101, score-0.222]
53 P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. [sent-103, score-0.446]
54 Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. [sent-107, score-0.231]
55 the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . [sent-112, score-0.138]
56 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . [sent-113, score-0.106]
57 (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. [sent-115, score-0.133]
58 Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. [sent-116, score-0.238]
59 Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. [sent-117, score-0.307]
60 Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. [sent-118, score-0.128]
61 In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . [sent-120, score-0.386]
62 Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. [sent-122, score-0.197]
63 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. [sent-124, score-0.707]
64 We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. [sent-128, score-0.327]
65 In this work we assumed for simplicity of notation the eigenvalues to be simple. [sent-130, score-0.138]
66 In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . [sent-131, score-0.451]
67 possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. [sent-134, score-0.054]
68 This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. [sent-135, score-0.34]
69 However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. [sent-136, score-0.094]
70 Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. [sent-137, score-0.235]
71 Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. [sent-138, score-0.395]
72 Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? [sent-139, score-0.063]
73 They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. [sent-141, score-0.154]
74 We give a short proof for the sake of n 1 completness. [sent-147, score-0.079]
75 We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . [sent-149, score-0.102]
76 The variation of this proof with respect to Theorem 5. [sent-153, score-0.055]
77 2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. [sent-154, score-0.127]
78 The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . [sent-155, score-0.55]
79 Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . [sent-157, score-0.386]
80 (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . [sent-158, score-0.328]
81 In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . [sent-162, score-0.171]
82 The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . [sent-163, score-0.051]
83 (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). [sent-167, score-0.138]
84 60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . [sent-169, score-0.193]
85 (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. [sent-170, score-0.119]
86 For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. [sent-171, score-0.116]
87 RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . [sent-184, score-0.386]
88 op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). [sent-185, score-0.238]
89 Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . [sent-189, score-0.216]
90 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. [sent-190, score-0.104]
91 ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . [sent-195, score-0.359]
92 Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . [sent-197, score-0.129]
93 = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. [sent-200, score-0.422]
94 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. [sent-201, score-0.193]
95 Kernel projection machine: a new tool for pattern recognition. [sent-225, score-0.078]
96 Asymptotics of spectral projections of some random matrices approximating integral operators. [sent-242, score-0.093]
97 Random matrix approximation of spectra of integral e operators. [sent-247, score-0.055]
98 Nonlinear component analysis as a o u kernel eigenvalue problem. [sent-255, score-0.088]
99 On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. [sent-268, score-0.088]
100 On the convergence of spectral clustering on random samples: the normalized case. [sent-281, score-0.076]
wordName wordTfidf (topN-words)
[('ra', 0.459), ('sd', 0.387), ('pf', 0.232), ('psd', 0.212), ('op', 0.193), ('kpca', 0.191), ('eigenspaces', 0.191), ('operator', 0.19), ('pg', 0.171), ('perturbation', 0.153), ('eigenvalues', 0.138), ('hs', 0.136), ('operators', 0.124), ('eigenvectors', 0.123), ('cn', 0.101), ('reconstruction', 0.1), ('subspace', 0.098), ('projector', 0.098), ('lemma', 0.094), ('cx', 0.093), ('bd', 0.085), ('gap', 0.08), ('covariance', 0.079), ('projection', 0.078), ('bound', 0.075), ('bra', 0.073), ('cxi', 0.073), ('dist', 0.072), ('kernel', 0.065), ('theorem', 0.064), ('projectors', 0.064), ('bounds', 0.063), ('norm', 0.062), ('eigenspectrum', 0.058), ('lc', 0.058), ('proof', 0.055), ('spanned', 0.054), ('subspaces', 0.054), ('pr', 0.053), ('inequality', 0.051), ('span', 0.05), ('blanchard', 0.049), ('pca', 0.047), ('hilbert', 0.047), ('estimating', 0.045), ('imply', 0.045), ('nonzero', 0.044), ('eigenvector', 0.043), ('orthogonal', 0.043), ('gin', 0.042), ('recentered', 0.042), ('spectral', 0.04), ('supx', 0.039), ('luxburg', 0.039), ('gathering', 0.036), ('convergence', 0.036), ('inverse', 0.035), ('sn', 0.035), ('complexity', 0.034), ('alone', 0.033), ('principal', 0.033), ('suboptimal', 0.032), ('von', 0.032), ('symmetric', 0.032), ('integral', 0.032), ('centered', 0.032), ('considering', 0.031), ('empirical', 0.031), ('gaps', 0.031), ('id', 0.031), ('holds', 0.031), ('error', 0.03), ('onto', 0.03), ('dt', 0.03), ('concludes', 0.03), ('interested', 0.029), ('asymptotic', 0.029), ('let', 0.028), ('consequence', 0.028), ('bousquet', 0.028), ('space', 0.027), ('actually', 0.027), ('linked', 0.027), ('counterpart', 0.027), ('hk', 0.025), ('stability', 0.025), ('view', 0.025), ('interpreted', 0.025), ('colt', 0.025), ('dimensionality', 0.024), ('proceedings', 0.024), ('short', 0.024), ('approximation', 0.023), ('procedure', 0.023), ('williams', 0.023), ('component', 0.023), ('property', 0.023), ('consistency', 0.022), ('improved', 0.021), ('projections', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
2 0.29332355 134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1
Author: Jim Wielaard, Paul Sajda
Abstract: Based on a large scale spiking neuron model of the input layers 4Cα and β of macaque, we identify neural mechanisms for the observed contrast dependent receptive field size of V1 cells. We observe a rich variety of mechanisms for the phenomenon and analyze them based on the relative gain of excitatory and inhibitory synaptic inputs. We observe an average growth in the spatial extent of excitation and inhibition for low contrast, as predicted from phenomenological models. However, contrary to phenomenological models, our simulation results suggest this is neither sufficient nor necessary to explain the phenomenon.
3 0.14787248 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
4 0.12620004 70 nips-2005-Fast Information Value for Graphical Models
Author: Brigham S. Anderson, Andrew W. Moore
Abstract: Calculations that quantify the dependencies between variables are vital to many operations with graphical models, e.g., active learning and sensitivity analysis. Previously, pairwise information gain calculation has involved a cost quadratic in network size. In this work, we show how to perform a similar computation with cost linear in network size. The loss function that allows this is of a form amenable to computation by dynamic programming. The message-passing algorithm that results is described and empirical results demonstrate large speedups without decrease in accuracy. In the cost-sensitive domains examined, superior accuracy is achieved.
5 0.11058544 126 nips-2005-Metric Learning by Collapsing Classes
Author: Amir Globerson, Sam T. Roweis
Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ ´Ü¾ µ ¾ ´Ü½ ܾ µÌ ´Ü½ ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure. 2 The Approach of Collapsing Classes Given a set of Ò labeled examples ´Ü Ý µ, where Ü Ý ¾ ½ , we seek a space. We focus on Mahalanobis form metrics similarity measure between two points in ´Ü where Ü ´Ü µ ¾ Ü µÌ Ö and ´Ü Ü µ (1) is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping Ü via a linear projection Ï Ü ( Ï Ì Ï ), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point. To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). such that Specifically, for each Ü we define a conditional distribution over points Ô ´ µ ½ È (2) If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal “bi-level” distribution: Ô¼ ´ µ » Ò ½ ¼ Ý Ý Ý Ý (3) Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least Ö · ¾ points in each class, where Ö rank ℄ (note that Ö Ö). Then Ô ´ µ Ô¼ ´ µ ( ) implies that under all points in the same class will be mapped to a single point, infinitely far from other class points 1 . 1 Proof sketch: The infinite separation between points of different classes follows simply from Thus it is natural to seek a matrix such that Ô ´ µ is as close as possible to Ô¼ ´ Since we are trying to match distributions, we minimize the KL divergence ÃÄ Ô¼ Ô℄: ÑÒ ÃÄ Ô¼ ´ µ Ô ´ µ℄ ¾ ÈË ×Ø µ. (4) The crucial property of this optimization problem is that it is convex in the matrix . To see this, first note that any convex linear combination of feasible solutions « ¼ · ´½ «µ ½ s.t. ¼ « ½ is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that ´ µ alway has a greater cost than either of the endpoints. To do this, we rewrite the objective function ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ in the form 2 : È ´ µ ÐÓ Ý Ý Ô´ µ · Ý ÐÓ Ý where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why ´ µ is convex, first note that ´Ü Ü µÌ ´Ü Ü µ is linear is a ÐÓ ÜÔ function of affine functions of in , and thus convex. The function ÐÓ and is therefore also convex (see [4], page 74). È 2.1 Convex Duality Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem: À Ô´ Ñ Ü Ô´ µ where Ú Ü ×Ø µ℄ Ô¼ ´ Ú ÚÌ ℄ µ Ô´ µ Ú ÚÌ ℄ Ü , À ¡℄ is the entropy function and we require È Ô´ µ ¼ ½ (5) . To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where is PSD) 3 Ä´Ô ¬µ À ´Ô´ µµ Ì Ö´ ´ Ô¼ ´ Ú ÚÌ ℄ Ô Ú ÚÌ ℄µµµ ¬ ´ Ô´ µ ½µ The dual function is defined as ´ ¬ µ Ñ ÒÔ Ä´Ô ¬ µ. To derive it, we first solve for the minimizing Ô by setting the derivative of Ä´Ô ¬ µ w.r.t. Ô´ µ equal to zero. Ì ¼ ½ · ÐÓ Ô´ µ · Ì Ö ´ Ú Ú µ ¬ µ Ô´ µ ¬ ½ Ì Ö´ Ú ÚÌ µ Plugging solution È Ô thisThe dual to Ä Ô is¬ towe get ¬ . problem maximize È Ì Ö Ú Ú . ¬ , yielding ¬ ´ ´ µ ´ µ ½ Now note that Ì Ö´ ÐÓ Ú ÚÌ µ ÚÌ Ú ´ µ ´ ̵ ´ µ Ì Ö´ ¬ µ. È Ô¼ Ú ÚÌ ℄µ · Ȭ We can do this analytically w.r.t. , so we can write Ý Ý ÐÓ which is minus our original target function. Since ´ µ should be maximized, and we have the desired duality result (identifying with ). ¼ » µ ¼ when Ý Ý . For a given point Ü , all the points in its class satisfy Ô´ µ ½. Due to the structure of Ô´ µ in Equation 2, and because it is obeyed for all points in ܼ × class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in Ö dimensions is Ö · ½. Since by assumption we have at least Ö · ¾ points in the class of Ü , and maps points into Ö , it follows that all points are identical. 2 À Ô¼ ´ µ℄. Up to an additive constant 3 We consider the equivalent problem of minimizing minus entropy. Ô¼ ´ È 2.1.1 Relation to covariance based and embedding methods The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of Ò “spread” matrices. Each matrix is of the form Ô¼ ´ µ Ú ÚÌ ℄. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher’s Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the “target” distribution is Ô¼ (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Ï Ü . 2.2 Optimizing the Convex Objective Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution4 . It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require Ç´ µ resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of and removing the components with negative eigenvalues. The algorithm is summarized below: Ý µ, Ò Input: Set of labeled data points ´Ü Output: PSD metric which optimally collapses classes. ½ Initialization: Initialize ¼ to some PSD matrix (randomly or using some initialization heuristic). Iterate: È Ø ¯ ÔØ where Ü Ô Ü ¯ Calculate the eigen-decomposition of È È Ù ÙÌ , then set Ø Ø Ø ¯ Set Ø·½ ´ µ ´ ´ ¼´ µ µ ´ µµ´ µ´Ü Ü µÌ ·½ ·½ ·½ Ñ Ü´ ¼µÙ ÙÌ Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of Ò points in Ö-dimensional space then the primal has only Ç´Ö¾ ¾µ variables while the dual has Ç´Ò¾ µ so it will almost always be more efficient to operate on the primal directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size Ç´Ò¾ µ. 4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data. 5 In the experiments, we used an Armijo like step size rule, as described in [3]. 3 Low Dimensional Projections for Feature Extraction The Mahalanobis distance under a metric can be interpreted as a linear projection of the original inputs by the square root of , followed by Euclidean distance in the projected space. Matrices which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most Õ , we can always represent them in the form Ï Ì Ï for some projection matrix Ï of size Õ ¢ Ö. This corresponds to projecting the original data into a Õ -dimensional space specified by the rows of Ï . However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and illdefined since it becomes sensitive to initial conditions and choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing into the form Ö Ú ÚÌ where ½ and Ú are the corre¾ Ö are eigenvalues of ½ sponding eigenvectors. To obtain a low rank projection we constrain the sum above to Õ include only the Õ terms corresponding to the Õ largest eigenvalues: Õ Ú ÚÌ . ½ The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as Ô Ì Ì Ï ´ ÚÕ ℄. ½ Õ µ Ú½ È È Ô In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on unless the optimal metric is of rank less than or equal to Õ . However, as we show in the experimental results, it is often the case that for practical problems the optimal has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal. 4 Learning Metrics with Kernels It is interesting to consider the case where Ü are mapped into a high dimensional feature space ´Ü µ and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that ´Ü µ ¡ ´Ü µ ´Ü Ü µ for some kernel . We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products. Consider the regularized target function: Ê ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ · Ì Ö´ µ (6) where the regularizing factor is equivalent to the Frobenius norm of the projection matrix Ï since Ì Ö´ µ Ï ¾. Deriving w.r.t. Ï we obtain Ï Í , where Í is some matrix which specifies Ï as a linear combination of sample points, and the Ø row of the matrix Ì Í Ì Í . Defining the PSD matrix is Ü . Thus is given by Í Ì Í , we can recast our optimization as looking for a PSD matrix , where the Mahalanobis distance is ´Ü Ü µÌ Ì ´Ü Ü µ ´ µÌ ´ µ, where we define Ü. This is exactly our original distance, with Ü replaced by , which depends only on dot products in space. The regularization term also depends solely on the dot products since Ì Ö´ µ Ì Ö´ Ì µ Ì Ö´ Ì µ Ì Ö´Ã µ, where à is the kernel matrix given Ì . Note that the trace is a linear function of , keeping the problem convex. by à Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition Î Ì Î 6 , and take the projection matrix to be the first Õ rows of ¼ Î . Ì , and thus Ì Ì As a first step, we calculate a matrix such that . Since is a correlation matrix for the rows of it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of . Denoting by Î « the eigenvector matrix, we obtain, after some algebra, that « Ã Ì «. We conclude that « is an eigenvector of the matrix Ã Ì . Denote by « the matrix whose rows are orthonormal eigenvectors of Ã Ì . Then Î can be shown to be orthonormal if we set ¼ « . The final projection will then be ¼ Î Ü « . Low dimensional Î projections will be obtained by keeping only the first Õ components of this projection. 5 Experimental Results We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22). The algorithms used in the comparative evaluation were ¯ ¯ ¯ Fisher’s Linear Discriminant Analysis (LDA), which projects on the eigenvectors of ËϽ Ë where ËÏ Ë are the within and between class covariance matrices. The method of Xing et al [10] which minimizes the mean within class distance, while keeping the mean between class distance larger than one. Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants. We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML)7 . Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8 . The out-of sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing9 ) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets. Where Î is orthonormal, and the eigenvalues in are sorted in decreasing order. The regularization parameter and the width of the RBF kernel were chosen using 5 fold crossvalidation. KMCML was only evaluated for datasets with less than 1000 training points. 8 To obtain low dimensional mappings we used the approach outlined in Section 3. 9 Except for the larger datasets where 1000 random samples were used for training. 6 7 Wine 0.2 0.1 2 4 6 8 10 Projection Dimension 0.5 Error Rate Error Rate Balance Ion MCML PCAW LDA XING KMCML 0.3 0.25 0.4 0.2 0.15 0.3 0.1 0.2 0.05 10 20 30 Projection Dimension Soybean−small 0.1 1 2 3 Protein 4 Spam 0.6 0.4 0.25 0.5 0.4 0 5 10 15 0.2 0.3 0.2 0.15 20 5 10 Yale7 15 20 0.1 10 20 30 40 50 40 50 Digits Housing 0.6 0.4 0.4 0.3 0.4 0.35 0.2 0.3 0.2 0.1 0.25 10 20 30 40 50 5 10 15 10 20 30 Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et.al.[10], PCA with whitening transformation (PCAW) and Fisher’s Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. 5.1 Comparison to non convex procedures The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the Non Convex variant of MCML (NMCML): The objective function of MCML can be optimized w.r.t the projection matrix Ï , where Ï Ì Ï . Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix Ï by restarting the optimization separately for each size of Ï . Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix , revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal. 6 Discussion and Extensions We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class Balance Error Rate Wine MCML NMCML NCA 0.2 Soybean 0.4 0.1 0.3 0.1 0.05 0.2 0 2 4 6 8 10 Projection Dimension 0.1 1 0 2 Protein 3 4 5 10 Ion 15 20 Housing 0.4 0.6 0.3 0.5 0.35 0.2 0.4 0.3 0.3 0.25 5 10 15 20 0.1 10 20 30 5 10 15 Figure 2: Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset. may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version. References Ò [1] N. Alon and P. Pudlak. Equilateral sets in ÐÔ . Geom. Funct. Anal., 13(3), 2003. [2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996. [3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transaction on Automatic Control, 21(2):174–184, 1976. [4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004. [5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001. [7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002. [8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. [9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002. [10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.
6 0.10526136 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.09997049 47 nips-2005-Consistency of one-class SVM and related algorithms
8 0.087142646 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
9 0.080746785 85 nips-2005-Generalization to Unseen Cases
10 0.079444811 182 nips-2005-Statistical Convergence of Kernel CCA
11 0.068680026 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
12 0.064056441 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
13 0.061137106 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
14 0.061110031 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
15 0.060251951 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
16 0.05793909 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.056280296 58 nips-2005-Divergences, surrogate loss functions and experimental design
18 0.055471726 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
19 0.054771379 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
20 0.054491814 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
topicId topicWeight
[(0, 0.181), (1, 0.044), (2, -0.054), (3, -0.118), (4, -0.026), (5, 0.045), (6, -0.136), (7, -0.12), (8, -0.036), (9, 0.004), (10, 0.123), (11, 0.147), (12, -0.019), (13, -0.076), (14, -0.126), (15, 0.24), (16, -0.001), (17, -0.273), (18, -0.17), (19, 0.189), (20, 0.029), (21, -0.206), (22, 0.051), (23, 0.219), (24, 0.023), (25, 0.041), (26, -0.012), (27, -0.065), (28, -0.149), (29, 0.02), (30, 0.106), (31, -0.131), (32, -0.126), (33, -0.085), (34, -0.018), (35, -0.139), (36, -0.067), (37, 0.041), (38, -0.019), (39, -0.001), (40, 0.011), (41, 0.025), (42, -0.019), (43, -0.014), (44, 0.014), (45, -0.032), (46, 0.073), (47, -0.048), (48, 0.032), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.94946069 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
2 0.71494019 134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1
Author: Jim Wielaard, Paul Sajda
Abstract: Based on a large scale spiking neuron model of the input layers 4Cα and β of macaque, we identify neural mechanisms for the observed contrast dependent receptive field size of V1 cells. We observe a rich variety of mechanisms for the phenomenon and analyze them based on the relative gain of excitatory and inhibitory synaptic inputs. We observe an average growth in the spatial extent of excitation and inhibition for low contrast, as predicted from phenomenological models. However, contrary to phenomenological models, our simulation results suggest this is neither sufficient nor necessary to explain the phenomenon.
3 0.44296739 70 nips-2005-Fast Information Value for Graphical Models
Author: Brigham S. Anderson, Andrew W. Moore
Abstract: Calculations that quantify the dependencies between variables are vital to many operations with graphical models, e.g., active learning and sensitivity analysis. Previously, pairwise information gain calculation has involved a cost quadratic in network size. In this work, we show how to perform a similar computation with cost linear in network size. The loss function that allows this is of a form amenable to computation by dynamic programming. The message-passing algorithm that results is described and empirical results demonstrate large speedups without decrease in accuracy. In the cost-sensitive domains examined, superior accuracy is achieved.
4 0.38313803 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
5 0.36003566 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
6 0.34601003 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
7 0.33132729 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
8 0.29518586 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
9 0.28811225 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
10 0.28114063 85 nips-2005-Generalization to Unseen Cases
11 0.27172801 126 nips-2005-Metric Learning by Collapsing Classes
12 0.27103952 47 nips-2005-Consistency of one-class SVM and related algorithms
13 0.26968542 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
14 0.26501206 112 nips-2005-Learning Minimum Volume Sets
15 0.25633207 95 nips-2005-Improved risk tail bounds for on-line algorithms
16 0.2550897 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
17 0.25010014 71 nips-2005-Fast Krylov Methods for N-Body Learning
18 0.23646742 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
19 0.23214902 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
20 0.23010987 2 nips-2005-A Bayes Rule for Density Matrices
topicId topicWeight
[(3, 0.085), (10, 0.031), (18, 0.359), (27, 0.023), (31, 0.029), (34, 0.095), (50, 0.016), (55, 0.04), (69, 0.036), (73, 0.054), (88, 0.077), (91, 0.048)]
simIndex simValue paperId paperTitle
1 0.8752116 125 nips-2005-Message passing for task redistribution on sparse graphs
Author: K. Wong, Zhuo Gao, David Tax
Abstract: The problem of resource allocation in sparse graphs with real variables is studied using methods of statistical physics. An efficient distributed algorithm is devised on the basis of insight gained from the analysis and is examined using numerical simulations, showing excellent performance and full agreement with the theoretical results.
same-paper 2 0.8307687 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
3 0.59960544 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
Author: Edward Snelson, Zoubin Ghahramani
Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1
4 0.47167081 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1
5 0.44369793 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
6 0.4425391 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
7 0.43833584 182 nips-2005-Statistical Convergence of Kernel CCA
8 0.4318381 112 nips-2005-Learning Minimum Volume Sets
9 0.43108216 177 nips-2005-Size Regularized Cut for Data Clustering
10 0.42688355 62 nips-2005-Efficient Estimation of OOMs
11 0.42497492 50 nips-2005-Convex Neural Networks
12 0.42464551 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
13 0.42461678 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
14 0.42454123 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
15 0.42363104 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
16 0.42327225 160 nips-2005-Query by Committee Made Real
17 0.42272785 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
18 0.42211899 58 nips-2005-Divergences, surrogate loss functions and experimental design
19 0.42182913 184 nips-2005-Structured Prediction via the Extragradient Method
20 0.4211525 74 nips-2005-Faster Rates in Regression via Active Learning