nips nips2005 nips2005-182 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. [sent-9, score-0.27]
2 This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. [sent-10, score-0.205]
3 The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. [sent-11, score-0.062]
4 1 Introduction Kernel canonical correlation analysis (kernel CCA) has been proposed as a nonlinear extension of CCA [1, 11, 3]. [sent-12, score-0.084]
5 Given two random variables, kernel CCA aims at extracting the information which is shared by the two random variables, and has been successfully applied in various practical contexts. [sent-13, score-0.083]
6 More precisely, given two random variables X and Y , the purpose of kernel CCA is to provide nonlinear mappings f (X) and g(Y ) such that their correlation is maximized. [sent-14, score-0.143]
7 As in many statistical methods, the desired functions are in practice estimated from a finite sample. [sent-15, score-0.048]
8 Thus, the convergence of the estimated functions to the population ones with increasing sample size is very important to justify the method. [sent-16, score-0.115]
9 Since the goal of kernel CCA is to estimate a pair of functions, the convergence should be evaluated in an appropriate functional norm: thus, we need tools from functional analysis to characterize the type of convergence. [sent-17, score-0.157]
10 The purpose of this paper is to rigorously prove the statistical convergence of kernel CCA, and of a related method. [sent-18, score-0.173]
11 Both kernel CCA and NOCCO (a method based on the normalized cross-covariance operator) require a regularization coefficient to enforce smoothness of the functions in the finite sample case (thus avoiding a trivial solution), but the decay of this regularization coefficient with increasing sample size has not yet been established. [sent-20, score-0.196]
12 Our main theorems give a sufficient condition on the decay of the regularization coefficient for the finite sample estimates to converge to the desired functions in the population limit. [sent-21, score-0.164]
13 Another important issue in establishing the convergence is an appropriate distance measure for functions. [sent-22, score-0.074]
14 For NOCCO, we obtain convergence in the norm of reproducing kernel Hilbert spaces (RKHS) [2]. [sent-23, score-0.252]
15 This norm is very strong: if the positive definite (p.d.) kernels are continuous and bounded, it is stronger than the uniform norm in the space of continuous functions, and thus the estimated functions converge uniformly to the desired ones. [sent-24, score-0.072] [sent-26, score-0.139]
17 For kernel CCA, we show convergence in the L2 norm, which is a standard distance measure for functions. [sent-27, score-0.157]
18 2 Kernel CCA and related methods In this section, we review kernel CCA as presented by [3], and then formulate it with covariance operators on RKHS. [sent-29, score-0.182]
19 In this paper, a Hilbert space always refers to a separable Hilbert space, and an operator to a linear operator. [sent-30, score-0.096]
20 ‖T‖ denotes the operator norm sup_{‖ϕ‖=1} ‖Tϕ‖, and R(T) denotes the range of an operator T. [sent-31, score-0.264]
21 Throughout this paper, (HX, kX) and (HY, kY) are RKHS of functions on measurable spaces X and Y, respectively, with measurable positive definite kernels kX and kY; we always assume E[kX(X, X)] < ∞ and E[kY(Y, Y)] < ∞. (1) [sent-32, score-0.084]
22 Note that under this assumption it is easy to see that HX and HY are continuously included in L2(PX) and L2(PY), respectively, where L2(µ) denotes the Hilbert space of square integrable functions with respect to the measure µ. [sent-38, score-0.051]
23 2.1 CCA in reproducing kernel Hilbert spaces Classical CCA provides the linear mappings a^T X and b^T Y that achieve maximum correlation. [sent-40, score-0.138]
24 More precisely, kernel CCA solves
max_{f ∈ HX, g ∈ HY}  Cov[f(X), g(Y)] / ( Var[f(X)]^{1/2} Var[g(Y)]^{1/2} ).   (2) [sent-42, score-0.099]
25 Given an i.i.d. sample (X1, Y1), . . . , (Xn, Yn) from PXY, an empirical solution of Eq. (2) is
max_{f ∈ HX, g ∈ HY}  Cov[f(X), g(Y)] / ( (Var[f(X)] + εn ‖f‖²_HX)^{1/2} (Var[g(Y)] + εn ‖g‖²_HY)^{1/2} ),   (3) [sent-50, score-0.022]
26 where Cov and Var here denote the empirical covariance and variance, such as
Cov[f(X), g(Y)] = (1/n) Σ_{i=1}^n ( f(Xi) − (1/n) Σ_{j=1}^n f(Xj) ) ( g(Yi) − (1/n) Σ_{j=1}^n g(Yj) ). [sent-51, score-0.07]
27 The positive constant εn is a regularization coefficient. [sent-52, score-0.037]
28 As we shall see, the regularization terms εn ‖f‖²_HX and εn ‖g‖²_HY make the problem statistically well-formulated, enforce smoothness, and enable operator inversion, as in Tikhonov regularization. [sent-53, score-0.13]
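As a concrete illustration of Eq. (3) (a minimal NumPy sketch, not code from the paper; the function and variable names are hypothetical), the regularized objective can be evaluated from the Gram matrices when f and g are expanded over centred kernel functions at the sample points, f = Σᵢ αᵢ ( kX(·, Xi) − (1/n) Σⱼ kX(·, Xj) ) and similarly for g, which is the form the empirical solutions take later in this section.

```python
import numpy as np

def kcca_objective(Kx, Ky, alpha, beta, eps_n):
    """Value of the regularized empirical kernel CCA objective, Eq. (3),
    for f, g expanded over centred kernel functions with coefficient
    vectors alpha, beta.  Sketch for illustration only."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H            # centred Gram matrices
    cov = alpha @ Kxc @ Kyc @ beta / n           # empirical Cov[f(X), g(Y)]
    var_f = alpha @ Kxc @ Kxc @ alpha / n        # empirical Var[f(X)]
    var_g = beta @ Kyc @ Kyc @ beta / n
    norm_f2 = alpha @ Kxc @ alpha                # ||f||^2 in H_X
    norm_g2 = beta @ Kyc @ beta
    return cov / np.sqrt(var_f + eps_n * norm_f2) / np.sqrt(var_g + eps_n * norm_g2)
```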
29 2.2 Representation with cross-covariance operators Kernel CCA and related methods can be formulated using covariance operators [4, 7, 8], which make theoretical discussions easier. [sent-55, score-0.166]
30 It is known that there exists a unique cross-covariance operator ΣYX : HX → HY for (X, Y) such that ⟨g, ΣYX f⟩_HY = EXY[(f(X) − EX[f(X)])(g(Y) − EY[g(Y)])] (= Cov[f(X), g(Y)]) holds for all f ∈ HX and g ∈ HY. [sent-56, score-0.096]
31 The cross-covariance operator represents the covariance of f(X) and g(Y) as a bilinear form of f and g. [sent-57, score-0.16]
32 In particular, if Y is equal to X, the self-adjoint operator ΣXX is called the covariance operator. [sent-58, score-0.139]
33 The empirical cross-covariance operator Σ^(n)YX is defined as the cross-covariance operator with respect to the empirical distribution (1/n) Σ_{i=1}^n δ_{Xi} δ_{Yi}. [sent-66, score-0.236]
34 By definition, for any f ∈ HX and g ∈ HY, the operator Σ^(n)YX gives the empirical covariance as follows: ⟨g, Σ^(n)YX f⟩_HY = Cov[f(X), g(Y)]. [sent-67, score-0.15]
35 It is known [4] that ΣYX can be represented as ΣYX = ΣYY^{1/2} VYX ΣXX^{1/2}, (4) where VYX : HX → HY is a unique bounded operator such that ‖VYX‖ ≤ 1 and VYX = QY VYX QX. [sent-69, score-0.119]
36 With cross-covariance operators, the kernel CCA problem can be formulated as
sup_{f ∈ HX, g ∈ HY} ⟨g, ΣYX f⟩_HY   subject to   ⟨f, ΣXX f⟩_HX = 1,  ⟨g, ΣYY g⟩_HY = 1.   (5) [sent-71, score-0.099]
37 The solution of Eq. (5) is given by the eigenfunctions corresponding to the largest eigenvalue of the following generalized eigenproblem:
[ O    ΣXY ] [ f ]        [ ΣXX   O   ] [ f ]
[ ΣYX   O  ] [ g ]  = ρ1  [  O    ΣYY ] [ g ].   (6) [sent-73, score-0.057]
38 Similarly, the empirical estimator in Eq. (3) is obtained from
sup_{f ∈ HX, g ∈ HY} ⟨g, Σ^(n)YX f⟩_HY   subject to   ⟨f, (Σ^(n)XX + εn I) f⟩_HX = 1,  ⟨g, (Σ^(n)YY + εn I) g⟩_HY = 1.   (7) [sent-74, score-0.022]
39 Let us assume that the operator VYX is compact,1 and let φ and ψ be the unit eigenfunctions of VYX corresponding to the largest singular value; that is,
⟨ψ, VYX φ⟩_HY = max_{f ∈ HX, g ∈ HY, ‖f‖_HX = ‖g‖_HY = 1} ⟨g, VYX f⟩_HY.   (8) [sent-76, score-0.213]
40 Given φ ∈ R(ΣXX) and ψ ∈ R(ΣYY), the kernel CCA solution to the population problem is given by f = ΣXX^{-1/2} φ and g = ΣYY^{-1/2} ψ. (9) [sent-77, score-0.083]
41 In the empirical case, let φn ∈ HX and ψn ∈ HY be the unit eigenfunctions corresponding to the largest singular value of the finite rank operator V^(n)YX := (Σ^(n)YY + εn I)^{-1/2} Σ^(n)YX (Σ^(n)XX + εn I)^{-1/2}. (10) [sent-79, score-0.219]
42 Then the kernel CCA estimators in Eq. (7) are equal to fn = (Σ^(n)XX + εn I)^{-1/2} φn and gn = (Σ^(n)YY + εn I)^{-1/2} ψn. (11) [sent-82, score-0.262]
43 1 A bounded operator T : H1 → H2 is called compact if any bounded sequence {un} ⊂ H1 has a subsequence {un′} such that T un′ converges in H2. [sent-83, score-0.197]
44 One of the useful properties of a compact operator is that it admits a singular value decomposition (see [5, 6]). Note that all the above empirical operators and the estimators can be expressed in terms of Gram matrices. [sent-84, score-0.247]
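As an illustration of this Gram-matrix reduction for NOCCO, the following is a minimal sketch (not code from the paper), assuming the standard reduction that the nonzero squared singular values of V^(n)YX equal the eigenvalues of RY RX, where R = K̃ (K̃ + n εn I)^{-1} and K̃ is the centred Gram matrix.

```python
import numpy as np

def nocco_top_singular_value(Kx, Ky, eps_n):
    """Largest singular value of the empirical NOCCO operator V_YX^(n),
    computed from Gram matrices (sketch only)."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    Rx = Kxc @ np.linalg.inv(Kxc + n * eps_n * np.eye(n))
    Ry = Kyc @ np.linalg.inv(Kyc + n * eps_n * np.eye(n))
    lam = np.linalg.eigvals(Ry @ Rx)   # real and nonnegative up to round-off
    return float(np.sqrt(max(lam.real.max(), 0.0)))
```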
45 The solutions fn and gn are exactly the same as those given in [3], and are obtained by linear combinations of kX(·, Xi) − (1/n) Σ_{j=1}^n kX(·, Xj) and kY(·, Yi) − (1/n) Σ_{j=1}^n kY(·, Yj). [sent-85, score-0.262]
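Concretely, the kernel CCA estimates can be computed by a Gram-matrix generalized eigenproblem; the sketch below is an illustration rather than the authors' implementation (the small jitter added to the constraint block is only a numerical convenience, since the centred Gram matrices are singular).

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca_estimates(Kx, Ky, eps_n, jitter=1e-10):
    """First empirical kernel CCA pair for Eq. (3): returns (rho1, alpha, beta),
    where f_n = sum_i alpha_i (k_X(., X_i) - (1/n) sum_j k_X(., X_j)) and
    similarly for g_n.  Sketch only."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    Z = np.zeros((n, n))
    # Covariance block (off-diagonal) vs. regularized variance block (diagonal):
    A = np.block([[Z, Kxc @ Kyc], [Kyc @ Kxc, Z]]) / n
    B = np.block([[Kxc @ Kxc / n + eps_n * Kxc, Z],
                  [Z, Kyc @ Kyc / n + eps_n * Kyc]])
    B += jitter * np.eye(2 * n)        # B alone is singular
    evals, evecs = eigh(A, B)          # symmetric generalized eigenproblem
    rho1 = evals[-1]                   # largest generalized eigenvalue
    alpha, beta = evecs[:n, -1], evecs[n:, -1]
    return rho1, alpha, beta
```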
46 The constrained covariance (COCO) [9] uses the unit eigenfunctions of ΣYX:
max_{f ∈ HX, g ∈ HY, ‖f‖_HX = ‖g‖_HY = 1} ⟨g, ΣYX f⟩_HY = max_{f ∈ HX, g ∈ HY, ‖f‖_HX = ‖g‖_HY = 1} Cov[f(X), g(Y)]. [sent-88, score-0.121]
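For comparison, the empirical COCO value (the largest singular value of the empirical cross-covariance operator Σ^(n)YX) has a particularly simple Gram-matrix expression; the sketch below assumes, rather than quotes from [9], that the nonzero eigenvalues of Σ^(n)XY Σ^(n)YX coincide with those of K̃Y K̃X / n².

```python
import numpy as np

def empirical_coco(Kx, Ky):
    """Empirical constrained covariance (COCO): largest singular value of
    the empirical cross-covariance operator (sketch only)."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    lam = np.linalg.eigvals(Kyc @ Kxc)   # = n^2 * squared singular values
    return float(np.sqrt(max(lam.real.max(), 0.0)) / n)
```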
47 The statistical convergence of COCO has been proved in [8]. [sent-89, score-0.09]
48 Instead of normalizing the covariance by the variances, COCO normalizes it by the RKHS norms of f and g. [sent-90, score-0.032]
49 On the other hand, kernel CCA may encounter situations where it finds functions with moderately large covariance but very small variance for f (X) or g(Y ), since ΣXX and ΣY Y can have arbitrarily small eigenvalues. [sent-93, score-0.131]
50 While the statistical meaning of NOCCO is not as direct as that of kernel CCA, it can incorporate the normalization by ΣXX and ΣYY. [sent-95, score-0.099]
51 We will establish the convergence of kernel CCA and NOCCO in Section 3. [sent-96, score-0.157]
52 3 Main theorems: convergence of kernel CCA and NOCCO We show the convergence of NOCCO in the RKHS norm, and that of kernel CCA in the L2 sense. [sent-97, score-0.314]
53 The results may easily be extended to the convergence of the eigenspace corresponding to the m-th largest eigenvalue. [sent-98, score-0.106]
54 Let φ, ψ, φn, and ψn be the unit eigenfunctions of Eqs. (8) and (10). [sent-103, score-0.057]
55 The convergence of NOCCO in the RKHS norm is a very strong result. [sent-112, score-0.13]
56 If kX and kY are continuous and bounded, the RKHS norm is stronger than the uniform norm on the space of continuous functions. [sent-113, score-0.142]
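This can be seen from the reproducing property (a standard bound, stated here for completeness):

sup_x |φn(x) − φ(x)| = sup_x |⟨φn − φ, kX(·, x)⟩_HX| ≤ ‖φn − φ‖_HX · sup_x kX(x, x)^{1/2},

so convergence in the RKHS norm implies uniform convergence whenever kX is bounded.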
57 In such cases, Theorem 1 implies φn and ψn converge uniformly to φ and ψ, respectively. [sent-114, score-0.021]
58 This uniform convergence is useful in practice, because in many applications the function value at each point is important. [sent-115, score-0.085]
59 For any complete orthonormal systems (CONS) {φi}_{i=1}^∞ of HX and {ψi}_{i=1}^∞ of HY, the compactness assumption on VYX requires that the correlation of ΣXX^{-1/2}φi(X) and ΣYY^{-1/2}ψi(Y) decay to zero as i → ∞. [sent-116, score-0.055]
60 Moreover, kernel CCA does not have solutions if ΣXX has arbitrarily small eigenvalues. [sent-120, score-0.083]
61 Leurgans et al. ([10]) discuss CCA on curves, which are represented by stochastic processes on an interval, and use the Sobolev space of functions with square-integrable second derivative. [sent-122, score-0.051]
62 Since the Sobolev space is a RKHS, their method is an example of kernel CCA. [sent-123, score-0.083]
63 They also show the convergence of estimators under the condition n^{1/2} εn → ∞. [sent-124, score-0.11]
64 Although the proof can be extended to a general RKHS, convergence is measured by the correlation,
|⟨fn, ΣXX f⟩_HX| / ( ⟨fn, ΣXX fn⟩_HX^{1/2} ⟨f, ΣXX f⟩_HX^{1/2} ) → 1,
which is weaker than the L2 convergence in Theorem 2. [sent-125, score-0.75]
65 In fact, using ⟨f, ΣXX f⟩_HX = 1, it is easy to derive the above convergence from Theorem 2. [sent-126, score-0.074]
66 On the other hand, convergence of the correlation does not necessarily imply ⟨(fn − f), ΣXX(fn − f)⟩_HX → 0. [sent-127, score-0.103]
67 From the equality
⟨(fn − f), ΣXX(fn − f)⟩_HX = ( ⟨fn, ΣXX fn⟩_HX^{1/2} − ⟨f, ΣXX f⟩_HX^{1/2} )² + 2{ 1 − ⟨fn, ΣXX f⟩_HX / ( ‖ΣXX^{1/2} fn‖_HX ‖ΣXX^{1/2} f‖_HX ) } ‖ΣXX^{1/2} fn‖_HX ‖ΣXX^{1/2} f‖_HX,
we require ⟨fn, ΣXX fn⟩_HX → ⟨f, ΣXX f⟩_HX = 1 in order to guarantee that the left hand side converges to zero. [sent-128, score-1.35]
68 However, with the normalization ⟨fn, (Σ^(n)XX + εn I) fn⟩_HX = 1, convergence of ⟨fn, ΣXX fn⟩_HX is not clear. [sent-129, score-0.644]
69 4 Outline of the proof of the main theorems We give only an outline of the proofs in this paper. [sent-131, score-0.101]
70 4.1 Preliminary lemmas We introduce some definitions for our proofs. [sent-134, score-0.036]
71 An operator T : H1 → H2 is called Hilbert-Schmidt if Σ_{i=1}^∞ ‖Tϕi‖²_{H2} < ∞ for a CONS {ϕi}_{i=1}^∞ of H1. [sent-136, score-0.107]
72 For Hilbert-Schmidt operators, the Hilbert-Schmidt norm and inner product are defined as ‖T‖²_HS = Σ_{i=1}^∞ ‖Tϕi‖²_{H2} and ⟨T1, T2⟩_HS = Σ_{i=1}^∞ ⟨T1ϕi, T2ϕi⟩_{H2}. [sent-138, score-0.056]
73 For a Hilbert space F, a Borel measurable map F : Ω → F from a measurable space Ω is called a random element in F. [sent-141, score-0.081]
74 For a random element F in F with E‖F‖ < ∞, there exists a unique element E[F] ∈ F, called the expectation of F, such that ⟨E[F], g⟩_F = E[⟨F, g⟩_F] (∀g ∈ F) holds. [sent-142, score-0.043]
75 The cross-covariance operator ΣYX is Hilbert-Schmidt, so that ‖ΣYX‖²_HS is finite. [sent-152, score-0.096]
76 The following lemma shows a much stronger uniform result. [sent-154, score-0.087]
77 Using E[Fi] = E[Gi] = 0 and E[Fi Gj Fk Gℓ] = 0 for i ≠ j, {k, ℓ} ≠ {i, j}, we have
E‖Σ^(n)YX − ΣYX‖²_HS = (1/n) E‖F G‖²_{HX⊗HY} − (1/n) ‖E[F G]‖²_{HX⊗HY}.
The proof is completed by Chebyshev’s inequality. [sent-171, score-0.032]
78 The following two lemmas are essential parts of the proof of the main theorems. [sent-173, score-0.068]
79 The operator on the left hand side is equal to
{(Σ^(n)YY + εn I)^{-1/2} − (ΣYY + εn I)^{-1/2}} Σ^(n)YX (Σ^(n)XX + εn I)^{-1/2}
+ (ΣYY + εn I)^{-1/2} {Σ^(n)YX − ΣYX} (Σ^(n)XX + εn I)^{-1/2}
+ (ΣYY + εn I)^{-1/2} ΣYX {(Σ^(n)XX + εn I)^{-1/2} − (ΣXX + εn I)^{-1/2}}. [sent-178, score-0.096]
80 From ‖(Σ^(n)YY + εn I)^{-1/2}‖ ≤ 1/√εn, ‖(Σ^(n)YY + εn I)^{-1/2} Σ^(n)YX (Σ^(n)XX + εn I)^{-1/2}‖ ≤ 1 and Lemma 7, the norm of the above operator is upper-bounded by
(1/εn) ( (3/√εn) max{ ‖Σ^(n)YY‖^{3/2}, ‖ΣYY‖^{3/2} } + 1 ) ‖Σ^(n)YY − ΣYY‖. [sent-181, score-0.168]
81 It suffices to prove that {(ΣYY + εn I)^{-1/2} − ΣYY^{-1/2}} ΣYX (ΣXX + εn I)^{-1/2} and ΣYY^{-1/2} ΣYX {(ΣXX + εn I)^{-1/2} − ΣXX^{-1/2}} converge to zero. [sent-189, score-0.021]
82 From ‖ΣXX^{1/2}(fn − f)‖²_HX = ‖ΣXX^{1/2} fn‖²_HX − 2⟨φ, ΣXX^{1/2} fn⟩_HX + ‖φ‖²_HX, it suffices to show that ΣXX^{1/2} fn converges to φ in probability. [sent-210, score-0.59]
83 We have
‖ΣXX^{1/2} fn − φ‖_HX ≤ ‖ΣXX^{1/2} {(Σ^(n)XX + εn I)^{-1/2} − (ΣXX + εn I)^{-1/2}} φn‖_HX + ‖ΣXX^{1/2} (ΣXX + εn I)^{-1/2} (φn − φ)‖_HX + ‖ΣXX^{1/2} (ΣXX + εn I)^{-1/2} φ − φ‖_HX. [sent-211, score-0.19]
84 Using the same argument as the bound on the first term in Eq. [sent-212, score-0.024]
85 the R.H.S. of the above inequality is shown to converge to zero. [sent-215, score-0.021]
86 Using the assumption φ ∈ R(ΣXX ), the same argument as the proof of Eq. [sent-217, score-0.044]
87 5 Concluding remarks We have established the statistical convergence of kernel CCA and NOCCO, showing that the finite sample estimators of the nonlinear mappings converge to the desired population functions. [sent-219, score-0.287]
88 This convergence is proved in the RKHS norm for NOCCO, and in the L2 norm for kernel CCA. [sent-220, score-0.269]
89 These results give a theoretical justification for using the empirical estimates of NOCCO and kernel CCA in practice. [sent-221, score-0.105]
90 We have also derived a sufficient condition, n^{1/3} εn → ∞, for the decay of the regularization coefficient εn, which ensures the convergence described above. [sent-222, score-0.121]
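Concretely, any polynomial schedule that decays strictly more slowly than n^{-1/3} satisfies this condition; a trivial sketch (the constant c and exponent a are arbitrary illustrative choices, not prescribed by the paper):

```python
def eps_schedule(n, c=1.0, a=0.25):
    """Regularization schedule eps_n = c * n**(-a) with 0 < a < 1/3, so that
    n**(1/3) * eps_n = c * n**(1/3 - a) -> infinity as n grows."""
    assert 0.0 < a < 1.0 / 3.0
    return c * n ** (-a)
```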
91 As [10] suggests, the order of the sufficient condition seems to depend on the function norm used to determine convergence. [sent-223, score-0.071]
92 An interesting consideration is whether the order n^{1/3} εn → ∞ can be improved for convergence in the L2 or RKHS norm. [sent-224, score-0.074]
93 Another question that remains to be addressed is when to use kernel CCA, COCO, or NOCCO in practice. [sent-225, score-0.083]
94 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-273, score-0.108]
95 Tech Report 140, Max-Planck-Institut für biologische Kybernetik, 2005. [sent-282, score-0.023]
96 A Lemmas used in the proofs We list the lemmas used in Section 4. [sent-308, score-0.036]
97 Suppose A and B are positive self-adjoint operators on a Hilbert space such that 0 ≤ A ≤ λI and 0 ≤ B ≤ λI hold for a positive constant λ. [sent-311, score-0.099]
98 Suppose An and A are bounded operators on H2, and B is a compact operator from H1 to H2 such that An u → Au for all u ∈ H0, and sup_n ‖An‖ ≤ M for some M > 0. [sent-315, score-0.22]
99 Let A be a compact positive operator on a Hilbert space H, and An (n ∈ N) be bounded positive operators on H such that An converges to A in norm. [sent-318, score-0.262]
100 Assume the eigenspace of A corresponding to the largest eigenvalue is one-dimensional and spanned by a unit eigenvector φ, and the maximum of the spectrum of An is attained by a unit eigenvector φn . [sent-319, score-0.084]
wordName wordTfidf (topN-words)
[('hx', 0.531), ('hy', 0.45), ('xx', 0.359), ('cca', 0.347), ('vy', 0.208), ('fn', 0.19), ('nocco', 0.176), ('kx', 0.131), ('ky', 0.131), ('operator', 0.096), ('kernel', 0.083), ('ey', 0.075), ('rkhs', 0.075), ('convergence', 0.074), ('hs', 0.074), ('gn', 0.072), ('coco', 0.07), ('operators', 0.067), ('hilbert', 0.062), ('lemma', 0.057), ('norm', 0.056), ('fi', 0.05), ('ex', 0.048), ('cov', 0.048), ('fukumizu', 0.047), ('coe', 0.041), ('eigenfunctions', 0.041), ('su', 0.04), ('canonical', 0.04), ('gi', 0.039), ('lemmas', 0.036), ('integrable', 0.035), ('proof', 0.032), ('gj', 0.032), ('covariance', 0.032), ('var', 0.031), ('bach', 0.031), ('gretton', 0.031), ('pxy', 0.031), ('correlation', 0.029), ('measurable', 0.027), ('decay', 0.026), ('py', 0.026), ('fj', 0.025), ('reproducing', 0.025), ('theorems', 0.024), ('compact', 0.024), ('biologische', 0.023), ('cons', 0.023), ('kybernetik', 0.023), ('leurgans', 0.023), ('qx', 0.023), ('qy', 0.023), ('bounded', 0.023), ('empirical', 0.022), ('estimators', 0.021), ('regularization', 0.021), ('px', 0.021), ('converge', 0.021), ('un', 0.02), ('converges', 0.02), ('stronger', 0.019), ('tech', 0.019), ('sobolev', 0.019), ('singular', 0.017), ('mappings', 0.016), ('eigenspace', 0.016), ('functions', 0.016), ('element', 0.016), ('largest', 0.016), ('desired', 0.016), ('unit', 0.016), ('sup', 0.016), ('positive', 0.016), ('completes', 0.016), ('statistical', 0.016), ('max', 0.016), ('society', 0.015), ('condition', 0.015), ('nonlinear', 0.015), ('op', 0.014), ('theorem', 0.014), ('spaces', 0.014), ('nite', 0.014), ('bousquet', 0.013), ('population', 0.013), ('trivial', 0.013), ('enforce', 0.013), ('outline', 0.013), ('goes', 0.012), ('argument', 0.012), ('term', 0.012), ('sample', 0.012), ('ces', 0.011), ('called', 0.011), ('let', 0.011), ('uniform', 0.011), ('eigenvector', 0.01), ('smola', 0.01), ('supn', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
2 0.12567779 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
3 0.11110757 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ² and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
4 0.083041415 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
5 0.079444811 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. 
However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. 
In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. 
P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. 
Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . 
N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. 
Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
6 0.078410044 47 nips-2005-Consistency of one-class SVM and related algorithms
7 0.074858442 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
8 0.072259128 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
9 0.062355135 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
10 0.051747814 80 nips-2005-Gaussian Process Dynamical Models
11 0.047461763 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
12 0.047073931 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
13 0.046760701 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
14 0.046756838 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
15 0.044486891 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
16 0.044365 74 nips-2005-Faster Rates in Regression via Active Learning
17 0.036428854 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
18 0.030412471 77 nips-2005-From Lasso regression to Feature vector machine
19 0.029786462 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
20 0.02937126 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
topicId topicWeight
[(0, 0.099), (1, 0.06), (2, -0.042), (3, -0.078), (4, 0.04), (5, 0.043), (6, -0.013), (7, -0.021), (8, 0.003), (9, 0.028), (10, 0.034), (11, 0.145), (12, 0.004), (13, 0.014), (14, -0.077), (15, 0.096), (16, -0.015), (17, -0.005), (18, 0.159), (19, 0.001), (20, 0.023), (21, -0.085), (22, -0.018), (23, -0.027), (24, -0.079), (25, -0.031), (26, 0.069), (27, -0.148), (28, -0.135), (29, -0.145), (30, -0.046), (31, -0.04), (32, 0.023), (33, -0.07), (34, -0.119), (35, -0.014), (36, -0.09), (37, 0.061), (38, -0.176), (39, -0.006), (40, -0.206), (41, 0.012), (42, -0.036), (43, -0.001), (44, -0.062), (45, 0.087), (46, 0.044), (47, -0.077), (48, 0.044), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.9685598 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
2 0.5937987 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ² and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
3 0.5559637 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
4 0.42638814 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire
Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.
5 0.39293969 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
6 0.37748364 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
7 0.35918123 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
8 0.3571206 74 nips-2005-Faster Rates in Regression via Active Learning
9 0.35691276 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
10 0.35300875 47 nips-2005-Consistency of one-class SVM and related algorithms
11 0.32084066 41 nips-2005-Coarse sample complexity bounds for active learning
12 0.26718071 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
13 0.26115403 160 nips-2005-Query by Committee Made Real
14 0.24773738 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
15 0.23934518 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
16 0.2289523 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
17 0.22616786 112 nips-2005-Learning Minimum Volume Sets
18 0.22485265 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
19 0.22426845 103 nips-2005-Kernels for gene regulatory regions
20 0.21347263 71 nips-2005-Fast Krylov Methods for N-Body Learning
topicId topicWeight
[(3, 0.036), (5, 0.025), (10, 0.031), (18, 0.034), (27, 0.017), (31, 0.04), (34, 0.1), (55, 0.022), (69, 0.025), (73, 0.044), (88, 0.062), (90, 0.367), (91, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.8458249 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
2 0.68205369 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
3 0.58536595 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
Author: Zhenyue Zhang, Hongyuan Zha
Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1
4 0.37813702 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
5 0.37208349 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
6 0.36896372 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
7 0.36879894 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
8 0.36834297 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
9 0.36665165 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
10 0.36618599 184 nips-2005-Structured Prediction via the Extragradient Method
11 0.36604142 98 nips-2005-Infinite latent feature models and the Indian buffet process
12 0.36554533 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
13 0.36535403 78 nips-2005-From Weighted Classification to Policy Search
14 0.3630549 50 nips-2005-Convex Neural Networks
15 0.3623822 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
16 0.36179608 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
17 0.36076021 54 nips-2005-Data-Driven Online to Batch Conversions
18 0.35993165 178 nips-2005-Soft Clustering on Graphs
19 0.35960314 133 nips-2005-Nested sampling for Potts models
20 0.3587288 23 nips-2005-An Application of Markov Random Fields to Range Sensing