jmlr jmlr2007 jmlr2007-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton
Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space
Reference: text
sentIndex sentText sentNum sentScore
1 This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. [sent-10, score-0.101]
2 The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. [sent-11, score-0.249]
3 The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. [sent-12, score-0.121]
4 In kernel methods, data are represented as functions or elements in reproducing kernel Hilbert spaces (RKHS), which are associated with positive definite kernels. [sent-15, score-0.192]
5 Many methods have been proposed as nonlinear extensions of conventional linear methods, such as kernel principal component analysis (Schölkopf et al. [sent-17, score-0.101]
6 , 2001; Bach and Jordan, 2002) as a nonlinear extension of canonical correlation analysis with positive definite kernels. [sent-21, score-0.11]
7 Since the goal of kernel CCA is to estimate a pair of functions, the convergence should be evaluated in an appropriate functional norm; we thus need tools from functional analysis to characterize the type of convergence. [sent-32, score-0.143]
8 The purpose of this paper is to rigorously prove the statistical consistency of kernel CCA. [sent-33, score-0.099]
9 In proving the consistency of kernel CCA, we also show the consistency of a pair of functions which may be used as an alternative method for expressing the nonlinear dependence of two variables. [sent-34, score-0.151]
10 The main theorems in this paper give a sufficient condition on the decay of the regularization coefficient for the finite sample estimators to converge to the desired functions in the population limit. [sent-37, score-0.095]
11 For kernel CCA, we prove convergence in the L2 norm, which is a standard distance measure for functions. [sent-41, score-0.101]
12 There are earlier studies relevant to the convergence of functional correlation analysis. [sent-42, score-0.086]
13 An alternative approach is to study the eigenfunctions of the cross-covariance operators, without normalising by the variance, as in the constrained covariance (COCO, Gretton et al. [sent-47, score-0.089]
14 We begin our presentation in Section 2 with a review of kernel CCA and related methods, formulating them in terms of cross-covariance operators, which are the basic tools to analyze correlation in functional spaces. [sent-50, score-0.133]
15 In Section 3, we describe the two main theorems, which respectively show the convergence of kernel CCA and NOCCO. [sent-51, score-0.101]
16 Kernel Canonical Correlation Analysis In this section, we briefly review kernel CCA, following Bach and Jordan (2002), and reformulate it with covariance operators on RKHS. [sent-56, score-0.172]
17 In this paper, a Hilbert space always means a separable Hilbert space, and an operator a linear operator. [sent-58, score-0.085]
18 The operator norm of a bounded operator T is denoted by ||T|| and defined as ||T|| = sup_{||f||=1} ||T f||. [sent-59, score-0.21]
19 The null space and the range of an operator T : H1 → H2 are denoted by N (T ) and R (T ), respectively; that is, N (T ) = { f ∈ H1 | T f = 0} and R (T ) = {T f ∈ H2 | f ∈ H1 }. [sent-60, score-0.085]
20 As we shall see, the regularization terms εn ||f||²_HX and εn ||g||²_HY make the problem well-formulated statistically, enforce smoothness, and enable operator inversion, as in Tikhonov regularization (Groetsch, 1984). [sent-97, score-0.125]
21 The cross-covariance operator of (X, Y) is an operator ΣYX from HX to HY, which is defined by ⟨g, ΣYX f⟩_HY = E_XY[(f(X) − E_X[f(X)])(g(Y) − E_Y[g(Y)])] (= Cov[f(X), g(Y)]) for all f ∈ HX and g ∈ HY. [sent-112, score-0.085]
22 By regarding the right hand side as a linear functional on the direct product HX ⊗ HY , Riesz’s representation theorem (Reed and Simon, 1980, for example) guarantees the existence and uniqueness of a bounded operator ΣY X . [sent-113, score-0.106]
23 The cross-covariance operator expresses the covariance between functions in the RKHS as a bilinear functional, and contains all the information regarding the dependence of X and Y expressible by nonlinear functions in the RKHS. [sent-114, score-0.138]
24 Obviously, ΣYX = (ΣXY)*, where T* denotes the adjoint of an operator T. [sent-115, score-0.085]
25 In particular, if Y is equal to X, the self-adjoint operator ΣXX is called the covariance operator. [sent-116, score-0.111]
26 Using the mean elements, the cross-covariance operator ΣYX is rewritten ⟨g, ΣYX f⟩_HY = E_XY[⟨f, kX(·, X) − mX⟩_HX ⟨kY(·, Y) − mY, g⟩_HY]. [sent-124, score-0.085]
27 The empirical cross-covariance operator ΣYX^(n) is defined as the cross-covariance operator with the empirical distribution (1/n) Σ_{i=1}^n δ_Xi δ_Yi. [sent-132, score-0.17]
28 By definition, for any f ∈ HX and g ∈ HY, the operator ΣYX^(n) gives the empirical covariance as follows: ⟨g, ΣYX^(n) f⟩_HY = (1/n) Σ_{i=1}^n ⟨g, kY(·, Yi) − (1/n) Σ_{s=1}^n kY(·, Ys)⟩_HY ⟨kX(·, Xi) − (1/n) Σ_{t=1}^n kX(·, Xt), f⟩_HX = Ĉov[f(X), g(Y)]. [sent-133, score-0.111]
29 It is known (Baker, 1973, Theorem 1) that ΣYX has a representation ΣYX = ΣYY^{1/2} VYX ΣXX^{1/2}, (5) where VYX : HX → HY is a unique bounded operator such that ||VYX|| ≤ 1 and VYX = QY VYX QX. [sent-136, score-0.085]
30 Cross-covariance operators have been defined for Banach spaces by Baker (1973). [sent-140, score-0.111]
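The factorization in Eq. (5) has a simple finite-dimensional analogue that can be checked numerically: for random vectors, the cross-covariance matrix factors through the marginal covariances, and the middle factor has operator norm at most one (its singular values are the ordinary canonical correlations). The sketch below is our own illustration in Python/NumPy; the data, helper names, and dimensions are assumptions for the example, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-dimensional analogue of Eq. (5): C_YX = C_YY^{1/2} V C_XX^{1/2}, ||V|| <= 1.
n = 500
Z = rng.standard_normal((n, 5))                 # shared latent variables
X = Z[:, :3] @ rng.standard_normal((3, 3))      # X depends on Z_1..Z_3
Y = Z[:, 2:] @ rng.standard_normal((3, 2))      # Y depends on Z_3..Z_5
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
C_XX, C_YY, C_YX = Xc.T @ Xc / n, Yc.T @ Yc / n, Yc.T @ Xc / n

def inv_sqrt(C, jitter=1e-12):
    """Inverse square root of a symmetric PSD matrix."""
    w, U = np.linalg.eigh(C)
    return U @ np.diag(1.0 / np.sqrt(np.maximum(w, jitter))) @ U.T

V = inv_sqrt(C_YY) @ C_YX @ inv_sqrt(C_XX)      # "normalized" cross-covariance
print(np.linalg.norm(V, 2))                     # spectral norm, <= 1 up to rounding
print(np.linalg.svd(V, compute_uv=False))       # sample canonical correlations of X and Y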
31 3 Representation of Kernel CCA and Related Methods with Cross-covariance Operators With cross-covariance operators for (X, Y), the kernel CCA problem can be formulated as sup_{f∈HX, g∈HY} ⟨g, ΣYX f⟩_HY subject to ⟨f, ΣXX f⟩_HX = 1 and ⟨g, ΣYY g⟩_HY = 1. [sent-143, score-0.146]
32 As with classical CCA (Anderson, 2003, for example), the solution of the above kernel CCA problem is given by the eigenfunctions corresponding to the largest eigenvalue of the following generalized eigenproblem: ΣYX f = ρ1 ΣYY g, (6) ΣXY g = ρ1 ΣXX f. [sent-144, score-0.155]
33 In the empirical case, let φn ∈ HX and ψn ∈ HY be the unit eigenfunctions corresponding to the largest singular value of the finite rank operator (ΣYY^(n) + εn I)^{-1/2} ΣYX^(n) (ΣXX^(n) + εn I)^{-1/2}. [sent-158, score-0.166]
34 As in Eq. (7), the empirical estimators fn and gn of kernel CCA are fn = (ΣXX^(n) + εn I)^{-1/2} φn, gn = (ΣYY^(n) + εn I)^{-1/2} ψn. [sent-160, score-0.448]
35 While we presume that the eigenspaces are one-dimensional in this section, we can easily relax this to multidimensional eigenspaces by considering the eigenspaces corresponding to the largest eigenvalues. [sent-164, score-0.094]
36 The empirical operators and the estimators described above can be expressed using Gram matrices, as is often done in kernel methods. [sent-168, score-0.146]
37 The solutions fn and gn are exactly the same as those given in Bach and Jordan (2002), as we confirm below. [sent-169, score-0.187]
38 Because R(ΣXX^(n)) and R(ΣYY^(n)) are spanned by (ui)_{i=1}^n and (vi)_{i=1}^n, respectively, the eigenfunctions of VYX^(n) are given by linear combinations of the ui and vi. [sent-171, score-0.107]
39 The solution of the kernel CCA problem is fn = (ΣXX^(n) + εn I)^{-1/2} φn = Σ_{i=1}^n ξi ui, gn = (ΣYY^(n) + εn I)^{-1/2} ψn = Σ_{i=1}^n ζi vi, where ξ = √n (GX + nεn In)^{-1/2} α and ζ = √n (GY + nεn In)^{-1/2} β. [sent-173, score-0.305]
40 Note that our theoretical results in the next section still hold with this approximation, because this modification causes only higher-order changes in α and β, which perturb the empirical eigenfunctions φn, ψn, fn, and gn only at higher order. [sent-176, score-0.25]
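The Gram-matrix computation described above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it uses a Gaussian RBF kernel (as in the experiments of Section 4), parametrizes fn and gn by coefficient vectors a and b over the centered kernel sections rather than by the orthonormal-basis coefficients ξ and ζ, and the helper names (rbf_gram, center_gram, kernel_cca_first_pair) and default parameters are our own choices.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gaussian RBF Gram matrix, k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def center_gram(K):
    """Centered Gram matrix G = H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca_first_pair(X, Y, eps, gamma=1.0):
    """First pair of regularized kernel CCA functions, as coefficient vectors.

    f_n = sum_i a[i] (k_X(., X_i) - mean section), g_n likewise with b,
    maximizing the empirical covariance under the regularized normalization
    <f, (Sigma_XX^(n) + eps I) f> = <g, (Sigma_YY^(n) + eps I) g> = 1.
    """
    n = len(X)
    GX = center_gram(rbf_gram(X, gamma))
    GY = center_gram(rbf_gram(Y, gamma))
    RX, RY = GX + n * eps * np.eye(n), GY + n * eps * np.eye(n)
    # The coefficient vector a solves RX^{-1} GY RY^{-1} GX a = rho^2 a.
    A = np.linalg.solve(RX, GY @ np.linalg.solve(RY, GX))
    evals, evecs = np.linalg.eig(A)
    k = int(np.argmax(evals.real))
    rho1 = float(np.sqrt(max(evals[k].real, 0.0)))   # estimated first canonical correlation
    a = evecs[:, k].real
    b = np.linalg.solve(RY, GX @ a) / max(rho1, 1e-12)
    # Enforce the regularized unit-variance normalization for both directions.
    a /= np.sqrt(a @ (GX @ (RX @ a)) / n)
    b /= np.sqrt(b @ (GY @ (RY @ b)) / n)
    return a, b, rho1
```

Calling kernel_cca_first_pair(X, Y, eps=1e-2) returns the two coefficient vectors together with the estimated first canonical correlation; fn is then evaluated at a new point x as Σ_i a_i (k(x, X_i) − (1/n) Σ_j k(x, X_j)).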
41 COCO (Gretton et al., 2005b) uses the unit eigenfunctions of the cross-covariance operator ΣYX. [sent-179, score-0.148]
42 Unlike kernel CCA, COCO normalizes the covariance by the RKHS norms of f and g. [sent-183, score-0.1]
43 On the other hand, kernel CCA may encounter situations where it finds functions with moderately large covariance but very small variances for f (X) or g(Y ), since ΣXX and ΣYY can have arbitrarily small eigenvalues. [sent-186, score-0.1]
44 We will establish the consistency of kernel CCA and NOCCO in the next section, and give experimental comparisons of these methods in Section 4. [sent-191, score-0.099]
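For comparison, both COCO and NOCCO can be computed from the same centered Gram matrices: the largest singular value of the empirical cross-covariance operator (COCO's criterion) and the largest singular value of the regularized operator VYX^(n) (NOCCO's criterion) reduce to ordinary matrix eigenvalue problems. The sketch below is hedged: it assumes the standard centered-Gram conventions, the function names are ours, and center_gram is reused from the earlier sketch.

```python
import numpy as np

def coco_value(GX, GY):
    """Largest singular value of the empirical cross-covariance operator.

    For centered Gram matrices GX, GY this equals sqrt(lambda_max(GX GY)) / n,
    the covariance maximized by COCO over unit-RKHS-norm f and g.
    """
    n = len(GX)
    lam = float(np.max(np.linalg.eigvals(GX @ GY).real))
    return np.sqrt(max(lam, 0.0)) / n

def nocco_value(GX, GY, eps):
    """Largest singular value of the empirical NOCCO operator V_YX^(n).

    Equals sqrt(lambda_max(RX RY)) with RX = GX (GX + n*eps*I)^{-1} and RY defined
    analogously; it coincides with the regularized kernel CCA correlation value.
    """
    n = len(GX)
    RX = GX @ np.linalg.inv(GX + n * eps * np.eye(n))
    RY = GY @ np.linalg.inv(GY + n * eps * np.eye(n))
    lam = float(np.max(np.linalg.eigvals(RX @ RY).real))
    return np.sqrt(max(lam, 0.0))
```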
45 lim_{n→∞} εn = 0, lim_{n→∞} n^{-1/3}/εn = 0. (9) Assume VYX is a compact operator and the eigenspaces which attain the singular value problem max_{φ∈HX, ψ∈HY, ||φ||_HX = ||ψ||_HY = 1} ⟨ψ, VYX φ⟩_HY are one-dimensional. [sent-195, score-0.164]
46 The next main result shows the convergence of kernel CCA in the norm of L2(PX) and L2(PY). [sent-198, score-0.141]
47 Then ||(fn − EX[fn(X)]) − (f − EX[f(X)])||_{L2(PX)} → 0 and ||(gn − EY[gn(Y)]) − (g − EY[g(Y)])||_{L2(PY)} → 0 in probability, as n goes to infinity. [sent-202, score-0.308]
48 While in the above theorems we confine our attention to the first eigenfunctions, it is not difficult to verify the convergence of eigenspaces corresponding to the m-th largest eigenvalue by extending Lemma 10 in the Appendix. [sent-203, score-0.102]
49 It is known (Buja, 1990) that the covariance operator considered on L2 is Hilbert-Schmidt if the mean square contingency is finite. [sent-217, score-0.155]
50 We modify the result to the case of the covariance operator on RKHS. [sent-218, score-0.111]
51 If the mean square contingency C(X, Y) is finite, that is, if ∫∫ pXY(x, y)² / (pX(x) pY(y)) dµX dµY < ∞, then the operator VYX : HX → HY is Hilbert-Schmidt, and ||VYX||_HS ≤ C(X, Y) = ||ζ||_{L2(PX×PY)}. [sent-226, score-0.129]
52 The assumption that the mean square contingency is finite is very natural when we consider the dependence of two different random variables, as in the situation where kernel CCA is applied. [sent-228, score-0.118]
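As a concrete illustration, the displayed integral can be evaluated exactly for a finite discrete joint distribution, where it is always finite. The toy probability table below and the convention ζ = pXY/(pX pY) − 1 (not stated explicitly in the excerpt) are our own assumptions for the example.

```python
import numpy as np

# A toy joint distribution p_XY on {0,1} x {0,1}; the numbers are our own example.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1)                       # marginal of X
p_y = p_xy.sum(axis=0)                       # marginal of Y

# Discrete analogue of the integral: sum_{x,y} p_XY(x,y)^2 / (p_X(x) p_Y(y)).
ratio = p_xy / np.outer(p_x, p_y)
second_moment = float(np.sum(p_xy * ratio))  # always >= 1, and finite here

# Assuming zeta = p_XY/(p_X p_Y) - 1, C(X,Y) = ||zeta||_{L2(P_X x P_Y)}.
contingency = float(np.sqrt(np.sum(np.outer(p_x, p_y) * (ratio - 1.0) ** 2)))
print(second_moment, contingency)            # note contingency^2 = second_moment - 1
```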
53 (2003) discuss the eigendecomposition of the operator VY X . [sent-231, score-0.106]
54 Leurgans et al. (1993) discuss canonical correlation analysis on curves, which are represented by stochastic processes on an interval, and use the Sobolev space of functions with square integrable second derivative. [sent-237, score-0.113]
55 Although the proof can be extended to a general RKHS, the convergence is measured by that of the correlation, ⟨fn, ΣXX f⟩_HX / (⟨f, ΣXX f⟩_HX^{1/2} ⟨fn, ΣXX fn⟩_HX^{1/2}) → 1. [sent-240, score-0.181]
56 Note that in the denominator the population covariance ⟨fn, ΣXX fn⟩_HX is used, which is not computable in practice. [sent-241, score-0.674]
57 The above convergence of correlation is weaker than the L2 convergence in Theorem 2. [sent-242, score-0.092]
58 On the other hand, the convergence of correlation does not imply ⟨(fn − f), ΣXX(fn − f)⟩_HX → 0. [sent-244, score-0.373]
59 From the equality ⟨(fn − f), ΣXX(fn − f)⟩_HX = (⟨fn, ΣXX fn⟩_HX^{1/2} − ⟨f, ΣXX f⟩_HX^{1/2})² + 2 ||ΣXX^{1/2} fn||_HX ||ΣXX^{1/2} f||_HX (1 − ⟨fn, ΣXX f⟩_HX / (||ΣXX^{1/2} fn||_HX ||ΣXX^{1/2} f||_HX)), we require the convergence ⟨fn, ΣXX fn⟩_HX → ⟨f, ΣXX f⟩_HX = 1 in order to guarantee the left hand side converges to zero. [sent-245, score-1.433]
60 With the normalization ⟨fn, (ΣXX^(n) + εn I) fn⟩_HX = ⟨f, ΣXX f⟩_HX = 1, however, the convergence of ⟨fn, ΣXX fn⟩_HX is not clear. [sent-246, score-0.643]
61 We use the stronger assumption n^{-1/3}/εn → 0 to prove ⟨(fn − f), ΣXX(fn − f)⟩_HX → 0 in Theorem 2. [sent-247, score-0.308]
62 We use a synthetic data set for which the optimal nonlinear functions in the population kernel CCA (Eq. [sent-250, score-0.133]
63 For the quantitative evaluation of convergence, we consider only kernel CCA, because the exact solutions for NOCCO or COCO in population are not known in closed form. [sent-252, score-0.106]
64 Thus, the maximum canonical correlation in population is attained by f(x) = 1. [sent-276, score-0.115]
65 We perform kernel CCA, NOCCO, and COCO with the Gaussian RBF kernel k(x, y) = exp(−||x − y||²) on the data. [sent-279, score-0.148]
66 In the plots (e)-(g), kernel CCA gives linearly correlated feature vectors, while NOCCO and COCO do not aim at obtaining linearly correlated vectors. [sent-285, score-0.102]
67 Next, we conduct numerical simulations to verify the convergence rate of kernel CCA. [sent-287, score-0.101]
68 For the estimated functions fn and gn with the data sizes n = 10, 25, 50, 75, 100, 250, 500, 750, and 1000, the L2(PX) and L2(PY) distances between the estimated and true functions are evaluated by generating 10000 samples from the true distribution. [sent-292, score-0.187]
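A schematic version of this convergence check is sketched below. It reuses rbf_gram and kernel_cca_first_pair from the earlier sketch; the data-generating process, the εn schedule (chosen to decay more slowly than n^{-1/3}, as the sufficient condition requires), and the use of a large-sample fit as a stand-in for the unknown population functions are all our own simplifications, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(n):
    """Toy dependent pair (our own construction, not the paper's data set)."""
    t = rng.uniform(-np.pi, np.pi, n)
    X = (np.cos(t) + 0.05 * rng.standard_normal(n))[:, None]
    Y = (np.sin(t) + 0.05 * rng.standard_normal(n))[:, None]
    return X, Y

def evaluate_f(a, X_train, X_new, gamma=1.0):
    """f_n(x) = sum_i a[i] * (k(x, X_i) - (1/n) sum_j k(x, X_j))."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * d2)
    return K @ a - a.sum() * K.mean(axis=1)

X_test, _ = sample_pair(5000)                      # Monte Carlo sample for the L2(P_X) distance
X_ref, Y_ref = sample_pair(1000)                   # large-n surrogate for the population function
a_ref, _, _ = kernel_cca_first_pair(X_ref, Y_ref, eps=0.1 * 1000 ** (-0.25))
f_ref = evaluate_f(a_ref, X_ref, X_test)
f_ref = (f_ref - f_ref.mean()) / f_ref.std()       # center and scale; sign handled below

for n in (50, 100, 250, 500):
    eps_n = 0.1 * n ** (-0.25)                     # decays more slowly than n^{-1/3}
    Xn, Yn = sample_pair(n)
    a_n, _, _ = kernel_cca_first_pair(Xn, Yn, eps_n)
    f_n = evaluate_f(a_n, Xn, X_test)
    f_n = (f_n - f_n.mean()) / f_n.std()
    err = np.sqrt(min(np.mean((f_n - f_ref) ** 2), np.mean((f_n + f_ref) ** 2)))
    print(n, round(float(err), 3))                 # L2(P_X)-type error, should shrink with n
```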
69 The plots of the transformed data ( fn (Xi ), gn (Yi )) are given in (e)-(g). [sent-334, score-0.187]
70 Note that among (e)-(g) a clear linear correlation is seen in (e) only because that is the criterion of kernel CCA; the other two methods use different criteria, but still show strong correlation. [sent-335, score-0.112]
71 Let H be an RKHS on a measurable set X with a measurable positive definite kernel k. [sent-368, score-0.114]
72 2 Preliminary Lemmas For the proof of the main theorems, we show the empirical estimate VYX^(n) converges in norm to the normalized cross-covariance operator VYX = ΣYY^{-1/2} ΣYX ΣXX^{-1/2} for an appropriate order of the regularization coefficient εn. [sent-414, score-0.165]
73 Because φ and ψ are the eigenfunctions corresponding to the largest eigenvalue of VXY VYX and VYX VXY, respectively, and a similar fact holds for φn and ψn, the assertion is obtained by Lemma 10 in the Appendix. [sent-455, score-0.095]
74 Proof of Theorem 2 We show only the convergence of fn . [sent-456, score-0.181]
75 The squared L2(PX) distance between fn − EX[fn(X)] and f − EX[f(X)] is given by ||ΣXX^{1/2}(fn − f)||²_HX = ||ΣXX^{1/2} fn||²_HX − 2⟨φ, ΣXX^{1/2} fn⟩_HX + ||φ||²_HX. [sent-458, score-0.77]
76 Thus, it suffices to show ΣXX^{1/2} fn converges to φ ∈ HX in probability. [sent-459, score-0.174]
77 We have ||ΣXX^{1/2} fn − φ||_HX ≤ ||ΣXX^{1/2} {(ΣXX^(n) + εn I)^{-1/2} − (ΣXX + εn I)^{-1/2}} φn||_HX + ||ΣXX^{1/2} (ΣXX + εn I)^{-1/2} (φn − φ)||_HX + ||ΣXX^{1/2} (ΣXX + εn I)^{-1/2} φ − φ||_HX. [sent-460, score-0.154]
78 Concluding Remarks We have established the statistical convergence of kernel CCA and NOCCO, showing that the finite sample estimators of the relevant nonlinear mappings converge to the desired population functions. [sent-475, score-0.16]
79 This convergence is proved in the RKHS norm for NOCCO, and in the L2 norm for kernel CCA. [sent-476, score-0.181]
80 A result relevant to the convergence of kernel principal component analysis (KPCA) has recently been obtained by Zwald and Blanchard (2006). [sent-482, score-0.101]
81 If a parameterized family of kernels such as the Gaussian RBF kernel is provided, then cross-validation might be a reasonable method to select the best kernel (see Leurgans et al. [sent-492, score-0.148]
82 One of the methods related to kernel CCA is independent component analysis (ICA), since Bach and Jordan (2002) use kernel CCA in their kernel ICA algorithm. [sent-494, score-0.222]
83 The theoretical results developed in this paper will work as a basis for analyzing the properties of the kernel ICA algorithm; in particular, for demonstrating statistical consistency of the estimator. [sent-495, score-0.099]
84 Since ICA estimates the demixing matrix as a parameter, however, we need to consider covariance operators parameterized by this matrix, and must discuss how convergence of the objective function depends on the parameter. [sent-496, score-0.125]
85 It is not a straightforward task to obtain consistency of kernel ICA from the results of this paper. [sent-497, score-0.099]
86 A bounded operator T : H1 → H2 is called compact if for every bounded sequence { f n } ⊂ H1 the image {T f n } has a subsequence which converges in H2 . [sent-508, score-0.132]
87 For a compact operator T : H1 → H2, there exist N ∈ ℕ ∪ {∞}, a non-increasing sequence of positive numbers {λi}_{i=1}^N, and (not necessarily complete) orthonormal systems {φi}_{i=1}^N ⊂ H1 and {ψi}_{i=1}^N ⊂ H2 such that T = Σ_{i=1}^N λi ⟨φi, ·⟩_H1 ψi. [sent-511, score-0.112]
88 For two Hilbert-Schmidt operators T1 and T2, the Hilbert-Schmidt inner product is defined by ⟨T1, T2⟩_HS = Σ_{i=1}^∞ ⟨T1 ϕi, T2 ϕi⟩_H2, with which the set of all Hilbert-Schmidt operators from H1 to H2 is a Hilbert space. [sent-516, score-0.144]
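In finite dimensions, compact operators are just matrices, the representation above is the singular value decomposition, and the Hilbert-Schmidt norm and inner product become the Frobenius ones. The small numerical check below, with matrices of our own choosing, makes this concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
T1 = rng.standard_normal((4, 3))     # operator from H1 = R^3 to H2 = R^4
T2 = rng.standard_normal((4, 3))

# T = sum_i lambda_i <phi_i, .>_{H1} psi_i is exactly the SVD in finite dimensions.
U, s, Vt = np.linalg.svd(T1, full_matrices=False)
T_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(T1, T_rebuilt)

# <T1, T2>_HS = sum_i <T1 e_i, T2 e_i>_{H2} over an orthonormal basis {e_i} of H1,
# which in matrix form is the Frobenius (trace) inner product.
hs_inner = sum(T1[:, i] @ T2[:, i] for i in range(T1.shape[1]))
assert np.isclose(hs_inner, np.trace(T1.T @ T2))

# ||T1||_HS^2 equals the sum of squared singular values.
assert np.isclose(np.sum(T1 * T1), np.sum(s ** 2))
```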
89 In fact, because direct differentiation yields b0 = 1, b1 = −3/2, and bn > 0 for n ≥ 2, the inequality Σ_{n=0}^N |bn| = 1 + 3/2 + Σ_{n=2}^N bn = 1 + 3/2 + lim_{x↑1} Σ_{n=2}^N bn x^n ≤ 1 + 3/2 + lim_{x↑1} (f(x) − 1 + (3/2)x) = 3 shows the convergence of Σ_{n=0}^∞ bn z^n for |z| = 1. [sent-528, score-0.149]
90 Suppose An and A are bounded operators on H2, and B is a compact operator from H1 to H2 such that An u → Au for all u ∈ H0, and sup_n ||An|| ≤ M for some M > 0. [sent-537, score-0.184]
91 Next, assume that the operator norm ||An B − AB|| does not converge to zero. [sent-543, score-0.125]
92 Because B is compact and ||vn||_H1 = 1, there exist a subsequence (un) and u∗ ∈ H2 such that un → u∗. [sent-547, score-0.1]
93 Lemma 10 Let A be a compact positive operator on a Hilbert space H, and An (n ∈ ℕ) be bounded positive operators on H such that An converges to A in norm. [sent-550, score-0.204]
94 On the other hand, the convergence |⟨fn, A fn⟩ − ⟨φ1, Aφ1⟩| ≤ |⟨fn, A fn⟩ − ⟨fn, An fn⟩| + |⟨fn, An fn⟩ − ⟨φ1, Aφ1⟩| ≤ ||A − An|| + ||An − A|| → 0 implies that ⟨fn, A fn⟩ must converge to ρ1. [sent-556, score-1.433]
95 Note that from the norm convergence Qn An Qn → QAQ, where Qn and Q are the orthogonal projections onto the orthogonal complements of φn and φ, respectively, we have convergence of the eigenvector corresponding to the second eigenvalue. [sent-558, score-0.094]
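A finite-dimensional numerical illustration of this perturbation argument (the matrices below are our own toy example, not part of the proof): when the positive operators An converge to A in norm and the largest eigenvalue of A is simple, the corresponding unit eigenvectors converge up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)

# A positive "operator" with a simple largest eigenvalue.
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([3.0, 1.5, 1.0, 0.7, 0.3, 0.1]) @ Q.T
phi1 = np.linalg.eigh(A)[1][:, -1]                 # unit eigenvector of the top eigenvalue

for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    E = rng.standard_normal((6, 6))
    E = (E + E.T) / 2.0
    An = A + delta * E / np.linalg.norm(E, 2)      # ||An - A|| = delta in operator norm
    fn = np.linalg.eigh(An)[1][:, -1]
    err = min(np.linalg.norm(fn - phi1), np.linalg.norm(fn + phi1))  # sign ambiguity
    print(delta, float(err))                       # eigenvector error shrinks with delta
```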
96 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-596, score-0.092]
97 Nonlinear component analysis as a kernel eigenvalue problem. [sent-666, score-0.092]
98 An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. [sent-669, score-0.118]
99 Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. [sent-686, score-0.171]
100 On the convergence of eigenspaces in kernel principal component analysis. [sent-689, score-0.135]
wordName wordTfidf (topN-words)
[('hx', 0.437), ('yy', 0.407), ('xx', 0.392), ('cca', 0.384), ('hy', 0.247), ('vy', 0.193), ('fn', 0.154), ('kx', 0.139), ('ky', 0.13), ('bach', 0.116), ('nocco', 0.112), ('px', 0.094), ('py', 0.092), ('pxy', 0.091), ('operator', 0.085), ('coco', 0.084), ('hs', 0.077), ('retton', 0.077), ('tatistical', 0.077), ('ukumizu', 0.077), ('kernel', 0.074), ('operators', 0.072), ('onsistency', 0.065), ('rkhs', 0.063), ('eigenfunctions', 0.063), ('ey', 0.058), ('gretton', 0.053), ('gx', 0.053), ('ernel', 0.049), ('hilbert', 0.049), ('canonical', 0.045), ('leurgans', 0.042), ('norm', 0.04), ('ex', 0.038), ('correlation', 0.038), ('mx', 0.037), ('eigenspaces', 0.034), ('gn', 0.033), ('fi', 0.033), ('population', 0.032), ('fukumizu', 0.032), ('contingency', 0.029), ('exy', 0.028), ('un', 0.028), ('gi', 0.028), ('compact', 0.027), ('cov', 0.027), ('convergence', 0.027), ('nonlinear', 0.027), ('fg', 0.026), ('francis', 0.026), ('spaces', 0.026), ('gy', 0.026), ('bn', 0.026), ('covariance', 0.026), ('consistency', 0.025), ('arthur', 0.024), ('var', 0.024), ('theorems', 0.023), ('ui', 0.023), ('ica', 0.022), ('lemma', 0.022), ('vi', 0.021), ('eigendecomposition', 0.021), ('ry', 0.021), ('suetani', 0.021), ('zwald', 0.021), ('functional', 0.021), ('decay', 0.02), ('converges', 0.02), ('regularization', 0.02), ('measurable', 0.02), ('xy', 0.019), ('eigenvalue', 0.018), ('reproducing', 0.018), ('singular', 0.018), ('zn', 0.018), ('bx', 0.018), ('mf', 0.018), ('gj', 0.018), ('kcca', 0.018), ('reed', 0.018), ('vn', 0.017), ('lemmas', 0.017), ('baker', 0.016), ('kenji', 0.016), ('fj', 0.016), ('supx', 0.016), ('rx', 0.016), ('jordan', 0.015), ('square', 0.015), ('qn', 0.015), ('integrable', 0.015), ('riesz', 0.015), ('assertion', 0.014), ('correlated', 0.014), ('bvn', 0.014), ('chaotic', 0.014), ('dpxy', 0.014), ('dpy', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton
Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
3 0.094470613 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes
4 0.07939513 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.055908535 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik's statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer's lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
6 0.054301061 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
7 0.04572523 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
8 0.036163889 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
9 0.035616491 70 jmlr-2007-Ranking the Best Instances
11 0.033032537 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
12 0.029898657 71 jmlr-2007-Refinable Kernels
13 0.028774517 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
14 0.028479042 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
15 0.026694557 9 jmlr-2007-AdaBoost is Consistent
16 0.025695434 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
17 0.02557935 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
18 0.025211465 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.024670959 35 jmlr-2007-General Polynomial Time Decomposition Algorithms (Special Topic on the Conference on Learning Theory 2005)
20 0.023598138 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
topicId topicWeight
[(0, 0.143), (1, -0.187), (2, 0.151), (3, 0.063), (4, -0.014), (5, 0.053), (6, -0.046), (7, 0.014), (8, -0.007), (9, 0.017), (10, 0.08), (11, 0.041), (12, -0.056), (13, 0.033), (14, -0.02), (15, 0.006), (16, -0.052), (17, 0.08), (18, 0.006), (19, 0.182), (20, 0.236), (21, 0.072), (22, 0.306), (23, -0.094), (24, -0.011), (25, -0.064), (26, 0.2), (27, -0.14), (28, -0.091), (29, 0.187), (30, -0.247), (31, -0.152), (32, 0.088), (33, -0.006), (34, -0.015), (35, -0.144), (36, 0.053), (37, -0.12), (38, -0.046), (39, -0.093), (40, 0.022), (41, -0.08), (42, 0.029), (43, 0.028), (44, 0.135), (45, -0.014), (46, 0.047), (47, 0.25), (48, -0.089), (49, -0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.97187907 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton
Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
3 0.30473071 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes
4 0.28075618 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
5 0.24221537 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
6 0.20282349 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
7 0.1993378 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
8 0.18015289 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
9 0.17320445 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
10 0.13951641 70 jmlr-2007-Ranking the Best Instances
12 0.12429556 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
13 0.12038075 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
14 0.12009774 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
15 0.11238873 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
17 0.10496356 15 jmlr-2007-Bilinear Discriminant Component Analysis
18 0.10400299 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
19 0.10320396 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
20 0.10049302 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
topicId topicWeight
[(4, 0.026), (8, 0.062), (12, 0.043), (28, 0.039), (40, 0.039), (45, 0.01), (48, 0.02), (60, 0.07), (77, 0.015), (85, 0.027), (91, 0.416), (98, 0.098)]
simIndex simValue paperId paperTitle
same-paper 1 0.74739015 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton
Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space
2 0.3099955 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
Author: Jean-Yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds
3 0.3051458 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
4 0.30461401 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
5 0.30260861 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
Author: Sanjoy Dasgupta, Leonard Schulman
Abstract: We show that, given data from a mixture of k well-separated spherical Gaussians in R d , a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to nearoptimal precision, if the dimension is high (d ln k). We relate this to previous theoretical and empirical work on the EM algorithm. Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis
6 0.30240822 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.30019265 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
11 0.29877687 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
12 0.29769745 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
13 0.29713321 43 jmlr-2007-Integrating Naïve Bayes and FOIL
14 0.29686365 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
15 0.29570219 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
16 0.29361349 71 jmlr-2007-Refinable Kernels
17 0.29249638 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.29245174 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
19 0.29105598 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
20 0.28940362 9 jmlr-2007-AdaBoost is Consistent