jmlr jmlr2010 jmlr2010-82 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
Reference: text
sentIndex sentText sentNum sentScore
1 Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. [sent-11, score-0.441]
2 Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods 1. [sent-13, score-0.278]
3 Introduction A broad variety of methods for machine learning and data analysis from Principal Components Analysis (PCA) to Kernel PCA, Laplacian-based spectral clustering and manifold methods, rely on estimating eigenvalues and eigenvectors of certain data-dependent matrices. [sent-14, score-0.374]
4 In many cases these matrices can be interpreted as empirical versions of underlying integral operators or closely related objects, such as continuous Laplacian operators. [sent-15, score-0.317]
5 Thus establishing connections between empirical operators and their continuous counterparts is essential to understanding these algorithms. [sent-16, score-0.263]
6 In this paper, we propose a method for analyzing empirical operators based on concentration inequalities in Hilbert spaces. [sent-17, score-0.312]
7 Here we build on this approach to provide a detailed and comprehensive study of perturbation results for empirical estimates of integral operators as well as empirical graph Laplacians. [sent-28, score-0.42]
8 The first study of this problem appeared in Koltchinskii and Giné (2000) and Koltchinskii (1998), where the authors consider integral operators defined by a kernel. [sent-30, score-0.317]
9 In Koltchinskii and Giné (2000) the authors study the relation between the spectrum of an integral operator with respect to a probability distribution and its (modified) empirical counterpart in the framework of U-statistics. [sent-31, score-0.323]
10 In Koltchinskii (1998) similar results were obtained for convergence of eigenfunctions and, using the triangle inequality, for spectral projections. [sent-35, score-0.348]
11 Finally, convergence of eigenvalues and eigenfunctions for the case of the uniform distribution was shown in Belkin and Niyogi (2007). [sent-61, score-0.375]
12 (2008), where operators are defined on the space of continuous functions. [sent-66, score-0.263]
13 The analysis is performed in the context of perturbation theory in Banach spaces and bounds on individual eigenfunctions are derived. [sent-67, score-0.301]
14 More precisely, we show that a finite rank (extension) operator acting on the Reproducing Kernel Hilbert space H defined by K can be naturally associated with the empirical kernel matrix: the two operators have the same eigenvalues and related eigenvectors/eigenfunctions. [sent-73, score-0.728]
15 The kernel matrix and its extension can be seen as compositions of suitable restriction and extension operators that are explicitly defined by the kernel. [sent-74, score-0.371]
16 We can use concentration inequalities for operator valued random variables and perturbation results to derive concentration results for eigenvalues (taking into account the multiplicity), as well as for the sums of eigenvalues. [sent-76, score-0.617]
17 Moreover, using a perturbation result for spectral projections, we derive finite sample bounds on the deviation of the spectral projections associated with the k largest eigenvalues. [sent-77, score-0.423]
18 We describe explicitly restriction and extension operators and introduce a finite rank operator with spectral properties related to those of the graph Laplacian. [sent-82, score-0.696]
19 We study the behavior of eigenvalues as well as the deviation of the corresponding spectral projections with respect to the Hilbert-Schmidt norm. [sent-84, score-0.377]
20 To obtain explicit estimates for spectral projections we generalize the perturbation result in Zwald and Blanchard (2006) to deal with non-self-adjoint operators. [sent-85, score-0.276]
21 In this case we have to deal with non-self-adjoint operators and the functional analysis becomes more involved. [sent-87, score-0.263]
22 We recall some facts about the properties of linear operators in Hilbert spaces, such as their spectral theory and some perturbation results, and discuss some concentration inequalities in Hilbert spaces. [sent-95, score-0.536]
23 We study concentration of operators obtained by an out-of-sample extension using the kernel function and obtain considerably simplified derivations of several existing results on eigenvalues and eigenfunctions. [sent-98, score-0.571]
24 In fact, in Section 4, we apply these methods to prove convergence of eigenvalues and eigenvectors of the asymmetric graph Laplacian defined by a fixed weight function. [sent-100, score-0.308]
25 The set of bounded operators on H is a Banach space with respect to the operator norm $\|A\|_{H,H} = \|A\| = \sup_{\|f\|=1} \|Af\|$. [sent-109, score-0.597]
26 If A is a bounded operator, we let $A^*$ be its adjoint, which is a bounded operator with $\|A^*\| = \|A\|$. [sent-110, score-0.319]
27 A bounded operator A is Hilbert-Schmidt if $\sum_{j\ge1} \|Ae_j\|^2 < \infty$ for some (any) Hilbert basis $(e_j)_{j\ge1}$. [sent-111, score-0.275]
28 The space of Hilbert-Schmidt operators is also a Hilbert space (a fact which will be key in our development) endowed with the scalar product $\langle A, B\rangle_{HS} = \sum_j \langle Ae_j, Be_j\rangle$, and we denote by $\|\cdot\|_{HS}$ the corresponding norm. [sent-112, score-0.263]
29 We say that a bounded operator A is trace class if $\sum_{j\ge1} \langle \sqrt{A^*A}\, e_j, e_j\rangle < \infty$ for some (any) Hilbert basis $(e_j)_{j\ge1}$, where $\sqrt{A^*A}$ is the square root of the positive operator $A^*A$ defined by the spectral theorem (see, for example, Lang, 1993). [sent-115, score-0.687]
30 The space of trace class operators is a Banach space endowed with the norm $\|A\|_{TC} = \mathrm{Tr}(\sqrt{A^*A})$. [sent-117, score-0.295]
31 Trace class operators are also Hilbert-Schmidt (hence compact). [sent-118, score-0.263]
32 It can also be shown that for any Hilbert-Schmidt operator A and bounded operator B we have $\|AB\|_{HS} \le \|A\|_{HS}\,\|B\|$ and $\|BA\|_{HS} \le \|B\|\,\|A\|_{HS}$. [sent-120, score-0.506]
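The norm relations above are easy to sanity-check in the finite-dimensional case, where the Hilbert-Schmidt and trace-class norms reduce to the Frobenius and nuclear norms; the sketch below is an added illustration, not part of the paper.

```python
# Finite-dimensional sanity check: operator norm, Hilbert-Schmidt (Frobenius) norm,
# trace-class (nuclear) norm, and the inequality ||AB||_HS <= ||A||_HS ||B||.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))

op_B = np.linalg.norm(B, ord=2)          # operator norm = largest singular value
hs_A = np.linalg.norm(A, ord='fro')      # Hilbert-Schmidt norm = sqrt(sum of squared singular values)
tc_A = np.linalg.norm(A, ord='nuc')      # trace-class norm = sum of singular values

assert np.linalg.norm(A @ B, ord='fro') <= hs_A * op_B + 1e-10
assert hs_A <= tc_A + 1e-10              # trace class implies Hilbert-Schmidt
print(op_B, hs_A, tc_A)
```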
33 When A is a bounded operator, $\|A\|$ always denotes the operator norm. [sent-123, score-0.275]
34 If the bounded operator A is self-adjoint ($A = A^*$, analogous to a Hermitian matrix in the finite-dimensional case), the eigenvalues are real. [sent-134, score-0.451]
35 A key result, known as the Spectral Theorem, ensures that $A = \sum_{i=1}^{\infty} \lambda_i P_{\lambda_i}$, where $P_\lambda$ is the orthogonal projection operator onto the eigenspace associated with $\lambda$. [sent-136, score-0.281]
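For a finite symmetric matrix the Spectral Theorem can be verified directly; the following sketch (added here for illustration) rebuilds A from its eigenvalues and the rank-one projections onto its eigenvectors.

```python
# Minimal illustration of A = sum_i lambda_i * P_{lambda_i} for a real symmetric matrix
# with simple eigenvalues, so each projection is rank one.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                                     # symmetric, hence self-adjoint

lam, U = np.linalg.eigh(A)                            # real eigenvalues, orthonormal eigenvectors
P = [np.outer(U[:, i], U[:, i]) for i in range(4)]    # orthogonal projections onto eigenspaces
assert np.allclose(A, sum(l * p for l, p in zip(lam, P)))
```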
36 Remark 2 Though in manifold and spectral learning we typically work with real valued functions, in this paper we will consider complex vector spaces to be able to use certain results from the spectral theory of (possibly non self-adjoint) operators. [sent-143, score-0.362]
37 It is also well known (Aronszajn, 1950) that any conjugate symmetric positive definite kernel K uniquely defines a reproducing kernel Hilbert space whose reproducing kernel is K. [sent-149, score-0.28]
38 Let A : H → H be a compact operator; an extended enumeration is a sequence of real numbers in which every nonzero eigenvalue of A appears as many times as its multiplicity and the other values (if any) are zero. [sent-178, score-0.276]
39 The following result, due to Kato (1987), is an extension to infinite-dimensional operators of an inequality due to Lidskii for finite rank operators. [sent-181, score-0.313]
40 Let $(\gamma_j)_{j\ge1}$ be an enumeration of discrete eigenvalues of B − A; then there exist extended enumerations $(\beta_j)_{j\ge1}$ and $(\alpha_j)_{j\ge1}$ of discrete eigenvalues of B and A respectively such that $\sum_{j\ge1} \phi(\beta_j - \alpha_j) \le \sum_{j\ge1} \phi(\gamma_j)$, where $\phi$ is any nonnegative convex function with $\phi(0) = 0$. [sent-183, score-0.424]
41 Given an integer N, let $m_N$ be the sum of the multiplicities of the first N nonzero top eigenvalues of A; it is possible to prove that the sequences $(\alpha_j)_{j\ge1}$ and $(\beta_j)_{j\ge1}$ in the above proposition can be chosen in such a way that $\alpha_1 \ge \alpha_2 \ge \dots$ [sent-187, score-0.277]
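In the finite-dimensional symmetric case with $\phi(t) = t^2$, the statement reduces to the classical Hoffman-Wielandt/Lidskii inequality, which is easy to check numerically; the sketch below is an added illustration under that simplification.

```python
# Symmetric finite-dimensional special case of the inequality above, with phi(t) = t^2:
# with the eigenvalues of A and B sorted in decreasing order,
#   sum_j (beta_j - alpha_j)^2  <=  sum_j gamma_j^2  =  ||B - A||_HS^2 .
import numpy as np

rng = np.random.default_rng(2)
M1 = rng.standard_normal((6, 6))
M2 = 0.1 * rng.standard_normal((6, 6))
A = (M1 + M1.T) / 2
B = A + (M2 + M2.T) / 2

alpha = np.sort(np.linalg.eigvalsh(A))[::-1]
beta = np.sort(np.linalg.eigvalsh(B))[::-1]
gamma = np.linalg.eigvalsh(B - A)

assert np.sum((beta - alpha) ** 2) <= np.sum(gamma ** 2) + 1e-12
```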
42 To control the spectral projections associated with one or more eigenvalues we need the following perturbation result due to Zwald and Blanchard (2006) (see also Theorem 20 in Section 4. [sent-195, score-0.452]
43 Let A be a positive compact operator such that the number of eigenvalues is infinite. [sent-197, score-0.454]
44 Given $N \in \mathbb{N}$, let $P_N^A$ be the orthogonal projection onto the eigenvectors corresponding to the top N distinct eigenvalues $\alpha_1 > \dots$ [sent-198, score-0.275]
45 Given an integer N, if B is another compact positive operator such that $\|A - B\| \le \frac{\alpha_N - \alpha_{N+1}}{4}$, then $\|P_D^B - P_N^A\| \le 2\,\frac{\|A - B\|}{\alpha_N - \alpha_{N+1}}$, where the integer D is such that the dimension of the range of $P_D^B$ is equal to the dimension of the range of $P_N^A$. [sent-203, score-0.278]
46 If A and B are Hilbert-Schmidt, in the above bound the operator norm can be replaced by the Hilbert-Schmidt norm. [sent-204, score-0.263]
47 We note that a bound on the projections associated with simple eigenvalues implies that the corresponding eigenvectors are close since, if u and v are taken to be normalized and such that $\langle u, v\rangle > 0$, then the following inequality holds: $\|P_u - P_v\|_{HS}^2 = 2(1 - \langle u, v\rangle^2) \ge 2(1 - \langle u, v\rangle) = \|u - v\|^2$. [sent-205, score-0.302]
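The identity and inequality in the last display can be checked directly for a pair of unit vectors; the snippet below is a small added verification.

```python
# For unit vectors u, v with <u, v> > 0:
#   ||P_u - P_v||_HS^2 = 2(1 - <u, v>^2) >= 2(1 - <u, v>) = ||u - v||^2 .
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(8); u /= np.linalg.norm(u)
v = rng.standard_normal(8); v /= np.linalg.norm(v)
if u @ v < 0:
    v = -v                                   # enforce <u, v> > 0

Pu, Pv = np.outer(u, u), np.outer(v, v)
lhs = np.linalg.norm(Pu - Pv, 'fro') ** 2
c = float(u @ v)
assert np.isclose(lhs, 2 * (1 - c ** 2))
assert lhs >= np.linalg.norm(u - v) ** 2 - 1e-12
```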
48 Since $K(x, x) \le \kappa$ by assumption, the corresponding integral operator $L_K : L^2(X, \rho) \to L^2(X, \rho)$, $(L_K f)(x) = \int_X K(x, s) f(s)\, d\rho(s)$, is a bounded operator. [sent-210, score-0.329]
49 To overcome this difficulty we let H be the RKHS associated with K and define the operators $T_H, T_n : H \to H$ given by $T_H = \int_X \langle\cdot, K_x\rangle K_x\, d\rho(x)$ (4) and $T_n = \frac{1}{n}\sum_{i=1}^n \langle\cdot, K_{x_i}\rangle K_{x_i}$ (5). [sent-223, score-0.263]
50 Note that $T_H$ is the integral operator with kernel K with range and domain H rather than $L^2(X, \rho)$. [sent-224, score-0.343]
51 In the next subsection, we discuss a parallel with the Singular Value Decomposition for matrices and show that $T_H$ and $L_K$ have the same eigenvalues (possibly, up to some zero eigenvalues) and the corresponding eigenfunctions are closely related. [sent-250, score-0.375]
52 Theorem 7 The operators TH and Tn are Hilbert-Schmidt. [sent-256, score-0.263]
53 Proof We introduce a sequence $(\xi_i)_{i=1}^n$ of random variables in the Hilbert space of Hilbert-Schmidt operators by $\xi_i = \langle\cdot, K_{x_i}\rangle K_{x_i} - T_H$. [sent-258, score-0.263]
54 The same formalism applies more generally to operators and allows us to connect the spectral properties of LK and TH as well as the matrix K and the operator Tn . [sent-297, score-0.643]
55 The operators LK and TH are positive, self-adjoint and trace class. [sent-305, score-0.263]
56 If $\sigma$ is a nonzero eigenvalue and u, v are the associated eigenfunctions of $L_K$ and $T_H$ (normalized to norm 1 in $L^2(X, \rho)$ and H) respectively, then $u(x) = \frac{1}{\sqrt{\sigma}}\, v(x)$ for $\rho$-almost all $x \in X$, and $v(x) = \frac{1}{\sqrt{\sigma}} \int_X K(x, s)\, u(s)\, d\rho(s)$ for all $x \in X$. [sent-309, score-0.361]
57 The following decompositions hold: $L_K g = \sum_{j\ge1} \sigma_j \langle g, u_j\rangle_\rho\, u_j$ for $g \in L^2(X, \rho)$ and $T_H f = \sum_{j\ge1} \sigma_j \langle f, v_j\rangle_H\, v_j$ for $f \in H$, where the eigenfunctions $(u_j)_{j\ge1}$ of $L_K$ form an orthonormal basis of $(\ker L_K)^\perp$ and the eigenfunctions $(v_j)_{j\ge1}$ of $T_H$ form an orthonormal basis for $(\ker T_H)^\perp$. [sent-311, score-0.621]
58 Moreover, if $\hat\sigma$ is a nonzero eigenvalue and $\hat u$, $\hat v$ are the corresponding eigenvector and eigenfunction of K and $T_n$ (normalized to norm 1 in $\mathbb{C}^n$ and H) respectively, then $\hat u = \frac{1}{\sqrt{n\hat\sigma}}(\hat v(x_1), \dots$ [sent-333, score-0.272]
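The correspondence just stated can be illustrated numerically. The sketch below uses a Gaussian kernel, and the normalization $1/\sqrt{n\hat\sigma}$ is this editor's reading of the garbled display; treat the scaling as an assumption rather than a quotation of the paper.

```python
# Sketch: an eigenvector u_hat of the normalized kernel matrix (1/n) K(x_i, x_j)
# extends to v_hat(x) = (1 / sqrt(n * sigma_hat)) * sum_i u_hat[i] * K(x, x_i),
# an eigenfunction of T_n, with v_hat(x_j) = sqrt(n * sigma_hat) * u_hat[j] at the samples.
import numpy as np

def kern(a, b, s=1.0):                      # Gaussian kernel, chosen only for illustration
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(4)
n = 50
X = rng.uniform(-1, 1, size=(n, 2))
K = kern(X, X) / n                          # entries K(x_i, x_j) / n

spec, U = np.linalg.eigh(K)
sigma_hat, u_hat = spec[-1], U[:, -1]       # top eigenpair

def v_hat(x):
    k = kern(np.atleast_2d(x), X)[0]        # (K(x, x_1), ..., K(x, x_n))
    return (k @ u_hat) / np.sqrt(n * sigma_hat)

vals = np.array([v_hat(x) for x in X])
assert np.allclose(vals, np.sqrt(n * sigma_hat) * u_hat)
```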
59 Note that in this section $L_K$, $T_H$ and $T_n$ are self-adjoint operators and K is a conjugate symmetric matrix. [sent-345, score-0.263]
60 Proposition 10 There exist an extended enumeration $(\sigma_j)_{j\ge1}$ of discrete eigenvalues for $L_K$ and an extended enumeration $(\hat\sigma_j)_{j\ge1}$ of discrete eigenvalues for K such that $\sum_{j\ge1} (\sigma_j - \hat\sigma_j)^2 \le \frac{8\kappa^2\tau}{n}$, with confidence greater than $1 - 2e^{-\tau}$. [sent-350, score-0.44]
61 Proof By Proposition 8, an extended enumeration $(\sigma_j)_{j\ge1}$ of discrete eigenvalues for $L_K$ is also an extended enumeration of discrete eigenvalues for $T_H$, and a similar relation holds for K and $T_n$ by Proposition 9. [sent-352, score-0.44]
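A rough Monte Carlo illustration of this concentration is sketched below; since the population operator $L_K$ is not available in closed form, its spectrum is approximated by a large reference sample, which is only a proxy, and the kernel and sampling distribution are arbitrary choices of this illustration.

```python
# Empirical illustration: the l2 distance between the top eigenvalues of (1/n) K
# and a large-sample reference spectrum should shrink roughly like 1/sqrt(n).
import numpy as np

def top_spectrum(X, k=20, s=0.5):
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * s ** 2)) / n
    return np.sort(np.linalg.eigvalsh(K))[::-1][:k]

rng = np.random.default_rng(5)
ref = top_spectrum(rng.uniform(-1, 1, size=(2000, 1)))   # proxy for the spectrum of L_K
for n in (50, 200, 800):
    errs = [np.linalg.norm(top_spectrum(rng.uniform(-1, 1, size=(n, 1))) - ref)
            for _ in range(10)]
    print(n, np.mean(errs))                              # decreases as n grows
```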
62 …, $u_m$ are the eigenfunctions with eigenvalues $\sigma_1, \sigma_2, \dots$ [sent-385, score-0.375]
63 Since the sum on i is with respect to the eigenfunctions of $T_H$ with nonzero eigenvalue, the Mercer theorem implies that $\langle v_i, \hat v_j\rangle_H = \langle u_i, \hat v_j\rangle_\rho$. [sent-398, score-0.315]
64 Finally, observe that $\sum_{i=1}^{m} |\langle u_i, \hat v_j\rangle_\rho|^2 = \|P_N \hat v_j\|_\rho^2$ and $\sum_{i\ge m+1,\ L_K u_i \ne 0} |\langle u_i, \hat v_j\rangle_\rho|^2 = \|(I - P_N)\hat v_j\|_\rho^2$, where we used that $\ker T_H \subset \ker T_n$, so that $\hat v_j \in (\ker L_K)^\perp$ with probability 1. [sent-399, score-0.466]
65 To state the properties of H we define the following functions: $K_x : X \to \mathbb{C}$, $K_x(t) = K(t, x)$; $W_x : X \to \mathbb{R}$, $W_x(t) = W(t, x)$; and $m_n : X \to \mathbb{R}$, $m_n = \frac{1}{n}\sum_{i=1}^n W_{x_i}$, where $m_n$ is the empirical counterpart of the function m and, in particular, $m_n(x_i) = D_{ii}$. [sent-430, score-0.76]
66 Assumption 1 Given a continuous weight function W satisfying (7), we assume there exists an RKHS H with bounded continuous kernel K such that $W_x,\ \frac{1}{m}W_x,\ \frac{1}{m_n}W_x \in H$ and $\|\frac{1}{m}W_x\|_H \le C$, for all $x \in X$. [sent-435, score-0.317]
67 Assumption 1 allows us to define the following bounded operators $L_H, A_H : H \to H$, with $A_H = \int_X \langle\cdot, K_x\rangle_H\, \frac{1}{m} W_x\, d\rho(x)$ and $L_H = I - A_H$, and their empirical counterparts $L_n, A_n : H \to H$, with $A_n = \frac{1}{n}\sum_{i=1}^n \langle\cdot, K_{x_i}\rangle_H\, \frac{1}{m_n} W_{x_i}$ and $L_n = I - A_n$. [sent-436, score-0.497]
68 The next result will show that $L_H$, $A_H$ and L have related eigenvalues and eigenfunctions and that eigenvalues and eigenfunctions (eigenvectors) of $L_n$, $A_n$ and L are also closely related. [sent-437, score-0.75]
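A finite-sample sketch of these objects is given below: a symmetric weight function W, the empirical degree function $m_n$, and the asymmetric normalized Laplacian, read here as $L = I - \frac{1}{n} D^{-1} W$ with $D_{ii} = m_n(x_i)$. The concrete weight function and the exact normalization are this editor's assumptions.

```python
# Build the empirical graph Laplacian L = I - (1/n) D^{-1} W with D_ii = m_n(x_i):
# although L is not symmetric, it is similar to a symmetric matrix, so its eigenvalues
# are real and nonnegative (with 0 attained by the constant vector).
import numpy as np

def W(a, b, s=0.5):                          # a smooth symmetric positive weight function (assumed)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))

Wmat = W(X, X)                               # entries W(x_i, x_j)
m_n = Wmat.mean(axis=1)                      # m_n(x_i) = (1/n) sum_j W(x_i, x_j) = D_ii
L = np.eye(n) - Wmat / (n * m_n[:, None])    # row-normalized ("random walk") Laplacian

evals = np.linalg.eigvals(L)
print(np.max(np.abs(evals.imag)), np.min(evals.real))   # both approximately 0
```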
69 We define the restriction operator $R_n : H \to \mathbb{C}^n$ and the extension operator $E_n : \mathbb{C}^n \to H$ as $R_n f = (f(x_1), \dots$ [sent-443, score-0.487]
70 Clearly the extension operator is no longer the adjoint of $R_n$, but the connection of L to $L_n$ and $A_n$ can still be clarified by means of $R_n$ and $E_n$. [sent-453, score-0.547]
71 Similarly, infinite sample restriction and extension operators can be defined to relate the operators L, $A_H$ and $L_H$. [sent-455, score-0.551]
72 The operator AH is Hilbert-Schmidt, the operators L and LH are bounded and have positive eigenvalues. [sent-459, score-0.538]
73 If $\sigma \ne 1$ is an eigenvalue and u, v are associated eigenfunctions of L and $L_H$ respectively, then $u(x) = v(x)$ for almost all $x \in X$, and $v(x) = \frac{1}{1-\sigma} \int_X \frac{W(x,t)}{m(x)}\, u(t)\, d\rho(t)$ for all $x \in X$. [sent-464, score-0.3]
74 …, $u_m$ is a linearly independent family of eigenfunctions of L with eigenvalues $\sigma_1, \dots$ [sent-473, score-0.375]
75 …, $v_m$ is a linearly independent family of eigenfunctions of $L_H$ with eigenvalues $\sigma_1, \dots$ [sent-479, score-0.375]
76 Finally, we stress that in item 4 both series converge in the strong operator topology; however, though $\sum_{i\ge1} P_i = I - P_0$, it is not true that $\sum_{i\ge1} Q_i$ converges to $I - Q_0$, where $Q_0$ is the spectral projection of $L_H$ associated with the eigenvalue 1. [sent-483, score-0.583]
77 The operator An is Hilbert-Schmidt, the matrix L and the operator Ln have positive eigenvalues. [sent-488, score-0.462]
78 The spectra of L and $L_n$ are the same up to the eigenvalue 1; moreover, if $\hat\sigma \ne 1$ is an eigenvalue and $\hat u$, $\hat v$ the corresponding eigenvector and eigenfunction of L and $L_n$, then $\hat u = (\hat v(x_1), \dots$ [sent-492, score-0.345]
79 $\dots, \hat v(x_n))$ and $\hat v(x) = \frac{1}{1-\hat\sigma}\, \frac{1}{n} \sum_{i=1}^n \frac{W(x, x_i)}{m_n(x)}\, \hat u^i$, where $\hat u^i$ is the i-th component of the eigenvector $\hat u$. [sent-495, score-0.269]
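Continuing the earlier Laplacian sketch, the out-of-sample formula above can be checked numerically: it reproduces the eigenvector entries at the sample points and extends them to arbitrary x. The reconstruction of the formula from the garbled display is this editor's, so treat it as an assumption.

```python
# Out-of-sample extension of a graph Laplacian eigenvector:
#   v_hat(x) = 1/(1 - sigma_hat) * (1/n) * sum_i W(x, x_i) u_hat[i] / m_n(x),
# with m_n(x) = (1/n) sum_i W(x, x_i); at the samples, v_hat(x_j) = u_hat[j].
import numpy as np

def W(a, b, s=0.5):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(7)
n = 150
X = rng.uniform(-1, 1, size=(n, 2))
Wmat = W(X, X)
m_n = Wmat.mean(axis=1)
L = np.eye(n) - Wmat / (n * m_n[:, None])

evals, evecs = np.linalg.eig(L)
idx = np.argsort(evals.real)[1]              # a nontrivial eigenvalue (not 0, away from 1)
sigma_hat, u_hat = evals[idx].real, evecs[:, idx].real

def v_hat(x):
    w = W(np.atleast_2d(x), X)[0]            # (W(x, x_1), ..., W(x, x_n))
    return (w @ u_hat / n) / (w.mean() * (1 - sigma_hat))

assert np.allclose([v_hat(x) for x in X], u_hat, atol=1e-8)
```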
80 Given $x \in X$, clearly $W_x \in C_b^{d+1}(X)$ and, by standard results of differential calculus, both m and $m_n \in C_b^{d+1}(X)$, with $\|W_x\|_{C_b^{d+1}(X)}$, $\|m\|_{C_b^{d+1}(X)}$ and $\|m_n\|_{C_b^{d+1}(X)}$ bounded; the Leibniz rule for the quotient with bound (11) gives that $\frac{1}{m}, \frac{1}{m_n} \in C_b^{d+1}(X)$ with $\|\frac{1}{m}\|_{C_b^{d+1}(X)}$ and $\|\frac{1}{m_n}\|_{C_b^{d+1}(X)} \le C_1$. [sent-532, score-0.76]
81 … is a consequence of the above result with $g = m,\ m_n,\ \frac{1}{m},\ \frac{1}{m_n}$, and $m - m_n$, respectively, satisfying $\|g\|_{H^{d+1}} \le C_4 \max\{C_1, C_2\} = C_5$. [sent-543, score-0.57]
82 The next lemma shows that the integral operator with kernel W and its empirical counterpart are Hilbert-Schmidt operators and bounds their difference. [sent-545, score-0.606]
83 Proof Item 2 of Lemma 16 ensures that M and $M_n$ are bounded operators on $H^s$ with $\|M - M_n\|_{H^s,H^s} \le C_1 \|m - m_n\|_{H^{d+1}}$. [sent-556, score-0.497]
84 In the next section we discuss the implications of the above result in terms of concentration of eigenvalues and spectral projections. [sent-568, score-0.374]
85 3 Bounds on Eigenvalues and Spectral Projections Since the operators are no longer self-adjoint the perturbation results in Section 3. [sent-570, score-0.338]
86 Let B be another compact operator such that $\|B - A\| \le \frac{\delta^2}{\delta + \ell(\Gamma)/2\pi} < \delta$, where $\ell(\Gamma)$ is the length of $\Gamma$; then the following facts hold true. [sent-578, score-0.278]
87 Proposition 14 ensures that $\hat v$ is an eigenfunction of $A_n$ with eigenvalue $1 - \hat\sigma$, so that $\hat Q \hat v = \hat v$, where $\hat Q$ is the spectral projection of $A_n$ associated with $\hat\Lambda$. [sent-617, score-0.383]
88 The last inequality ensures that the largest eigenvalue of $A_n$ is smaller than $1 + \|A_H\|$, so that by Theorem 20, the curve $\Gamma$ encloses the first D eigenvalues of $A_n$, where D is equal to the sum of the multiplicities of the first N eigenvalues of $A_H$. [sent-632, score-0.432]
89 Both $I_W^* I_K$ and $I_K^* I_W$ are Hilbert-Schmidt operators since they are compositions of a bounded operator and a Hilbert-Schmidt operator. [sent-658, score-0.538]
90 Since W is real valued, it follows that we can always choose the eigenfunctions of L as real valued, and, as a consequence, the eigenfunctions of LH . [sent-664, score-0.398]
91 Finally, we prove that both L and $L_H$ admit a decomposition in terms of spectral projections; we stress that the spectral projection of L is orthogonal in $L^2(X, \rho_W)$, but not in $L^2(X, \rho)$. [sent-665, score-0.348]
92 Since L is a positive operator on $L^2(X, \rho_W)$, it is self-adjoint, as is $I_K^* I_W$; hence the spectral theorem gives $I_K^* I_W = \sum_{j\ge1} (1 - \sigma_j) P_j$, where for all j, $P_j : L^2(X, \rho_W) \to L^2(X, \rho_W)$ is the spectral projection of $I_K^* I_W$ associated to the eigenvalue $1 - \sigma_j \ne 0$. [sent-666, score-0.712]
93 Moreover, note that $P_j$ is also the spectral projection of L associated to the eigenvalue $\sigma_j \ne 1$. [sent-667, score-0.3]
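The finite-sample analogue of this remark can be seen directly: the eigenvectors of the asymmetric empirical Laplacian are not orthogonal in the standard inner product, but they are orthogonal in the degree-weighted inner product, which plays the role of $L^2(X, \rho_W)$ here. The identification of the weighted inner product with diag(m_n) is an assumption of this added illustration.

```python
# Eigenvectors of L = I - (1/n) D^{-1} W for the leading (well separated) eigenvalues:
# their Gram matrix is far from diagonal in the Euclidean inner product, but essentially
# diagonal in the m_n-weighted inner product <f, g> = sum_i m_n(x_i) f_i g_i.
import numpy as np

rng = np.random.default_rng(8)
n = 100
X = rng.uniform(-1, 1, size=(n, 2))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Wmat = np.exp(-d2 / 0.5)
m_n = Wmat.mean(axis=1)
L = np.eye(n) - Wmat / (n * m_n[:, None])

evals, evecs = np.linalg.eig(L)
idx = np.argsort(evals.real)[:5]             # five smallest eigenvalues of L (well separated)
V = evecs[:, idx].real

offdiag = lambda G: np.abs(G - np.diag(np.diag(G))).max()
print(offdiag(V.T @ V))                      # clearly nonzero
print(offdiag(V.T @ np.diag(m_n) @ V))       # approximately zero
```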
94 Let S and T be two bounded operators acting on H and define C = I − ST. [sent-680, score-0.597]
95 Let $\Gamma$ be a compact subset of the resolvent set of A and define $\delta^{-1} = \sup_{\lambda\in\Gamma} \|(\lambda - A)^{-1}\|$, which is finite since $\Gamma$ is compact and the resolvent operator $(\lambda - A)^{-1}$ is norm continuous (see, for example, Anselone, 1971). [sent-684, score-0.55]
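The quantity $\delta$ can be estimated numerically for a small matrix by sampling the contour; the sketch below (an added illustration) uses a circle separating the top eigenvalue of a symmetric matrix from the rest of the spectrum, in which case $\delta$ equals the distance from $\Gamma$ to the spectrum.

```python
# Estimate delta^{-1} = sup_{lambda in Gamma} ||(lambda - A)^{-1}|| by sampling a circular
# contour Gamma around the top eigenvalue of a symmetric matrix A.  For a normal operator
# the sup equals 1 / dist(Gamma, spec(A)).
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((8, 8))
A = (M + M.T) / 2
lam = np.sort(np.linalg.eigvalsh(A))[::-1]

center = lam[0]
radius = (lam[0] - lam[1]) / 2                     # circle separating lam[0] from the rest
thetas = np.linspace(0, 2 * np.pi, 400, endpoint=False)
res_norm = max(
    np.linalg.norm(np.linalg.inv((center + radius * np.exp(1j * t)) * np.eye(8) - A), 2)
    for t in thetas
)
delta = 1.0 / res_norm
print(delta, (lam[0] - lam[1]) / 2)                # approximately equal
```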
96 If B − A is a Hilbert-Schmidt operator, we can replace the operator norm with the Hilbert-Schmidt norm, and the corresponding inequality is a consequence of the fact that the Hilbert-Schmidt operators form an ideal. [sent-696, score-0.519]
97 However, if A is symmetric, then for all $i \ge 1$, $\lambda_i \in \mathbb{R}$, $P_{\lambda_i}$ is an orthogonal projection and $D_{\lambda_i} = 0$, and it holds that $A = \sum_{i\ge1} \lambda_i P_{\lambda_i}$, where the series converges in operator norm. [sent-707, score-0.281]
98 If A is a compact operator with $\sigma(A) \subset [0, \|A\|]$, we introduce the following notation. [sent-709, score-0.278]
99 We denote by $P_N^A$ the spectral projection onto all the generalized eigenvectors corresponding to the eigenvalues $\lambda_1, \dots$ [sent-714, score-0.424]
100 Learning theory estimates via integral operators and their approximations. [sent-906, score-0.317]
wordName wordTfidf (topN-words)
[('iw', 0.314), ('ah', 0.277), ('operators', 0.263), ('operator', 0.231), ('hs', 0.219), ('eigenfunctions', 0.199), ('mn', 0.19), ('rosasco', 0.185), ('eigenvalues', 0.176), ('lk', 0.173), ('lh', 0.156), ('ik', 0.155), ('pj', 0.153), ('cb', 0.151), ('spectral', 0.149), ('elkin', 0.139), ('perators', 0.129), ('wx', 0.129), ('tn', 0.128), ('ito', 0.118), ('th', 0.114), ('vito', 0.111), ('ntegral', 0.111), ('ker', 0.111), ('sobolev', 0.102), ('eigenvalue', 0.101), ('kx', 0.1), ('hilbert', 0.096), ('kxi', 0.092), ('eigenfunction', 0.083), ('resolvent', 0.083), ('zwald', 0.079), ('perturbation', 0.075), ('pker', 0.074), ('proposition', 0.072), ('nystr', 0.071), ('laplacian', 0.068), ('wxi', 0.065), ('cn', 0.064), ('kernel', 0.058), ('multiplicity', 0.055), ('integral', 0.054), ('rkhs', 0.053), ('reproducing', 0.053), ('item', 0.052), ('projections', 0.052), ('projection', 0.05), ('gin', 0.05), ('concentration', 0.049), ('eigenvectors', 0.049), ('koltchinskii', 0.047), ('compact', 0.047), ('burenkov', 0.046), ('kato', 0.046), ('smale', 0.046), ('earning', 0.044), ('bounded', 0.044), ('enumeration', 0.044), ('belkin', 0.044), ('pn', 0.043), ('orthonormal', 0.043), ('hein', 0.039), ('banach', 0.038), ('luxburg', 0.038), ('spectrum', 0.038), ('valued', 0.037), ('ptn', 0.037), ('recti', 0.037), ('blanchard', 0.035), ('ln', 0.034), ('von', 0.033), ('spectra', 0.033), ('theorem', 0.032), ('tr', 0.032), ('norm', 0.032), ('pth', 0.032), ('dence', 0.031), ('asymmetric', 0.03), ('nonzero', 0.029), ('vi', 0.029), ('qv', 0.028), ('rh', 0.028), ('lafon', 0.028), ('adjoint', 0.028), ('graph', 0.028), ('rn', 0.028), ('anselone', 0.028), ('eigenprojections', 0.028), ('enumerations', 0.028), ('genova', 0.028), ('sup', 0.027), ('spaces', 0.027), ('eigenvector', 0.027), ('ui', 0.026), ('vj', 0.026), ('ae', 0.026), ('proof', 0.025), ('extension', 0.025), ('weight', 0.025), ('inequality', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
2 0.19563422 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
3 0.16233987 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
4 0.086302079 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
5 0.080057576 60 jmlr-2010-Learnability, Stability and Uniform Convergence
Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan
Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization
6 0.077553734 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
7 0.077159248 84 jmlr-2010-On Spectral Learning
8 0.074388973 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
9 0.070746101 44 jmlr-2010-Graph Kernels
10 0.059751336 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
11 0.057349816 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
12 0.047298837 69 jmlr-2010-Lp-Nested Symmetric Distributions
13 0.04699862 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
14 0.046704762 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
15 0.044336993 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
16 0.044148665 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
17 0.042569026 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
18 0.038016584 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
19 0.033862036 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
20 0.033637047 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
topicId topicWeight
[(0, -0.177), (1, -0.107), (2, 0.08), (3, -0.027), (4, -0.242), (5, -0.056), (6, -0.278), (7, 0.058), (8, 0.158), (9, -0.003), (10, -0.215), (11, 0.1), (12, -0.223), (13, 0.047), (14, 0.113), (15, 0.07), (16, 0.048), (17, 0.083), (18, -0.079), (19, 0.072), (20, 0.026), (21, 0.137), (22, 0.038), (23, 0.037), (24, 0.009), (25, 0.093), (26, 0.08), (27, -0.007), (28, -0.015), (29, -0.094), (30, -0.106), (31, 0.048), (32, 0.102), (33, -0.063), (34, -0.061), (35, -0.011), (36, -0.035), (37, 0.068), (38, -0.107), (39, -0.032), (40, 0.022), (41, 0.196), (42, -0.012), (43, -0.116), (44, 0.084), (45, 0.046), (46, 0.077), (47, -0.037), (48, 0.079), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.95699447 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
2 0.69496769 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
3 0.47850004 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
4 0.36045536 60 jmlr-2010-Learnability, Stability and Uniform Convergence
Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan
Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization
5 0.33179912 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
6 0.32090315 72 jmlr-2010-Matrix Completion from Noisy Entries
7 0.29902709 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
8 0.29501921 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
9 0.28715566 94 jmlr-2010-Quadratic Programming Feature Selection
10 0.26411369 84 jmlr-2010-On Spectral Learning
11 0.22393213 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
12 0.21629399 69 jmlr-2010-Lp-Nested Symmetric Distributions
13 0.21275087 44 jmlr-2010-Graph Kernels
14 0.1877351 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
15 0.1861905 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
16 0.17810269 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
17 0.17617358 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
18 0.172029 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
19 0.16761872 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
20 0.16313264 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
topicId topicWeight
[(3, 0.079), (4, 0.04), (8, 0.016), (15, 0.014), (21, 0.011), (32, 0.044), (36, 0.017), (37, 0.055), (55, 0.419), (75, 0.153), (81, 0.018), (85, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.68035996 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
2 0.4018853 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
3 0.3985779 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
4 0.3929491 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization
5 0.38463736 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
6 0.38203028 69 jmlr-2010-Lp-Nested Symmetric Distributions
7 0.38187239 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
8 0.38089779 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
9 0.37380025 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
10 0.37315828 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
11 0.373083 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
12 0.37266737 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.37258485 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
14 0.37164342 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.37119612 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
16 0.37072495 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
17 0.36942029 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
18 0.3687669 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
19 0.36607355 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
20 0.3656849 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning