jmlr jmlr2007 jmlr2007-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. [sent-8, score-0.316]
2 We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. [sent-9, score-0.33]
3 This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. [sent-10, score-0.138]
4 Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. [sent-11, score-0.214]
5 It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. [sent-12, score-0.443]
6 Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. [sent-13, score-0.523]
7 Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number. [sent-14, score-0.146]
8 The target function $f_\rho^V$ that we want to learn or approximate is a minimizer (which may not be unique) of the error functional $\mathcal{E}(f) = \int_Z V(y, f(x))\,d\rho$ induced by a loss function $V: Y \times Y \to \mathbb{R}_+$, that is, $f_\rho^V = \arg\min\{\mathcal{E}(f) : f \text{ is a measurable function from } X \text{ to } Y\}$. [sent-18, score-0.199]
9 Then the error can be written as $\mathcal{E}(f) = \int_X \int_Y V(y, f(x))\,d\rho(y|x)\,d\rho_X(x)$, and we choose $f_\rho^V$ to be a minimizer of the pointwise error: for almost every $x \in X$, $f_\rho^V(x) = \arg\min_{t \in Y} \int_Y V(y,t)\,d\rho(y|x)$. [sent-20, score-0.157]
10 This leads to the natural expectation that a minimizer $f_z$ of $\mathcal{E}_z$ over a set of functions $\mathcal{G}$, called the hypothesis space, would approximate a minimizer $f_{\mathcal{G}}$ of $\mathcal{E}$ in $\mathcal{G}$ (whose error is close to that of $f_\rho^V$ when $\mathcal{G}$ is large): $\mathcal{E}(f_z) \to \mathcal{E}(f_{\mathcal{G}})$ as $m \to \infty$. [sent-24, score-0.196]
11 For the hinge loss $\phi(yt) = \max\{1 - yt, 0\}$ used in the support vector machine for classification (Cortes and Vapnik, 1995), the target function $f_\rho^V(x) = f_c(x)$ is the Bayes rule, defined as $f_c(x) = 1$ if $\rho(y=1|x) \geq \rho(y=-1|x)$ and $f_c(x) = -1$ otherwise. [sent-30, score-0.148]
12 When $Y$ is a closed interval in $\mathbb{R}$, the uniform convergence of a real-valued function space $\mathcal{G}$ can be characterized by the property: for every $\varepsilon > 0$, there holds $\lim_{\ell \to +\infty} \sup_{\mu} \Pr\big\{ \sup_{m \geq \ell} \sup_{f \in \mathcal{G}} \big| \frac{1}{m}\sum_{i=1}^m f(x_i) - \int_X f(x)\,d\mu \big| > \varepsilon \big\} = 0$. [sent-36, score-0.717]
13 The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x,\cdot) : x \in X\}$ with the reproducing property (Aronszajn, 1950): $f(x) = \langle f, K_x \rangle_K$ for all $x \in X$, $f \in \mathcal{H}_K$. [sent-48, score-0.257]
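A minimal numerical sketch (not from the paper) of working in the span of Gaussian kernel sections: a function $f = \sum_j c_j K_{x_j}$ is evaluated through the reproducing property, $f(x) = \sum_j c_j K(x_j, x)$, and its RKHS norm is $\|f\|_K^2 = c^\top K c$. The sample points, coefficients, and variance parameter below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # K^sigma(x, y) = exp(-|x - y|^2 / sigma^2) for all pairs of rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))  # points x_1, ..., x_m (assumed)
c = rng.normal(size=20)                   # coefficients of f = sum_j c_j K_{x_j}
sigma = 0.5                               # illustrative variance parameter

K = gaussian_kernel(X, X, sigma)
rkhs_norm_sq = c @ K @ c                  # ||f||_K^2 = c^T K c

x_new = np.array([[0.3, -0.2]])
f_at_x = c @ gaussian_kernel(X, x_new, sigma)[:, 0]  # f(x) = <f, K_x>_K = sum_j c_j K(x_j, x)
print(rkhs_norm_sq, f_at_x)
```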
14 , 2000) associated with the kernel $K$, loss $V$ and a regularization parameter $\lambda > 0$, defined as $f_{z,\lambda} = \arg\min_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) + \lambda \|f\|_K^2$. [sent-52, score-0.229]
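For the least square loss this single-kernel scheme has a familiar closed form: by the representer theorem $f_{z,\lambda} = \sum_i c_i K_{x_i}$ with $(K + \lambda m I)c = y$, where $K$ is the kernel matrix on the sample. A hedged sketch of that solver (standard kernel ridge regression, not code from the paper):

```python
import numpy as np

def regularized_least_squares(K, y, lam):
    # Minimize (1/m) * sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2 over f in H_K.
    # Representer theorem: f = sum_i c_i K_{x_i}; the optimality condition gives
    # (K + lam * m * I) c = y.
    m = len(y)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    fitted = K @ c
    objective = np.mean((y - fitted) ** 2) + lam * c @ K @ c
    return c, objective
```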
15 In particular, for the least square loss or the hinge loss, it is a convex quadratic programming optimization problem which can handle large data settings. [sent-56, score-0.155]
16 These schemes involve a set of Mercer kernels $\{K^\sigma\}_{\sigma \in \Sigma}$ with an index set $\Sigma$ and, with regularization parameter $\lambda > 0$, take the form $f_{z,\lambda} := \arg\min_{\sigma \in \Sigma} \min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) + \lambda \|f\|_{K^\sigma}^2$ (4), where the loss function $V(y,\cdot)$ is usually convex for any $y \in Y$. [sent-66, score-0.345]
17 This leads to the following hypothesis space. Definition 1 The normalized hypothesis set $\mathcal{H}$ associated with the Mercer kernels $\{K^\sigma\}_{\sigma \in \Sigma}$ is defined as $\mathcal{H} = \bigcup_{\sigma \in \Sigma}\{f \in \mathcal{H}_{K^\sigma} : \|f\|_{K^\sigma} \leq 1\}$. [sent-76, score-0.177]
18 Theorem 2 Let $\{K^\sigma\}_{\sigma \in \Sigma}$ be a set of Mercer kernels on $X$ with $\kappa := \sup_{\sigma \in \Sigma}\sup_{x \in X}\sqrt{K^\sigma(x,x)} < \infty$. [sent-82, score-0.541]
19 The reproducing property (3) tells us that $\|K_x\|_{K^\sigma} = \sqrt{K^\sigma(x,x)} \leq \kappa$ for any $x \in X$ and $\sigma \in \Sigma$. [sent-85, score-0.182]
20 We shall use this idea to establish the learnability of Gaussian kernels with flexible variances. [sent-88, score-0.138]
21 2. uGC Property for Gaussians with Flexible Variances. The second purpose of this paper is to verify the learnability of Gaussian kernels with flexible variances, stated in the form of the following uGC property. [sent-90, score-0.252]
22 For the neighborhood $[0,\varepsilon)^n$ of $0$ with $\varepsilon > 0$, we see that $\sup_{\sigma \in (0,\infty)} \sup_{t \in [0,\varepsilon)^n} |f_\sigma(t) - f_\sigma(0)| = \sup_{\sigma \in (0,\infty)} \big|\exp\{-n\varepsilon^2/\sigma^2\} - 1\big| = 1$, which does not tend to $0$ as $\varepsilon \to 0$. [sent-115, score-0.660]
23 In the next section we show the implications of the above main theorems to the error analysis for the regularization scheme (4). [sent-119, score-0.172]
24 In particular, we present error rates of two examples involving Gaussian kernels with flexible variances. [sent-120, score-0.181]
25 In Sections 5 and 6 we develop error bounds for the multi-kernel regularization scheme (4), especially for Gaussian kernels with flexible variances. [sent-122, score-0.273]
26 We postpone the derivation of error rates for regression with least square loss and classification with hinge loss to the end of this paper. [sent-124, score-0.277]
27 In particular, we expect to derive explicit rates for multi-kernel regularized learning algorithms associated with Gaussian kernels with flexible variances. [sent-128, score-0.183]
28 The last term, called the regularization error, is independent of the samples (Niyogi and Girosi, 1996; Cucker and Smale, 2001; Smale and Zhou, 2003) and measures the approximation ability of the multi-kernel space $\bigcup_{\sigma \in \Sigma}\mathcal{H}_{K^\sigma}$. [sent-134, score-0.131]
29 Definition 5 The regularization error associated with the regularizing function $f_\lambda \in \mathcal{H}_{K^\sigma}$ with $\sigma \in \Sigma$ is defined as $\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho^V) + \lambda \|f_\lambda\|_{K^\sigma}^2$. [sent-135, score-0.131]
30 The regularization error of the system (4) is $\mathcal{D}(\lambda) = \min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}}\{\mathcal{E}(f) - \mathcal{E}(f_\rho^V) + \lambda \|f\|_{K^\sigma}^2\}$ (11), where $f_\lambda$ takes the special form $f_\lambda = f_\lambda^V := \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}}\{\mathcal{E}(f) + \lambda \|f\|_{K^\sigma}^2\}$. [sent-136, score-0.278]
31 By the error decomposition (13), the learning rate depends on trading off the sample error and the regularization error. [sent-141, score-0.173]
32 The decay of the regularization error $\mathcal{D}(\lambda)$ relies on the regularity (smoothness) of the target function $f_\rho^V$ for most commonly used loss functions, which will be discussed in Section 6. [sent-142, score-0.206]
33 Let us give two examples both involving the Gaussian kernels (8) with flexible variances to illustrate the learning rates whose proofs will be given in Section 6. [sent-148, score-0.253]
34 The first example is regularized regression with the least square loss $V(y,t) = (y-t)^2$. [sent-149, score-0.141]
35 Then the multi-kernel least square regularized regression algorithm (4) associated with the Gaussian kernels (8) can be written as $f_{z,\lambda} = \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_{K^\sigma}^2$. [sent-151, score-0.287]
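A hedged sketch of the multi-kernel version: the continuous index set $\Sigma$ is replaced by a finite grid of variances, the inner regularized least squares problem is solved in closed form for each $\sigma$, and the $\sigma$ achieving the smallest regularized empirical objective is kept. The grid, noise level, and data below are assumptions for illustration only, not the paper's experiments.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def multi_kernel_least_squares(X, y, lam, sigmas):
    # Scheme (4) with square loss: for each sigma in a finite grid standing in for Sigma,
    # solve the kernel regularized least squares problem and keep the best objective.
    m = len(y)
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, sigma)
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        obj = np.mean((y - K @ c) ** 2) + lam * c @ K @ c
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best  # (objective value, selected sigma, coefficients)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=50)
obj, sigma_star, c = multi_kernel_least_squares(X, y, lam=1e-2, sigmas=np.logspace(-1, 1, 20))
```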
36 The second example is regularized classification with the hinge loss $V(y,t) = (1 - yt)_+ := \max\{1 - yt, 0\}$. [sent-170, score-0.144]
37 The multi-kernel SVM regularized classification algorithm (4) associated with the Gaussian kernels (8) is defined to be a minimizer of the optimization problem $f_{z,\lambda} = \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (1 - y_i f(x_i))_+ + \lambda \|f\|_{K^\sigma}^2$. [sent-173, score-0.349]
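A hedged sketch for the hinge loss: by the representer theorem the inner problem reduces to minimizing $\frac{1}{m}\sum_i (1 - y_i (Kc)_i)_+ + \lambda c^\top K c$ over $c \in \mathbb{R}^m$, a convex but non-smooth problem. Plain subgradient descent is used below purely for illustration (the step sizes and the finite grid standing in for $\Sigma$ are assumptions); any convex solver for this objective would do as well.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def fit_hinge(K, y, lam, steps=2000):
    # Subgradient descent on J(c) = (1/m) * sum_i (1 - y_i (Kc)_i)_+ + lam * c^T K c.
    m = len(y)
    c = np.zeros(m)
    for t in range(steps):
        margins = y * (K @ c)
        active = (margins < 1.0).astype(float)        # where the hinge is active
        grad = -(K @ (active * y)) / m + 2.0 * lam * (K @ c)
        c -= 0.5 / np.sqrt(t + 1.0) * grad
    return c

def multi_kernel_svm(X, y, lam, sigmas):
    # Outer minimization of scheme (4) over a finite grid standing in for Sigma.
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, sigma)
        c = fit_hinge(K, y, lam)
        obj = np.mean(np.maximum(0.0, 1.0 - y * (K @ c))) + lam * c @ K @ c
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best
```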
38 The error analysis of classification algorithms often aims at understanding the approximating behaviors of the excess misclassification error R (sgn( fz,λ )) − R ( fc ) as the sample size m becomes large. [sent-178, score-0.132]
39 Our learning rate for the hinge loss assumes a separable condition which was introduced by Chen et al. [sent-179, score-0.139]
40 Definition 6 We say that $\rho$ is separable by $\mathcal{H}_\Sigma$ if there is some $f_{sp} \in \mathcal{H}_{K^\sigma}$ with some $\sigma \in \Sigma$ such that $y f_{sp}(x) > 0$ almost surely. [sent-181, score-0.142]
41 It has separation exponent $\theta \in (0,\infty]$ if we can choose $f_{sp}$ and positive constants $\Delta, c_\theta$ such that $\|f_{sp}\|_{K^\sigma} = 1$ and $\rho_X\{x \in X : |f_{sp}(x)| < \Delta t\} \leq c_\theta t^\theta$ for all $t > 0$. [sent-182, score-0.167]
42 , 1998; Vapnik, 1998) that $\rho$ is said to be strictly separable with margin $\gamma > 0$ if $\rho$ is separable together with the requirement $y f_{sp}(x) \geq \gamma$ almost everywhere. [sent-187, score-0.144]
43 It is observed in applications that allowing flexible variances improves the learnability of Gaussian kernels. [sent-193, score-0.151]
44 As we shall see in Section 6 for the least square loss, when the regression function fρ has Sobolev smoothness, the regularization error (11) associated with Gaussians with flexible variances decays as O (λs ) for some s > 0. [sent-194, score-0.334]
45 Such a decay is impossible for the regularization error associated with a single Gaussian kernel, at least when ρX is the Lebesgue measure on X, as shown by Smale and Zhou (2003). [sent-196, score-0.164]
46 This demonstrates that learning algorithms using Gaussian kernels with flexible variances have advantages for many applications. [sent-197, score-0.215]
47 Another way to obtain improved error rates is to take kernels changing with the sample size. [sent-198, score-0.181]
48 Kernels of this kind include polynomial kernels with changing degrees (Zhou and Jetter, 2006) and Gaussian kernels with changing variances (Steinwart, 2001; Steinwart and Scovel, 2005). [sent-199, score-0.316]
49 (a) $G_\sigma \in \mathcal{H}_\sigma$ and $\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \|G_\sigma\|_\sigma$. (b) $\|G_\sigma\|_\sigma^2 \leq 2\|G_\sigma\|_{C(X)}$ and $\|G_\sigma\|_{C(X)} \leq \kappa_\sigma \|G_\sigma\|_\sigma$. [sent-225, score-0.22]
50 This, in connection with (3) and property (b) of Proposition 7, tells us that for any $f \in \mathcal{H}_\sigma$, $E_m(f) - E(f) = \langle f, \frac{1}{m}\sum_{i=1}^m K^\sigma_{x_i} - F_\sigma \rangle_\sigma = \langle f, G_\sigma \rangle_\sigma$. Then $\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} \langle f, G_\sigma \rangle_\sigma = \|G_\sigma\|_\sigma$. [sent-228, score-0.625]
51 So we have $\|G_\sigma\|_\sigma^2 = \frac{1}{m}\sum_{i=1}^m \big(\frac{1}{m}\sum_{j=1}^m K^\sigma(x_i,x_j) - F_\sigma(x_i)\big) - \int_X \big(\frac{1}{m}\sum_{i=1}^m K^\sigma(x_i,y) - F_\sigma(y)\big)\,d\mu(y) = \frac{1}{m}\sum_{i=1}^m G_\sigma(x_i) - \int_X G_\sigma(y)\,d\mu(y) \leq 2\sup_{y \in X}|G_\sigma(y)| = 2\|G_\sigma\|_{C(X)}$. [sent-234, score-0.22]
52 By Part (a) of Lemma 8, $\sup_{f \in \mathcal{H}} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\|G_\sigma\|_\sigma$. [sent-241, score-0.22]
53 On the other hand, since $\{K^\sigma_y : \sigma \in \Sigma, y \in X\}$ is exactly the set $\mathcal{F}$ according to its definition (7), we see that $\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\sup_{y \in X} \big|\frac{1}{m}\sum_{i=1}^m K^\sigma(y, x_i) - F_\sigma(y)\big| = \sup_{\sigma \in \Sigma}\|G_\sigma\|_{C(X)}$. [sent-242, score-0.88]
54 This, in connection with Lemma 8, implies that for any $\varepsilon > 0$ and $\ell \in \mathbb{N}$, there holds $\{\sup_{m \geq \ell}\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| > \kappa\varepsilon\} \subseteq \{\sup_{m \geq \ell}\sup_{f \in \mathcal{H}} |E_m(f) - E(f)| > \varepsilon\} \subseteq \{\sup_{m \geq \ell}\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| > \varepsilon^2/2\}$ (the last threshold coming from $\|G_\sigma\|_\sigma^2 \leq 2\|G_\sigma\|_{C(X)}$). Therefore, $\mathcal{H}$ is uGC if and only if $\mathcal{F}$ is. [sent-243, score-0.66]
55 Gaussian Kernels Provide uGC Hypothesis Sets In this section we establish the uGC property of Gaussians with flexible variances stated in Theorem 3. [sent-247, score-0.141]
56 The metric entropy of $\mathcal{G}$ is defined as $H_{m,p}(\mathcal{G}, \eta) = \sup_{\mathbf{x} \in X^m} \log \mathcal{N}_p(\mathcal{G}, \mathbf{x}, \eta)$, for $m \in \mathbb{N}$, $\eta > 0$. [sent-253, score-0.22]
57 (b) For every $m \in \mathbb{N}$ and $0 < \varepsilon < 1$, there holds $\sup_{\mathbf{x} \in X^m} \mathcal{N}_\infty(\mathcal{G}, \mathbf{x}, \varepsilon) \leq 2\big(\frac{4m}{\varepsilon^2}\big)^{1 + d\log(2em/\varepsilon)}$, where $d := P_{\varepsilon/4}(\mathcal{G})$. [sent-268, score-0.22]
58 Applying Lemma 12 to $\mathcal{G} = \mathcal{F}_j$, a set of functions on $\mathbb{R}$, we find that for every $0 < \varepsilon < 1$, $P_{\varepsilon/4}(\mathcal{F}_j) \leq \frac{16}{\varepsilon} + 2$ and $\sup_{\mathbf{t} \in \mathbb{R}^m} \mathcal{N}_\infty(\mathcal{F}_j, \mathbf{t}, \varepsilon) \leq N_0 := 2\big(\frac{4m}{\varepsilon^2}\big)^{1 + (\frac{16}{\varepsilon}+2)\log(2em/\varepsilon)}$. [sent-301, score-0.22]
59 To this end, let $0 < \varepsilon < 1$ and $\mathbf{x} = (x_i)_{i=1}^m$, where each point $x_i \in \mathbb{R}^n$ can be expressed as a vector $x_i = (x_{i,1}, \ldots, x_{i,n})$. [sent-303, score-0.154]
60 Error Bound by Rademacher Averages. In order to bound the error $\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\rho^V)$, we know from the error decomposition (13) that it is sufficient to estimate the sample error and the regularization error. [sent-355, score-0.215]
61 The regularization error will be discussed in the next subsection. [sent-359, score-0.131]
62 The estimate of the sample error for the regularized multi-kernel scheme (4) involves the hypothesis space $\mathcal{H}_\lambda = \bigcup_{\sigma \in \Sigma}\{f \in \mathcal{H}_{K^\sigma} : \|f\|_{K^\sigma} \leq \sqrt{M/\lambda}\}$ (23). [sent-360, score-0.165]
63 For every integer $m$, let $R_m(F) := E_\mu E_\varepsilon\big[\sup_{f \in F}\big|\frac{1}{m}\sum_{i=1}^m \varepsilon_i f(z_i)\big|\big]$, where $\{z_i\}_{i=1}^m$ are independent random variables distributed according to $\mu$ and $\{\varepsilon_i\}_{i=1}^m$ are independent Rademacher random variables, that is, $P(\varepsilon_i = +1) = P(\varepsilon_i = -1) = 1/2$. [sent-364, score-0.22]
64 Definition 15 We say that the loss function $V: Y \times \mathbb{R} \to \mathbb{R}_+$ is admissible (with respect to $\rho$) if it is convex with respect to the second variable, $M = \|V(y,0)\|_{L^\infty_\rho(Z)} < +\infty$ and, for any $\lambda > 0$, there holds $C_\lambda = \sup\big\{\max(|V'_-(y,t)|, |V'_+(y,t)|) : y \in Y, |t| \leq \kappa\sqrt{M/\lambda}\big\} < +\infty$. [sent-368, score-0.298]
65 , m}, $\phi_i: \mathbb{R} \to \mathbb{R}$ is a function with $\phi_i(0) = 0$ having a Lipschitz constant $c_i$, then for any $\{x_i\}_{i=1}^m$, $E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m \varepsilon_i \phi_i(f(x_i))\big| \leq 2 E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m c_i \varepsilon_i f(x_i)\big|$. [sent-376, score-0.22]
66 f ∈F i=1 f ∈F i=1 The sample error analysis of multi-kernel regularization scheme (4) involves the Rademacher complexity of the fundamental space F denoted by Rm (F ) = EρX Eε m 1 sup | ∑ εi f (xi )| . [sent-377, score-0.392]
67 Lemma 17 Let $\mathcal{H}_\lambda$ be defined by (23); then $E \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)| \leq 4 C_\lambda \big(\frac{M R_m(\mathcal{F})}{\lambda}\big)^{1/2} + \frac{2M}{\sqrt{m}}$. [sent-378, score-0.22]
68 By symmetrization, $E_\rho E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)| \leq 2 E_\rho E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} \big|\frac{1}{m}\sum_{i=1}^m \varepsilon_i V(y_i, f(x_i))\big| = 2 R_m(\mathcal{V}_\lambda)$. To handle $R_m(\mathcal{V}_\lambda)$, we apply Lemma 16. [sent-385, score-0.44]
69 It follows that $E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} \big|\sum_{i=1}^m \varepsilon_i V(y_i, f(x_i))\big| \leq 2 C_\lambda \sqrt{M/\lambda}\, E_\varepsilon \sup_{f \in \mathcal{H}} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big| + M\sqrt{m}$. [sent-389, score-0.22]
70 But by the definition of the fundamental set $\mathcal{F}$, we know that $\sup_{\sigma \in \Sigma} \sum_{i,j=1}^m \varepsilon_i K^\sigma(x_i, x_j)\varepsilon_j \leq m \sup_{\sigma \in \Sigma}\sup_{s \in X} \big|\sum_{i=1}^m \varepsilon_i K^\sigma(x_i, s)\big| = m \sup_{f \in \mathcal{F}} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big|$. [sent-391, score-0.22]
71 (27) i=1 Combining the estimates (25), (26), and (27) implies that m Rm (Vλ ) ≤ 2Cλ M λ ≤ 2Cλ M λ EρX Eε 1 m Rm (F ) sup f ∈F | ∑ εi f (xi )| 1/2 i=1 + 1/2 M +√ m M √ . [sent-392, score-0.22]
72 The following error bound for the multi-kernel scheme (4) is a straightforward consequence of the sample error estimate and the error decomposition (13). [sent-394, score-0.167]
73 Together with the fact $E[\mathcal{E}_z(f_\lambda^V)] = \mathcal{E}(f_\lambda^V)$ and $f_{z,\lambda} \in \mathcal{H}_\lambda$ defined in (23), we know that $E\big[\mathcal{E}(f_{z,\lambda}) - \mathcal{E}_z(f_{z,\lambda}) + \mathcal{E}_z(f_\lambda^V) - \mathcal{E}(f_\lambda^V)\big] \leq E \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)|$, which in connection with Lemma 17 yields the desired estimate. [sent-399, score-0.22]
74 Then there exists an absolute constant $C$ such that, for any sample $\{x_i\}_{i=1}^m$, there holds $\frac{1}{\sqrt{m}} E_\varepsilon \sup_{f \in F} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big| \leq C \int_0^{+\infty} \sqrt{\log \mathcal{N}_2(F, \mathbf{x}, \varepsilon)}\, d\varepsilon$. [sent-402, score-0.22]
75 Applying the estimate in the proof of Theorem 3, we can compute the Rademacher average $R_m(\mathcal{F})$ for Gaussian kernels with flexible variances, which yields the subsequent error bound. [sent-403, score-0.22]
76 In order to get explicit error rates, we see from Theorems 18 and 20 that what is left is to estimate the regularization error. [sent-413, score-0.131]
77 2 Regularization Error with Gaussians In this subsection we exclusively focus on the multi-kernel regularization error (11) associated with the least square loss and Gaussian kernels. [sent-415, score-0.228]
78 We show how Fourier analysis (Stein, 1970) can be applied to obtain polynomial decay of the regularization error under a Sobolev smoothness condition on the regression function. [sent-416, score-0.164]
79 Before we go to the main point, it is worth briefly mentioning why the multi-kernel regularization error can improve the error rates. [sent-417, score-0.173]
80 To this end, note that, for the regularization error of a single Gaussian kernel, it was proved by Smale and Zhou (2003) that the polynomial decay O (λ s ) for some s > 0 is impossible under the Sobolev smoothness hypothesis on the regression function f ρ . [sent-418, score-0.202]
81 Below we show that the multi-kernel scheme (4) associated with the least square loss and Gaussian kernels (9) with flexible isotropic variances can give regularization errors D (λ) of polynomial decays O (λβ ) for some β > 0, under the assumption of Sobolev smoothness on f ρ . [sent-421, score-0.476]
82 To estimate the multi-kernel regularization error (11), we introduce some basic facts about RKHS (Aronszajn, 1950). [sent-423, score-0.131]
83 Proof We shall use notation and results on limits of reproducing kernels (see Theorem I in Section 9 of Aronszajn, 1950). [sent-438, score-0.175]
84 $\|f^*\|^2_{\mathcal{H}_{K^\sigma}(E_j)} = \lim_{\ell \to \infty} \langle f^*, f_{\sigma,j,\ell}|_{E_j} \rangle_{\mathcal{H}_{K^\sigma}(E_j)} \leq \|f^*\|_{\mathcal{H}_{K^\sigma}(E_j)} \liminf_{\ell \to \infty} \|f_{\sigma,j,\ell}|_{E_j}\|_{\mathcal{H}_{K^\sigma}(E_j)}$, which tells us that $\|f^*\|_{\mathcal{H}_{K^\sigma}(E_j)} \leq \liminf_{\ell \to \infty} \|f_{\sigma,j,\ell}|_{E_j}\|_{\mathcal{H}_{K^\sigma}(E_j)} \leq (\sqrt{\pi}\,\sigma)^{n/2} \|f\|_{L^2(\mathbb{R}^n)}$. [sent-451, score-0.332]
85 Consider the multi-kernel scheme (4) with the least square loss $V(y,t) = (y-t)^2$ and the Gaussian kernels (9) with $\Sigma = (0,\infty)$. [sent-456, score-0.239]
86 Finally, we are able to derive error rates for the multi-kernel regularization scheme (4) associated with Gaussian kernels (8) with flexible variances. [sent-489, score-0.311]
87 This is done by putting the improved regularization error bound in Proposition 22 into the total error bound in Theorem 20. [sent-490, score-0.173]
88 We demonstrate the approach for regression with least square loss and for classification with hinge loss. [sent-491, score-0.155]
89 If $|y| \leq M_0$ almost surely, the least square loss $V(y,s) = (y-s)^2$ is admissible with respect to $\rho$ and $M = \|V(y,0)\|_{L^\infty_\rho(Z)} \leq M_0^2$, $C_\lambda \leq 2M_0(1 + \kappa/\sqrt{\lambda})$. [sent-495, score-0.133]
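A short worked check of that constant, under the stated assumption $|y| \leq M_0$ (a sketch of the verification, not quoted from the paper):

```latex
% One-sided derivatives of the square loss: V'(y,t) = 2(t - y), evaluated on |t| <= kappa*sqrt(M/lambda)
\[
  C_\lambda \;=\; \sup\Big\{\, 2|y - t| \;:\; y \in Y,\ |t| \le \kappa\sqrt{M/\lambda} \,\Big\}
  \;\le\; 2\Big(M_0 + \kappa\sqrt{M}/\sqrt{\lambda}\Big)
  \;\le\; 2M_0\big(1 + \kappa/\sqrt{\lambda}\big),
\]
% using sqrt(M) <= M_0, since M = ||V(y,0)||_infty = ||y^2||_infty <= M_0^2.
```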
90 Since the set (8) of Gaussian kernels with general variances $\sigma = (\sigma_1, \ldots, \sigma_n) \in (0,\infty)^n$ contains the set (9) of Gaussian kernels with isotropic variances, [sent-502, score-0.06]
91 we see that $\mathcal{D}(\lambda) \leq \inf_{\sigma \in (0,\infty)} \inf_{f \in \mathcal{H}_{K^\sigma}} \big\{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_{K^\sigma}^2\big\}$. [sent-505, score-0.181]
92 Now we move on to establish Example 2 for the regularized classification with the hinge loss V (y, s) = (1 − ys)+ . [sent-511, score-0.144]
93 An important relation between the excess misclassification error and the excess error was given by Zhang (2004) as $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq \mathcal{E}(f) - \mathcal{E}(f_c)$ for all $f: X \to \mathbb{R}$. [sent-514, score-0.18]
94 (1997) as $\sup_{\mathbf{x} \in X^m} \mathcal{N}_\infty(\mathcal{G}, \mathbf{x}, \varepsilon) \leq 2\big(m(\tfrac{2}{\varepsilon})^2\big)^{\log(y+1)}$, where $y = \sum_{i=1}^{d_0} \binom{m}{i}\big(\tfrac{2}{\varepsilon}\big)^i \leq \big(\tfrac{2}{\varepsilon}\big)^{d_0}\big(\tfrac{em}{d_0}\big)^{d_0} \leq \big(\tfrac{2em}{\varepsilon}\big)^{d_0} \leq \big(\tfrac{2em}{\varepsilon}\big)^{d}$. [sent-545, score-0.267]
95 It tells us the following result: if $T$ is a bounded subset of $\mathbb{R}^m$, each function $\phi_i$ with Lipschitz constant not more than $1$ satisfies $\phi_i(0) = 0$, and a function $G: \mathbb{R}_+ \to \mathbb{R}_+$ is convex and nondecreasing, then there holds $E_\varepsilon G\big(\frac{1}{2}\sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i \phi_i(t_i)\big|\big) \leq E_\varepsilon G\big(\sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i t_i\big|\big)$ (44). [sent-551, score-0.22]
96 Applying (44) with $G(u) = u$ and $\bar{\phi}_i(x) = \phi_i(x/c_i)$, we have $E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m \varepsilon_i \phi_i(f(x_i))\big| = E_\varepsilon \sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i \bar{\phi}_i(t_i)\big| \leq 2 E_\varepsilon \sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i t_i\big| = 2 E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m c_i \varepsilon_i f(x_i)\big|$, which proves our second statement. [sent-554, score-0.22]
97 In this appendix, we verify the existence of a minimizer for the scheme (4) involving the least-square loss and Gaussians (9) with flexible variances, under some mild conditions on the sample z. [sent-559, score-0.236]
98 Consider the scheme with $0 < b < \infty$: $f_{z,\lambda} := \arg\min_{\sigma \in (0,b]} \min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K^\sigma}^2$. [sent-567, score-0.187]
99 The above existence result largely depends on the least square loss and the assumption on the data. [sent-606, score-0.136]
100 It remains an open problem on how to prove the existence of the minimizer of the multi-kernel scheme (4) associated with general loss functions and data. [sent-607, score-0.18]
wordName wordTfidf (topN-words)
[('fz', 0.409), ('hk', 0.406), ('ez', 0.248), ('ugc', 0.234), ('sup', 0.22), ('rn', 0.217), ('hou', 0.185), ('aussians', 0.171), ('earnability', 0.171), ('fe', 0.167), ('lk', 0.136), ('rademacher', 0.119), ('variances', 0.114), ('kernels', 0.101), ('exible', 0.095), ('kx', 0.093), ('smale', 0.091), ('regularization', 0.089), ('alon', 0.082), ('tells', 0.081), ('ej', 0.078), ('xi', 0.077), ('ing', 0.077), ('gaussians', 0.074), ('zhou', 0.074), ('reproducing', 0.074), ('ky', 0.067), ('ying', 0.067), ('fsp', 0.066), ('rm', 0.065), ('borel', 0.064), ('lemma', 0.064), ('proposition', 0.06), ('kxi', 0.06), ('yi', 0.059), ('hinge', 0.058), ('minimizer', 0.058), ('im', 0.057), ('mercer', 0.057), ('covering', 0.057), ('lim', 0.057), ('yt', 0.056), ('square', 0.055), ('sgn', 0.053), ('sobolev', 0.053), ('gaussian', 0.051), ('scovel', 0.05), ('fc', 0.048), ('em', 0.047), ('cucker', 0.046), ('rkhs', 0.046), ('dt', 0.045), ('ledoux', 0.045), ('regularized', 0.044), ('steinwart', 0.043), ('error', 0.042), ('theorem', 0.042), ('loss', 0.042), ('kernel', 0.041), ('scheme', 0.041), ('kong', 0.04), ('shattered', 0.04), ('inf', 0.04), ('lebesgue', 0.04), ('supm', 0.04), ('separable', 0.039), ('lipschitz', 0.039), ('existence', 0.039), ('rates', 0.038), ('hypothesis', 0.038), ('dudley', 0.038), ('closure', 0.037), ('learnability', 0.037), ('compact', 0.037), ('sp', 0.037), ('koltchinskii', 0.037), ('admissible', 0.036), ('decays', 0.034), ('hong', 0.034), ('fourier', 0.034), ('aronszajn', 0.034), ('yiming', 0.033), ('mj', 0.033), ('wu', 0.033), ('decay', 0.033), ('statement', 0.032), ('bartlett', 0.031), ('min', 0.03), ('evgeniou', 0.03), ('panchenko', 0.03), ('stein', 0.03), ('dx', 0.027), ('talagrand', 0.027), ('dy', 0.027), ('exponent', 0.027), ('proof', 0.027), ('property', 0.027), ('arg', 0.027), ('schemes', 0.026), ('arzel', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
2 0.26561344 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
3 0.23551781 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and L´ vy processes, e with the Dirichlet processes being a special case the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, L´ vy processes, e Dirichlet processes, integral operator, Gaussian processes c 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert. P ILLAI , W U , L IANG , M UKHERJEE AND W OLPERT
4 0.13254808 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.12515453 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
6 0.10852787 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
7 0.091224894 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
8 0.079521008 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
9 0.066794641 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
10 0.062882818 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
12 0.057095967 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
13 0.056138489 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
14 0.055178989 70 jmlr-2007-Ranking the Best Instances
15 0.054301061 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
16 0.048303511 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
17 0.048091758 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
18 0.046154745 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
19 0.045363769 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
20 0.04400865 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
topicId topicWeight
[(0, 0.291), (1, -0.454), (2, 0.307), (3, 0.109), (4, 0.091), (5, 0.078), (6, 0.176), (7, -0.06), (8, -0.205), (9, -0.034), (10, -0.062), (11, 0.019), (12, 0.037), (13, 0.055), (14, -0.064), (15, -0.015), (16, 0.029), (17, -0.023), (18, 0.018), (19, -0.021), (20, -0.085), (21, -0.067), (22, -0.074), (23, -0.016), (24, 0.016), (25, -0.003), (26, -0.072), (27, 0.009), (28, 0.025), (29, -0.01), (30, 0.093), (31, -0.004), (32, -0.088), (33, -0.05), (34, 0.023), (35, 0.039), (36, 0.024), (37, 0.03), (38, 0.004), (39, 0.018), (40, 0.003), (41, 0.026), (42, 0.016), (43, -0.003), (44, 0.002), (45, -0.018), (46, 0.021), (47, -0.065), (48, 0.046), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94449741 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
2 0.93698937 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
3 0.76495647 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and L´ vy processes, e with the Dirichlet processes being a special case the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, L´ vy processes, e Dirichlet processes, integral operator, Gaussian processes c 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert. P ILLAI , W U , L IANG , M UKHERJEE AND W OLPERT
4 0.46220571 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.37882879 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
6 0.37192464 9 jmlr-2007-AdaBoost is Consistent
7 0.30851552 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.28063798 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
10 0.26518753 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
11 0.26300368 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.22707106 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
14 0.21976078 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
15 0.21883874 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
16 0.20148368 70 jmlr-2007-Ranking the Best Instances
17 0.19993041 44 jmlr-2007-Large Margin Semi-supervised Learning
18 0.19801968 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
19 0.19476962 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
20 0.19307539 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
topicId topicWeight
[(4, 0.012), (8, 0.043), (12, 0.025), (28, 0.599), (40, 0.045), (48, 0.014), (60, 0.063), (85, 0.018), (98, 0.084)]
simIndex simValue paperId paperTitle
1 0.94329137 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
same-paper 2 0.93232077 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
3 0.85789257 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
4 0.61027396 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
5 0.59357947 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza
Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory
6 0.56565922 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.50182092 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
10 0.49662104 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
11 0.48252508 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
12 0.47870547 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.47708446 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
14 0.47521803 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
15 0.47221819 44 jmlr-2007-Large Margin Semi-supervised Learning
16 0.47166216 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
17 0.4687402 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
18 0.46818775 77 jmlr-2007-Stagewise Lasso
19 0.46651182 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
20 0.46581081 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression