jmlr jmlr2007 jmlr2007-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. [sent-8, score-0.316]
2 We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. [sent-9, score-0.33]
3 This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. [sent-10, score-0.138]
4 Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. [sent-11, score-0.214]
5 It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. [sent-12, score-0.443]
6 Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. [sent-13, score-0.523]
7 Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number. [sent-14, score-0.146]
8 The target function $f_\rho^V$ that we want to learn or approximate is a minimizer (which may not be unique) of the error functional $\mathcal{E}(f) = \int_Z V(y, f(x))\,d\rho$ induced by a loss function $V: Y \times Y \to \mathbb{R}_+$, that is, $f_\rho^V = \arg\min\{\mathcal{E}(f) : f \text{ is a measurable function from } X \text{ to } Y\}$. [sent-18, score-0.199]
9 Then the error can be written as $\mathcal{E}(f) = \int_X \int_Y V(y, f(x))\,d\rho(y|x)\,d\rho_X(x)$, and we choose $f_\rho^V$ to be a minimizer of the pointwise error: for almost every $x \in X$, $f_\rho^V(x) = \arg\min_{t \in Y} \int_Y V(y,t)\,d\rho(y|x)$. [sent-20, score-0.157]
10 This leads to the natural expectation that a minimizer $f_z$ of $\mathcal{E}_z$ over a set of functions $\mathcal{G}$, called the hypothesis space, would approximate a minimizer $f_{\mathcal{G}}$ of $\mathcal{E}$ in $\mathcal{G}$ (whose error is close to that of $f_\rho^V$ when $\mathcal{G}$ is large): $\mathcal{E}(f_z) \to \mathcal{E}(f_{\mathcal{G}})$ as $m \to \infty$. [sent-24, score-0.196]
11 For the hinge loss $\phi(yt) = \max\{1 - yt, 0\}$ used in the support vector machine for classification (Cortes and Vapnik, 1995), the target function $f_\rho^V(x) = f_c(x)$ is the Bayes rule, defined as $f_c(x) = 1$ if $\rho(y=1|x) \geq \rho(y=-1|x)$ and $f_c(x) = -1$ otherwise. [sent-30, score-0.148]
12 When $Y$ is a closed interval in $\mathbb{R}$, the uniform convergence of a real-valued function space $\mathcal{G}$ can be characterized by the property: for every $\varepsilon > 0$, there holds $\lim_{\ell \to +\infty} \sup_{\mu} \Pr\big\{ \sup_{m \geq \ell} \sup_{f \in \mathcal{G}} \big| \frac{1}{m}\sum_{i=1}^m f(x_i) - \int_X f(x)\,d\mu \big| > \varepsilon \big\} = 0$. [sent-36, score-0.717]
13 The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x,\cdot) : x \in X\}$ with the reproducing property (Aronszajn, 1950): $f(x) = \langle f, K_x \rangle_K$ for all $x \in X$, $f \in \mathcal{H}_K$. [sent-48, score-0.257]
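A minimal numerical sketch (not from the paper) of working in the span of Gaussian kernel sections: a function $f = \sum_j c_j K_{x_j}$ is evaluated through the reproducing property, $f(x) = \sum_j c_j K(x_j, x)$, and its RKHS norm is $\|f\|_K^2 = c^\top K c$. The sample points, coefficients, and variance parameter below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # K^sigma(x, y) = exp(-|x - y|^2 / sigma^2) for all pairs of rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))  # points x_1, ..., x_m (assumed)
c = rng.normal(size=20)                   # coefficients of f = sum_j c_j K_{x_j}
sigma = 0.5                               # illustrative variance parameter

K = gaussian_kernel(X, X, sigma)
rkhs_norm_sq = c @ K @ c                  # ||f||_K^2 = c^T K c

x_new = np.array([[0.3, -0.2]])
f_at_x = c @ gaussian_kernel(X, x_new, sigma)[:, 0]  # f(x) = <f, K_x>_K = sum_j c_j K(x_j, x)
print(rkhs_norm_sq, f_at_x)
```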
14 , 2000) associated with the kernel $K$, loss $V$ and a regularization parameter $\lambda > 0$, defined as $f_{z,\lambda} = \arg\min_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) + \lambda \|f\|_K^2$. [sent-52, score-0.229]
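For the least square loss this single-kernel scheme has a familiar closed form: by the representer theorem $f_{z,\lambda} = \sum_i c_i K_{x_i}$ with $(K + \lambda m I)c = y$, where $K$ is the kernel matrix on the sample. A hedged sketch of that solver (standard kernel ridge regression, not code from the paper):

```python
import numpy as np

def regularized_least_squares(K, y, lam):
    # Minimize (1/m) * sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2 over f in H_K.
    # Representer theorem: f = sum_i c_i K_{x_i}; the optimality condition gives
    # (K + lam * m * I) c = y.
    m = len(y)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    fitted = K @ c
    objective = np.mean((y - fitted) ** 2) + lam * c @ K @ c
    return c, objective
```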
15 In particular, for the least square loss or the hinge loss, it is a convex quadratic programming optimization problem which can handle large data settings. [sent-56, score-0.155]
16 These schemes involve a set of Mercer kernels $\{K^\sigma\}_{\sigma \in \Sigma}$ with an index set $\Sigma$ and, with regularization parameter $\lambda > 0$, take the form $f_{z,\lambda} := \arg\min_{\sigma \in \Sigma} \min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) + \lambda \|f\|_{K^\sigma}^2$ (4), where the loss function $V(y,\cdot)$ is usually convex for any $y \in Y$. [sent-66, score-0.345]
17 This leads to the following hypothesis space. Definition 1 The normalized hypothesis set $\mathcal{H}$ associated with the Mercer kernels $\{K^\sigma\}_{\sigma \in \Sigma}$ is defined as $\mathcal{H} = \bigcup_{\sigma \in \Sigma}\{f \in \mathcal{H}_{K^\sigma} : \|f\|_{K^\sigma} \leq 1\}$. [sent-76, score-0.177]
18 Theorem 2 Let $\{K^\sigma\}_{\sigma \in \Sigma}$ be a set of Mercer kernels on $X$ with $\kappa := \sup_{\sigma \in \Sigma}\sup_{x \in X}\sqrt{K^\sigma(x,x)} < \infty$. [sent-82, score-0.541]
19 The reproducing property (3) tells us that $\|K_x\|_{K^\sigma} = \sqrt{K^\sigma(x,x)} \leq \kappa$ for any $x \in X$ and $\sigma \in \Sigma$. [sent-85, score-0.182]
20 We shall use this idea to establish the learnability of Gaussian kernels with flexible variances. [sent-88, score-0.138]
21 2. uGC Property for Gaussians with Flexible Variances. The second purpose of this paper is to verify the learnability of Gaussian kernels with flexible variances, stated in the form of the following uGC property. [sent-90, score-0.252]
22 For the neighborhood $[0,\varepsilon)^n$ of $0$ with $\varepsilon > 0$, we see that $\sup_{\sigma \in (0,\infty)} \sup_{t \in [0,\varepsilon)^n} |f_\sigma(t) - f_\sigma(0)| = \sup_{\sigma \in (0,\infty)} \big|\exp\{-n\varepsilon^2/\sigma^2\} - 1\big| = 1$, which does not tend to $0$ as $\varepsilon \to 0$. [sent-115, score-0.660]
23 In the next section we show the implications of the above main theorems to the error analysis for the regularization scheme (4). [sent-119, score-0.172]
24 In particular, we present error rates of two examples involving Gaussian kernels with flexible variances. [sent-120, score-0.181]
25 In Sections 5 and 6 we develop error bounds for the multi-kernel regularization scheme (4), especially for Gaussian kernels with flexible variances. [sent-122, score-0.273]
26 We postpone the derivation of error rates for regression with least square loss and classification with hinge loss to the end of this paper. [sent-124, score-0.277]
27 In particular, we expect to derive explicit rates for multi-kernel regularized learning algorithms associated with Gaussian kernels with flexible variances. [sent-128, score-0.183]
28 The last term, called the regularization error, is independent of the samples (Niyogi and Girosi, 1996; Cucker and Smale, 2001; Smale and Zhou, 2003) and measures the approximation ability of the multi-kernel space $\bigcup_{\sigma \in \Sigma}\mathcal{H}_{K^\sigma}$. [sent-134, score-0.131]
29 Definition 5 The regularization error associated with the regularizing function $f_\lambda \in \mathcal{H}_{K^\sigma}$ with $\sigma \in \Sigma$ is defined as $\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho^V) + \lambda \|f_\lambda\|_{K^\sigma}^2$. [sent-135, score-0.131]
30 The regularization error of the system (4) is $\mathcal{D}(\lambda) = \min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}}\{\mathcal{E}(f) - \mathcal{E}(f_\rho^V) + \lambda \|f\|_{K^\sigma}^2\}$ (11), where $f_\lambda$ takes the special form $f_\lambda = f_\lambda^V := \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}}\{\mathcal{E}(f) + \lambda \|f\|_{K^\sigma}^2\}$. [sent-136, score-0.278]
31 By the error decomposition (13), the learning rate depends on trading off the sample error and the regularization error. [sent-141, score-0.173]
32 The decay of the regularization error $\mathcal{D}(\lambda)$ relies on the regularity (smoothness) of the target function $f_\rho^V$ for most commonly used loss functions, which will be discussed in Section 6. [sent-142, score-0.206]
33 Let us give two examples both involving the Gaussian kernels (8) with flexible variances to illustrate the learning rates whose proofs will be given in Section 6. [sent-148, score-0.253]
34 The first example is regularized regression with the least square loss $V(y,t) = (y-t)^2$. [sent-149, score-0.141]
35 Then the multi-kernel least square regularized regression algorithm (4) associated with the Gaussian kernels (8) can be written as $f_{z,\lambda} = \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_{K^\sigma}^2$. [sent-151, score-0.287]
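A hedged sketch of the multi-kernel version: the continuous index set $\Sigma$ is replaced by a finite grid of variances, the inner regularized least squares problem is solved in closed form for each $\sigma$, and the $\sigma$ achieving the smallest regularized empirical objective is kept. The grid, noise level, and data below are assumptions for illustration only, not the paper's experiments.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def multi_kernel_least_squares(X, y, lam, sigmas):
    # Scheme (4) with square loss: for each sigma in a finite grid standing in for Sigma,
    # solve the kernel regularized least squares problem and keep the best objective.
    m = len(y)
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, sigma)
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        obj = np.mean((y - K @ c) ** 2) + lam * c @ K @ c
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best  # (objective value, selected sigma, coefficients)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=50)
obj, sigma_star, c = multi_kernel_least_squares(X, y, lam=1e-2, sigmas=np.logspace(-1, 1, 20))
```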
36 The second example is regularized classification with the hinge loss $V(y,t) = (1 - yt)_+ := \max\{1 - yt, 0\}$. [sent-170, score-0.144]
37 The multi-kernel SVM regularized classification algorithm (4) associated with the Gaussian kernels (8) is defined to be a minimizer of the optimization problem $f_{z,\lambda} = \arg\min_{\sigma \in \Sigma}\min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (1 - y_i f(x_i))_+ + \lambda \|f\|_{K^\sigma}^2$. [sent-173, score-0.349]
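A hedged sketch for the hinge loss: by the representer theorem the inner problem reduces to minimizing $\frac{1}{m}\sum_i (1 - y_i (Kc)_i)_+ + \lambda c^\top K c$ over $c \in \mathbb{R}^m$, a convex but non-smooth problem. Plain subgradient descent is used below purely for illustration (the step sizes and the finite grid standing in for $\Sigma$ are assumptions); any convex solver for this objective would do as well.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def fit_hinge(K, y, lam, steps=2000):
    # Subgradient descent on J(c) = (1/m) * sum_i (1 - y_i (Kc)_i)_+ + lam * c^T K c.
    m = len(y)
    c = np.zeros(m)
    for t in range(steps):
        margins = y * (K @ c)
        active = (margins < 1.0).astype(float)        # where the hinge is active
        grad = -(K @ (active * y)) / m + 2.0 * lam * (K @ c)
        c -= 0.5 / np.sqrt(t + 1.0) * grad
    return c

def multi_kernel_svm(X, y, lam, sigmas):
    # Outer minimization of scheme (4) over a finite grid standing in for Sigma.
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, sigma)
        c = fit_hinge(K, y, lam)
        obj = np.mean(np.maximum(0.0, 1.0 - y * (K @ c))) + lam * c @ K @ c
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best
```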
38 The error analysis of classification algorithms often aims at understanding the approximating behaviors of the excess misclassification error R (sgn( fz,λ )) − R ( fc ) as the sample size m becomes large. [sent-178, score-0.132]
39 Our learning rate for the hinge loss assumes a separable condition which was introduced by Chen et al. [sent-179, score-0.139]
40 Definition 6 We say that $\rho$ is separable by $\mathcal{H}_\Sigma$ if there is some $f_{sp} \in \mathcal{H}_{K^\sigma}$ with some $\sigma \in \Sigma$ such that $y f_{sp}(x) > 0$ almost surely. [sent-181, score-0.142]
41 It has separation exponent $\theta \in (0,\infty]$ if we can choose $f_{sp}$ and positive constants $\Delta, c_\theta$ such that $\|f_{sp}\|_{K^\sigma} = 1$ and $\rho_X\{x \in X : |f_{sp}(x)| < \Delta t\} \leq c_\theta t^\theta$ for all $t > 0$. [sent-182, score-0.167]
42 , 1998; Vapnik, 1998) that $\rho$ is said to be strictly separable with margin $\gamma > 0$ if $\rho$ is separable together with the requirement $y f_{sp}(x) \geq \gamma$ almost everywhere. [sent-187, score-0.144]
43 It is observed in applications that allowing flexible variances improves the learnability of Gaussian kernels. [sent-193, score-0.151]
44 As we shall see in Section 6 for the least square loss, when the regression function fρ has Sobolev smoothness, the regularization error (11) associated with Gaussians with flexible variances decays as O (λs ) for some s > 0. [sent-194, score-0.334]
45 Such a decay is impossible for the regularization error associated with a single Gaussian kernel, at least when ρX is the Lebesgue measure on X, as shown by Smale and Zhou (2003). [sent-196, score-0.164]
46 This demonstrates that learning algorithms using Gaussian kernels with flexible variances have advantages for many applications. [sent-197, score-0.215]
47 Another way to obtain improved error rates is to take kernels changing with the sample size. [sent-198, score-0.181]
48 Kernels of this kind include polynomial kernels with changing degrees (Zhou and Jetter, 2006) and Gaussian kernels with changing variances (Steinwart, 2001; Steinwart and Scovel, 2005). [sent-199, score-0.316]
49 (a) $G_\sigma \in \mathcal{H}_\sigma$ and $\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \|G_\sigma\|_\sigma$. (b) $\|G_\sigma\|_\sigma^2 \leq 2\|G_\sigma\|_{C(X)}$ and $\|G_\sigma\|_{C(X)} \leq \kappa_\sigma \|G_\sigma\|_\sigma$. [sent-225, score-0.22]
50 This, in connection with (3) and property (b) of Proposition 7, tells us that for any $f \in \mathcal{H}_\sigma$, $E_m(f) - E(f) = \langle f, \frac{1}{m}\sum_{i=1}^m K^\sigma_{x_i} - F_\sigma \rangle_\sigma = \langle f, G_\sigma \rangle_\sigma$. Then $\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} \langle f, G_\sigma \rangle_\sigma = \|G_\sigma\|_\sigma$. [sent-228, score-0.625]
51 So we have $\|G_\sigma\|_\sigma^2 = \frac{1}{m}\sum_{i=1}^m \big(\frac{1}{m}\sum_{j=1}^m K^\sigma(x_i,x_j) - F_\sigma(x_i)\big) - \int_X \big(\frac{1}{m}\sum_{i=1}^m K^\sigma(x_i,y) - F_\sigma(y)\big)\,d\mu(y) = \frac{1}{m}\sum_{i=1}^m G_\sigma(x_i) - \int_X G_\sigma(y)\,d\mu(y) \leq 2\sup_{y \in X}|G_\sigma(y)| = 2\|G_\sigma\|_{C(X)}$. [sent-234, score-0.22]
52 By Part (a) of Lemma 8, $\sup_{f \in \mathcal{H}} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\sup_{f \in \mathcal{H}_\sigma, \|f\|_\sigma \leq 1} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\|G_\sigma\|_\sigma$. [sent-241, score-0.22]
53 On the other hand, since $\{K^\sigma_y : \sigma \in \Sigma, y \in X\}$ is exactly the set $\mathcal{F}$ according to its definition (7), we see that $\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| = \sup_{\sigma \in \Sigma}\sup_{y \in X} \big|\frac{1}{m}\sum_{i=1}^m K^\sigma(y, x_i) - F_\sigma(y)\big| = \sup_{\sigma \in \Sigma}\|G_\sigma\|_{C(X)}$. [sent-242, score-0.88]
54 This, in connection with Lemma 8, implies that for any $\varepsilon > 0$ and $\ell \in \mathbb{N}$, there holds $\{\sup_{m \geq \ell}\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| > \kappa\varepsilon\} \subseteq \{\sup_{m \geq \ell}\sup_{f \in \mathcal{H}} |E_m(f) - E(f)| > \varepsilon\} \subseteq \{\sup_{m \geq \ell}\sup_{f \in \mathcal{F}} |E_m(f) - E(f)| > \varepsilon^2/2\}$ (the last threshold coming from $\|G_\sigma\|_\sigma^2 \leq 2\|G_\sigma\|_{C(X)}$). Therefore, $\mathcal{H}$ is uGC if and only if $\mathcal{F}$ is. [sent-243, score-0.66]
55 Gaussian Kernels Provide uGC Hypothesis Sets In this section we establish the uGC property of Gaussians with flexible variances stated in Theorem 3. [sent-247, score-0.141]
56 The metric entropy of $\mathcal{G}$ is defined as $H_{m,p}(\mathcal{G}, \eta) = \sup_{\mathbf{x} \in X^m} \log \mathcal{N}_p(\mathcal{G}, \mathbf{x}, \eta)$, for $m \in \mathbb{N}$, $\eta > 0$. [sent-253, score-0.22]
57 (b) For every $m \in \mathbb{N}$ and $0 < \varepsilon < 1$, there holds $\sup_{\mathbf{x} \in X^m} \mathcal{N}_\infty(\mathcal{G}, \mathbf{x}, \varepsilon) \leq 2\big(\frac{4m}{\varepsilon^2}\big)^{1 + d\log(2em/\varepsilon)}$, where $d := P_{\varepsilon/4}(\mathcal{G})$. [sent-268, score-0.22]
58 Applying Lemma 12 to $\mathcal{G} = \mathcal{F}_j$, a set of functions on $\mathbb{R}$, we find that for every $0 < \varepsilon < 1$, $P_{\varepsilon/4}(\mathcal{F}_j) \leq \frac{16}{\varepsilon} + 2$ and $\sup_{\mathbf{t} \in \mathbb{R}^m} \mathcal{N}_\infty(\mathcal{F}_j, \mathbf{t}, \varepsilon) \leq N_0 := 2\big(\frac{4m}{\varepsilon^2}\big)^{1 + (\frac{16}{\varepsilon}+2)\log(2em/\varepsilon)}$. [sent-301, score-0.22]
59 To this end, let $0 < \varepsilon < 1$ and $\mathbf{x} = (x_i)_{i=1}^m$, where each point $x_i \in \mathbb{R}^n$ can be expressed as a vector $x_i = (x_{i,1}, \ldots, x_{i,n})$. [sent-303, score-0.154]
60 Error Bound by Rademacher Averages. In order to bound the error $\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\rho^V)$, we know from the error decomposition (13) that it is sufficient to estimate the sample error and the regularization error. [sent-355, score-0.215]
61 The regularization error will be discussed in the next subsection. [sent-359, score-0.131]
62 The estimate of the sample error for the regularized multi-kernel scheme (4) involves the hypothesis space $\mathcal{H}_\lambda = \bigcup_{\sigma \in \Sigma}\{f \in \mathcal{H}_{K^\sigma} : \|f\|_{K^\sigma} \leq \sqrt{M/\lambda}\}$ (23). [sent-360, score-0.165]
63 For every integer $m$, let $R_m(F) := E_\mu E_\varepsilon\big[\sup_{f \in F}\big|\frac{1}{m}\sum_{i=1}^m \varepsilon_i f(z_i)\big|\big]$, where $\{z_i\}_{i=1}^m$ are independent random variables distributed according to $\mu$ and $\{\varepsilon_i\}_{i=1}^m$ are independent Rademacher random variables, that is, $P(\varepsilon_i = +1) = P(\varepsilon_i = -1) = 1/2$. [sent-364, score-0.22]
64 Definition 15 We say that the loss function $V: Y \times \mathbb{R} \to \mathbb{R}_+$ is admissible (with respect to $\rho$) if it is convex with respect to the second variable, $M = \|V(y,0)\|_{L^\infty_\rho(Z)} < +\infty$ and, for any $\lambda > 0$, there holds $C_\lambda = \sup\big\{\max(|V'_-(y,t)|, |V'_+(y,t)|) : y \in Y, |t| \leq \kappa\sqrt{M/\lambda}\big\} < +\infty$. [sent-368, score-0.298]
65 , m}, $\phi_i: \mathbb{R} \to \mathbb{R}$ is a function with $\phi_i(0) = 0$ having a Lipschitz constant $c_i$, then for any $\{x_i\}_{i=1}^m$, $E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m \varepsilon_i \phi_i(f(x_i))\big| \leq 2 E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m c_i \varepsilon_i f(x_i)\big|$. [sent-376, score-0.22]
66 f ∈F i=1 f ∈F i=1 The sample error analysis of multi-kernel regularization scheme (4) involves the Rademacher complexity of the fundamental space F denoted by Rm (F ) = EρX Eε m 1 sup | ∑ εi f (xi )| . [sent-377, score-0.392]
67 Lemma 17 Let $\mathcal{H}_\lambda$ be defined by (23); then $E \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)| \leq 4 C_\lambda \big(\frac{M R_m(\mathcal{F})}{\lambda}\big)^{1/2} + \frac{2M}{\sqrt{m}}$. [sent-378, score-0.22]
68 By symmetrization, $E_\rho E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)| \leq 2 E_\rho E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} \big|\frac{1}{m}\sum_{i=1}^m \varepsilon_i V(y_i, f(x_i))\big| = 2 R_m(\mathcal{V}_\lambda)$. To handle $R_m(\mathcal{V}_\lambda)$, we apply Lemma 16. [sent-385, score-0.44]
69 It follows that $E_\varepsilon \sup_{f \in \mathcal{H}_\lambda} \big|\sum_{i=1}^m \varepsilon_i V(y_i, f(x_i))\big| \leq 2 C_\lambda \sqrt{M/\lambda}\, E_\varepsilon \sup_{f \in \mathcal{H}} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big| + M\sqrt{m}$. [sent-389, score-0.22]
70 But by the definition of the fundamental set $\mathcal{F}$, we know that $\sup_{\sigma \in \Sigma} \sum_{i,j=1}^m \varepsilon_i K^\sigma(x_i, x_j)\varepsilon_j \leq m \sup_{\sigma \in \Sigma}\sup_{s \in X} \big|\sum_{i=1}^m \varepsilon_i K^\sigma(x_i, s)\big| = m \sup_{f \in \mathcal{F}} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big|$. [sent-391, score-0.22]
71 (27) i=1 Combining the estimates (25), (26), and (27) implies that m Rm (Vλ ) ≤ 2Cλ M λ ≤ 2Cλ M λ EρX Eε 1 m Rm (F ) sup f ∈F | ∑ εi f (xi )| 1/2 i=1 + 1/2 M +√ m M √ . [sent-392, score-0.22]
72 The following error bound for the multi-kernel scheme (4) is a straightforward consequence of the sample error estimate and the error decomposition (13). [sent-394, score-0.167]
73 Together with the fact $E[\mathcal{E}_z(f_\lambda^V)] = \mathcal{E}(f_\lambda^V)$ and $f_{z,\lambda} \in \mathcal{H}_\lambda$ defined in (23), we know that $E\big[\mathcal{E}(f_{z,\lambda}) - \mathcal{E}_z(f_{z,\lambda}) + \mathcal{E}_z(f_\lambda^V) - \mathcal{E}(f_\lambda^V)\big] \leq E \sup_{f \in \mathcal{H}_\lambda} |\mathcal{E}(f) - \mathcal{E}_z(f)|$, which in connection with Lemma 17 yields the desired estimate. [sent-399, score-0.22]
74 Then there exists an absolute constant $C$ such that, for any sample $\{x_i\}_{i=1}^m$, there holds $\frac{1}{\sqrt{m}} E_\varepsilon \sup_{f \in F} \big|\sum_{i=1}^m \varepsilon_i f(x_i)\big| \leq C \int_0^{+\infty} \sqrt{\log \mathcal{N}_2(F, \mathbf{x}, \varepsilon)}\, d\varepsilon$. [sent-402, score-0.22]
75 Applying the estimate in the proof of Theorem 3, we can compute the Rademacher average $R_m(\mathcal{F})$ for Gaussian kernels with flexible variances, which yields the subsequent error bound. [sent-403, score-0.22]
76 In order to get explicit error rates, we see from Theorems 18 and 20 that what is left is to estimate the regularization error. [sent-413, score-0.131]
77 2 Regularization Error with Gaussians In this subsection we exclusively focus on the multi-kernel regularization error (11) associated with the least square loss and Gaussian kernels. [sent-415, score-0.228]
78 We show how Fourier analysis (Stein, 1970) can be applied to obtain polynomial decay of the regularization error under a Sobolev smoothness condition on the regression function. [sent-416, score-0.164]
79 Before we go to the main point, it is worth briefly mentioning why the multi-kernel regularization error can improve the error rates. [sent-417, score-0.173]
80 To this end, note that, for the regularization error of a single Gaussian kernel, it was proved by Smale and Zhou (2003) that the polynomial decay O (λ s ) for some s > 0 is impossible under the Sobolev smoothness hypothesis on the regression function f ρ . [sent-418, score-0.202]
81 Below we show that the multi-kernel scheme (4) associated with the least square loss and Gaussian kernels (9) with flexible isotropic variances can give regularization errors D (λ) of polynomial decays O (λβ ) for some β > 0, under the assumption of Sobolev smoothness on f ρ . [sent-421, score-0.476]
82 To estimate the multi-kernel regularization error (11), we introduce some basic facts about RKHS (Aronszajn, 1950). [sent-423, score-0.131]
83 Proof We shall use notation and results on limits of reproducing kernels (see Theorem I in Section 9 of Aronszajn, 1950). [sent-438, score-0.175]
84 $\|f^*\|^2_{\mathcal{H}_{K^\sigma}(E_j)} = \lim_{\ell \to \infty} \langle f^*, f_{\sigma,j,\ell}|_{E_j} \rangle_{\mathcal{H}_{K^\sigma}(E_j)} \leq \|f^*\|_{\mathcal{H}_{K^\sigma}(E_j)} \liminf_{\ell \to \infty} \|f_{\sigma,j,\ell}|_{E_j}\|_{\mathcal{H}_{K^\sigma}(E_j)}$, which tells us that $\|f^*\|_{\mathcal{H}_{K^\sigma}(E_j)} \leq \liminf_{\ell \to \infty} \|f_{\sigma,j,\ell}|_{E_j}\|_{\mathcal{H}_{K^\sigma}(E_j)} \leq (\sqrt{\pi}\,\sigma)^{n/2} \|f\|_{L^2(\mathbb{R}^n)}$. [sent-451, score-0.332]
85 Consider the multi-kernel scheme (4) with the least square loss $V(y,t) = (y-t)^2$ and the Gaussian kernels (9) with $\Sigma = (0,\infty)$. [sent-456, score-0.239]
86 Finally, we are able to derive error rates for the multi-kernel regularization scheme (4) associated with Gaussian kernels (8) with flexible variances. [sent-489, score-0.311]
87 This is done by putting the improved regularization error bound in Proposition 22 into the total error bound in Theorem 20. [sent-490, score-0.173]
88 We demonstrate the approach for regression with least square loss and for classification with hinge loss. [sent-491, score-0.155]
89 If $|y| \leq M_0$ almost surely, the least square loss $V(y,s) = (y-s)^2$ is admissible with respect to $\rho$ and $M = \|V(y,0)\|_{L^\infty_\rho(Z)} \leq M_0^2$, $C_\lambda \leq 2M_0(1 + \kappa/\sqrt{\lambda})$. [sent-495, score-0.133]
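A short worked check of that constant, under the stated assumption $|y| \leq M_0$ (a sketch of the verification, not quoted from the paper):

```latex
% One-sided derivatives of the square loss: V'(y,t) = 2(t - y), evaluated on |t| <= kappa*sqrt(M/lambda)
\[
  C_\lambda \;=\; \sup\Big\{\, 2|y - t| \;:\; y \in Y,\ |t| \le \kappa\sqrt{M/\lambda} \,\Big\}
  \;\le\; 2\Big(M_0 + \kappa\sqrt{M}/\sqrt{\lambda}\Big)
  \;\le\; 2M_0\big(1 + \kappa/\sqrt{\lambda}\big),
\]
% using sqrt(M) <= M_0, since M = ||V(y,0)||_infty = ||y^2||_infty <= M_0^2.
```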
90 Since the set (8) of Gaussian kernels with general variances $\sigma = (\sigma_1, \ldots, \sigma_n) \in (0,\infty)^n$ contains the set (9) of Gaussian kernels with isotropic variances, [sent-502, score-0.06]
91 we see that $\mathcal{D}(\lambda) \leq \inf_{\sigma \in (0,\infty)} \inf_{f \in \mathcal{H}_{K^\sigma}} \big\{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_{K^\sigma}^2\big\}$. [sent-505, score-0.181]
92 Now we move on to establish Example 2 for the regularized classification with the hinge loss V (y, s) = (1 − ys)+ . [sent-511, score-0.144]
93 An important relation between the excess misclassification error and the excess error was given by Zhang (2004) as $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq \mathcal{E}(f) - \mathcal{E}(f_c)$ for all $f: X \to \mathbb{R}$. [sent-514, score-0.18]
94 (1997) as $\sup_{\mathbf{x} \in X^m} \mathcal{N}_\infty(\mathcal{G}, \mathbf{x}, \varepsilon) \leq 2\big(m(\tfrac{2}{\varepsilon})^2\big)^{\log(y+1)}$, where $y = \sum_{i=1}^{d_0} \binom{m}{i}\big(\tfrac{2}{\varepsilon}\big)^i \leq \big(\tfrac{2}{\varepsilon}\big)^{d_0}\big(\tfrac{em}{d_0}\big)^{d_0} \leq \big(\tfrac{2em}{\varepsilon}\big)^{d_0} \leq \big(\tfrac{2em}{\varepsilon}\big)^{d}$. [sent-545, score-0.267]
95 It tells us the following result: if $T$ is a bounded subset of $\mathbb{R}^m$, each function $\phi_i$ with Lipschitz constant not more than $1$ satisfies $\phi_i(0) = 0$, and a function $G: \mathbb{R}_+ \to \mathbb{R}_+$ is convex and nondecreasing, then there holds $E_\varepsilon G\big(\frac{1}{2}\sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i \phi_i(t_i)\big|\big) \leq E_\varepsilon G\big(\sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i t_i\big|\big)$ (44). [sent-551, score-0.22]
96 Applying (44) with $G(u) = u$ and $\bar{\phi}_i(x) = \phi_i(x/c_i)$, we have $E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m \varepsilon_i \phi_i(f(x_i))\big| = E_\varepsilon \sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i \bar{\phi}_i(t_i)\big| \leq 2 E_\varepsilon \sup_{t \in T}\big|\sum_{i=1}^m \varepsilon_i t_i\big| = 2 E_\varepsilon \sup_{f \in F}\big|\sum_{i=1}^m c_i \varepsilon_i f(x_i)\big|$, which proves our second statement. [sent-554, score-0.22]
97 In this appendix, we verify the existence of a minimizer for the scheme (4) involving the least-square loss and Gaussians (9) with flexible variances, under some mild conditions on the sample z. [sent-559, score-0.236]
98 Consider the scheme with $0 < b < \infty$: $f_{z,\lambda} := \arg\min_{\sigma \in (0,b]} \min_{f \in \mathcal{H}_{K^\sigma}} \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K^\sigma}^2$. [sent-567, score-0.187]
99 The above existence result largely depends on the least square loss and the assumption on the data. [sent-606, score-0.136]
100 It remains an open problem on how to prove the existence of the minimizer of the multi-kernel scheme (4) associated with general loss functions and data. [sent-607, score-0.18]
wordName wordTfidf (topN-words)
[('fz', 0.409), ('hk', 0.406), ('ez', 0.248), ('ugc', 0.234), ('sup', 0.22), ('rn', 0.217), ('hou', 0.185), ('aussians', 0.171), ('earnability', 0.171), ('fe', 0.167), ('lk', 0.136), ('rademacher', 0.119), ('variances', 0.114), ('kernels', 0.101), ('exible', 0.095), ('kx', 0.093), ('smale', 0.091), ('regularization', 0.089), ('alon', 0.082), ('tells', 0.081), ('ej', 0.078), ('xi', 0.077), ('ing', 0.077), ('gaussians', 0.074), ('zhou', 0.074), ('reproducing', 0.074), ('ky', 0.067), ('ying', 0.067), ('fsp', 0.066), ('rm', 0.065), ('borel', 0.064), ('lemma', 0.064), ('proposition', 0.06), ('kxi', 0.06), ('yi', 0.059), ('hinge', 0.058), ('minimizer', 0.058), ('im', 0.057), ('mercer', 0.057), ('covering', 0.057), ('lim', 0.057), ('yt', 0.056), ('square', 0.055), ('sgn', 0.053), ('sobolev', 0.053), ('gaussian', 0.051), ('scovel', 0.05), ('fc', 0.048), ('em', 0.047), ('cucker', 0.046), ('rkhs', 0.046), ('dt', 0.045), ('ledoux', 0.045), ('regularized', 0.044), ('steinwart', 0.043), ('error', 0.042), ('theorem', 0.042), ('loss', 0.042), ('kernel', 0.041), ('scheme', 0.041), ('kong', 0.04), ('shattered', 0.04), ('inf', 0.04), ('lebesgue', 0.04), ('supm', 0.04), ('separable', 0.039), ('lipschitz', 0.039), ('existence', 0.039), ('rates', 0.038), ('hypothesis', 0.038), ('dudley', 0.038), ('closure', 0.037), ('learnability', 0.037), ('compact', 0.037), ('sp', 0.037), ('koltchinskii', 0.037), ('admissible', 0.036), ('decays', 0.034), ('hong', 0.034), ('fourier', 0.034), ('aronszajn', 0.034), ('yiming', 0.033), ('mj', 0.033), ('wu', 0.033), ('decay', 0.033), ('statement', 0.032), ('bartlett', 0.031), ('min', 0.03), ('evgeniou', 0.03), ('panchenko', 0.03), ('stein', 0.03), ('dx', 0.027), ('talagrand', 0.027), ('dy', 0.027), ('exponent', 0.027), ('proof', 0.027), ('property', 0.027), ('arg', 0.027), ('schemes', 0.026), ('arzel', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
2 0.26561344 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
3 0.23551781 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and L´ vy processes, e with the Dirichlet processes being a special case the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, L´ vy processes, e Dirichlet processes, integral operator, Gaussian processes c 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert. P ILLAI , W U , L IANG , M UKHERJEE AND W OLPERT
4 0.13254808 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.12515453 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
6 0.10852787 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
7 0.091224894 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
8 0.079521008 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
9 0.066794641 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
10 0.062882818 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
12 0.057095967 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
13 0.056138489 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
14 0.055178989 70 jmlr-2007-Ranking the Best Instances
15 0.054301061 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
16 0.048303511 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
17 0.048091758 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
18 0.046154745 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
19 0.045363769 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
20 0.04400865 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
topicId topicWeight
[(0, 0.291), (1, -0.454), (2, 0.307), (3, 0.109), (4, 0.091), (5, 0.078), (6, 0.176), (7, -0.06), (8, -0.205), (9, -0.034), (10, -0.062), (11, 0.019), (12, 0.037), (13, 0.055), (14, -0.064), (15, -0.015), (16, 0.029), (17, -0.023), (18, 0.018), (19, -0.021), (20, -0.085), (21, -0.067), (22, -0.074), (23, -0.016), (24, 0.016), (25, -0.003), (26, -0.072), (27, 0.009), (28, 0.025), (29, -0.01), (30, 0.093), (31, -0.004), (32, -0.088), (33, -0.05), (34, 0.023), (35, 0.039), (36, 0.024), (37, 0.03), (38, 0.004), (39, 0.018), (40, 0.003), (41, 0.026), (42, 0.016), (43, -0.003), (44, 0.002), (45, -0.018), (46, 0.021), (47, -0.065), (48, 0.046), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94449741 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
2 0.93698937 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
3 0.76495647 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and L´ vy processes, e with the Dirichlet processes being a special case the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, L´ vy processes, e Dirichlet processes, integral operator, Gaussian processes c 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert. P ILLAI , W U , L IANG , M UKHERJEE AND W OLPERT
4 0.46220571 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
5 0.37882879 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
6 0.37192464 9 jmlr-2007-AdaBoost is Consistent
7 0.30851552 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.28063798 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
10 0.26518753 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
11 0.26300368 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.22707106 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
14 0.21976078 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
15 0.21883874 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
16 0.20148368 70 jmlr-2007-Ranking the Best Instances
17 0.19993041 44 jmlr-2007-Large Margin Semi-supervised Learning
18 0.19801968 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
19 0.19476962 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
20 0.19307539 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
topicId topicWeight
[(4, 0.012), (8, 0.043), (12, 0.025), (28, 0.599), (40, 0.045), (48, 0.014), (60, 0.063), (85, 0.018), (98, 0.084)]
simIndex simValue paperId paperTitle
1 0.94329137 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
same-paper 2 0.93232077 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
3 0.85789257 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces L OG A N -H —a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The L OG A N -H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
4 0.61027396 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
5 0.59357947 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza
Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory
6 0.56565922 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.50182092 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
10 0.49662104 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
11 0.48252508 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
12 0.47870547 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.47708446 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
14 0.47521803 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
15 0.47221819 44 jmlr-2007-Large Margin Semi-supervised Learning
16 0.47166216 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
17 0.4687402 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
18 0.46818775 77 jmlr-2007-Stagewise Lasso
19 0.46651182 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
20 0.46581081 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression