jmlr jmlr2009 jmlr2009-16 knowledge-graph by maker-knowledge-mining

16 jmlr-2009-Classification with Gaussians and Convex Loss


Source: pdf

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Our main goal is to provide fast convergence rates for the excess misclassification error. [sent-5, score-0.209]

2 Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. [sent-6, score-0.178]

3 Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. [sent-7, score-0.314]

4 The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. [sent-8, score-0.308]

5 Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation. [sent-10, score-0.211]

6 Introduction In this paper we study binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. [sent-11, score-0.136]

7 The classifier minimizing the misclassification error is called the Bayes rule fc and is given by fc (x) = 1, if P(y = 1|x) ≥ P(y = −1|x), −1, otherwise. [sent-17, score-0.532]

8 The performance of a classifier C can be measured by the excess misclassification error R (C ) − R ( fc ). [sent-18, score-0.439]
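
For concreteness, here is a minimal Monte Carlo sketch of the two quantities just defined, the Bayes rule fc and the excess misclassification error R(C) − R(fc). It assumes access to the conditional probability η(x) = P(y = 1|x), which is only available in simulations, and all function names are illustrative rather than taken from the paper.

```python
import numpy as np

def excess_misclassification(classifier, eta, X_samples):
    """Monte Carlo estimate of R(C) - R(f_c).

    classifier(X) should return labels in {+1, -1}; eta(X) returns
    P(y = 1 | x) for each sample.  The Bayes rule f_c predicts 1 exactly
    when P(y = 1 | x) >= P(y = -1 | x), i.e. when eta(x) >= 1/2.
    """
    p = eta(X_samples)                                   # P(y = 1 | x)
    risk = lambda pred: np.mean(np.where(pred == 1, 1.0 - p, p))
    bayes = np.where(p >= 0.5, 1, -1)                    # f_c(x)
    return risk(classifier(X_samples)) - risk(bayes)
```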

9 functions are generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. [sent-21, score-0.136]

10 Examples of classifying loss functions include the least-square loss φ_ls(t) = (1 − t)², the hinge loss φ_h(t) = (1 − t)_+ = max{1 − t, 0} for support vector machine (SVM) algorithms, and the r-norm SVM loss with 1 ≤ r < ∞ defined by φ_r(t) = (φ_h(t))^r. [sent-23, score-0.253]

11 With the loss φ and Gaussian kernel K σ , the Tikhonov regularization scheme is defined (Wahba, 1990; Evgeniou et al. [sent-26, score-0.163]

12 , 2000; Cristianini and Shawe-Taylor, 2000) with a sample z = {(x_i, y_i)}_{i=1}^m ∈ Z^m as the solution f_z = f_{z,σ,λ}^φ to the following minimization problem: f_z = arg min_{f ∈ H_σ} { (1/m) ∑_{i=1}^m φ(y_i f(x_i)) + λ‖f‖²_{H_σ} }. (2) [sent-27, score-1.254]

13 The purpose of this paper is to estimate the excess misclassification error R (sgn( fz )) − R ( fc ) as m → ∞. [sent-30, score-1.066]

14 Convergence rates will be derived under the choice of the parameters λ = λ(m) = m^{−γ}, σ = σ(m) = λ^ζ = m^{−γζ} (3) for some γ, ζ > 0 and conditions on the distribution ρ and the loss φ. [sent-31, score-0.11]
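
As an illustration of scheme (2) together with the parameter choice (3), the sketch below solves the minimization for the least-square loss φ_ls, where the representer theorem gives a closed-form solution. The kernel normalization K^σ(x, u) = exp(−|x − u|²/σ²) and all function names are assumptions made for this sketch, not code from the paper.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel K^sigma(x, u) = exp(-|x - u|^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def tikhonov_least_squares(X, y, gamma, zeta):
    """Scheme (2) with phi = phi_ls and the choice (3):
    lambda = m^{-gamma}, sigma = lambda^{zeta} = m^{-gamma*zeta}.
    For the least-square loss the representer theorem gives
    f_z = sum_i alpha_i K^sigma(x_i, .) with alpha = (K + m*lambda*I)^{-1} y."""
    m = len(y)
    lam = m ** (-gamma)
    sigma = lam ** zeta
    K = gaussian_gram(X, sigma)
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    return alpha, sigma

def predict_sign(alpha, sigma, X_train, X_test):
    """Classifier sgn(f_z) on new points."""
    sq = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.sign(np.exp(-sq / sigma ** 2) @ alpha)
```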

15 Let us demonstrate our main results by stating learning rates for the least-square loss φ = φls . [sent-35, score-0.11]

16 They are given by means of a Tsybakov noise condition (Tsybakov, 2004) and a function smoothness condition stated in terms of Sobolev spaces. [sent-37, score-0.156]

17 In general, noise condition (5) does not require smoothness of fρ in domains away from the decision boundary. [sent-59, score-0.127]

18 Note that as t → −∞, the hinge loss φ_h for the SVM studied in Steinwart and Scovel (2007) increases slowly: φ_h(t) = O(|t|), while the least-square loss φ_ls in Theorem 1 increases moderately with φ_ls(t) = O(|t|²). [sent-60, score-0.114]

19 Difficulty arises for the error analysis with a general loss φ when φ(t) increases fast, such as φ = φ_r with very large r or the exponential-hinge loss we introduce in this paper as φ_eh(t) = max{e^{1−t} − 1, 0}, that is, φ_eh(t) = e^{1−t} − 1 if t ≤ 1 and 0 otherwise. [sent-61, score-0.148]
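
The four classifying losses that appear above can be written down directly; a small sketch (NumPy assumed, names illustrative):

```python
import numpy as np

def phi_ls(t):
    """Least-square loss: (1 - t)^2."""
    return (1.0 - t) ** 2

def phi_h(t):
    """Hinge loss: (1 - t)_+ = max{1 - t, 0}."""
    return np.maximum(1.0 - t, 0.0)

def phi_r(t, r):
    """r-norm SVM loss, 1 <= r < inf: (phi_h(t))^r."""
    return phi_h(t) ** r

def phi_eh(t):
    """Exponential-hinge loss: max{e^{1-t} - 1, 0}."""
    return np.maximum(np.exp(1.0 - t) - 1.0, 0.0)
```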

20 In particular, explicit learning rates will be given in Section 4 for the r-norm SVM loss φr (Theorem 4) and the exponential-hinge loss φeh (Theorem 5). [sent-65, score-0.167]

21 Comparing with Theorem 1, we shall provide at the end of Section 4 an approximation theory viewpoint on the effect of various loss functions for learning algorithm (2): the exponential-hinge loss has some advantages over φ_ls and φ_r, while the r-norm SVM loss φ_r may have worse performance when r > 2. [sent-66, score-0.195]

22 Two Special Properties of Gaussians and Key Bounds The novelty in our approach for general φ and kernels K σ arises from two special properties of the Gaussian kernels with changing variance σ > 0: nice approximation scheme and low capacity of the RKHS, described in Sections 2. [sent-69, score-0.216]

23 When φ(t) increases fast (as t → −∞), applying the regularizing function fσ,λ in the error analysis (described in Section 2. [sent-78, score-0.108]

24 The first novelty of this paper is to construct a function fσ,λ (which plays the role of a regularizing function in an error decomposition approach discussed in subsection 2. [sent-80, score-0.135]

25 The construction of the explicit approximation scheme for f_{σ,λ} is done under a Sobolev smoothness condition on a measurable function f_ρ^φ minimizing E^φ, that is, for a. [sent-85, score-0.141]

26 This shows difficulty in choosing fσ,λ and demonstrates novelty in choosing the function fσ,λ from Theorem 2 for the error analysis with a general loss φ. [sent-96, score-0.118]

27 2 Error Decomposition and Projection Operator The excess misclassification error R(sgn(f)) − R(f_c) for the classifier sgn(f) can be bounded by means of the excess generalization error E^φ(f) − E^φ(f_ρ^φ) according to some comparison theorems (Zhang, 2004; Chen et al. [sent-99, score-0.629]

28 For example, it was proved in Zhang (2004) that for φ = φh and any measurable function f : X → R, we have R (sgn( f )) − R ( fc ) ≤ E φh ( f ) − E φh ( fc ). [sent-102, score-0.519]

29 (2006) that for some c_φ > 0, R(sgn(f)) − R(f_c) ≤ c_φ √(E^φ(f) − E^φ(f_ρ^φ)). (12) [sent-105, score-0.249]

30 For the least square loss and ρ satisfying the Tsybakov noise condition, a comparison theorem improving (12) will be given in Section 4 and will be used to prove Theorem 1. [sent-106, score-0.159]

31 Then we can use (12) with f = π(f_z) to bound the excess misclassification error R(sgn(f_z)) − R(f_c) by means of the excess generalization error E^φ(π(f_z)) − E^φ(f_ρ^φ), which in turn can be estimated by an error decomposition technique (Wu and Zhou, 2006). [sent-113, score-2.579]

32 Lemma 1 Let φ be a classifying loss, fz be defined by (2) and fσ,λ ∈ Hσ . [sent-116, score-0.652]

33 Then E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ D(σ, λ) + S_z(f_{σ,λ}) − S_z(π(f_z)), (13) where the quantity S_z(f) is defined for f ∈ C(X) by S_z(f) = [E_z^φ(f) − E_z^φ(f_ρ^φ)] − [E^φ(f) − E^φ(f_ρ^φ)]. [sent-117, score-1.254]

34 When we use the regularizing function fσ,λ given in Theorem 2, the bound (10) deals with D (σ, λ), the first term of (13). [sent-118, score-0.109]

35 The crucial remaining term Sz (π( fz )) of (13) involves the set of functions { fz }z∈Z m and can be treated by various empirical process techniques such as Rademacher average and entropy integral. [sent-120, score-1.254]

36 Here we use the specialty of the Gaussians that the RKHS has low capacity, hence the last term of (13) can be estimated efficiently and simply by means of covering numbers. [sent-121, score-0.139]

37 Definition 4 For a subset S of C(X) and η > 0, the covering number N (S, η) is the minimal integer l ∈ N such that there exist l disks with radius η covering S. [sent-124, score-0.206]
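
Definition 4 only asks for counting η-balls, so for a finite family of functions sampled on a grid a greedy cover already gives a usable upper bound on N(S, η). The following sketch (sup-norm, illustrative names) is not the paper's construction, just a toy illustration of the definition:

```python
import numpy as np

def greedy_covering_number(F, eta):
    """Upper bound on the covering number N(S, eta) of a finite family
    S = {f_1, ..., f_N}, where each row of F holds one function's values
    on a fixed grid and distances are measured in the sup-norm."""
    F = np.asarray(F, dtype=float)
    uncovered = np.ones(len(F), dtype=bool)
    n_centers = 0
    while uncovered.any():
        center = F[np.flatnonzero(uncovered)[0]]     # pick any uncovered function
        n_centers += 1
        dist = np.max(np.abs(F - center), axis=1)    # sup-norm distance to the center
        uncovered &= dist > eta                      # points within eta are now covered
    return n_centers
```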

38 The covering numbers of the unit ball B_1(C^s(X)) of the space C^s(X) have the asymptotic behavior c′_s (1/η)^{n/s} ≤ log N(B_1(C^s(X)), η) ≤ c′′_s (1/η)^{n/s}, (14) where the positive constants c′_s and c′′_s are independent of 0 < η < 1. [sent-127, score-0.234]

39 In particular, since a Gaussian kernel K^σ is C^∞, an embedding result from Zhou (2003) tells us that log N(B_1, η) ≤ C′′_s (1/η)^{n/s} (1/σ)^{2n}, where s > 0 can be arbitrarily large but the constant C′′_s depends on s. [sent-128, score-0.182]

40 A crucial improved bound for the covering number of B_1 was given in Zhou (2002) with (1/η)^{n/s} replaced by (log(1/η))^{n+1} as follows. [sent-130, score-0.138]

41 Proposition 1 There exists a constant C_0 > 0 depending only on X and n such that log N(B_1, η) ≤ C_0 ( (log(1/η))^{n+1} + 1/σ^{2(n+1)} ) for all 0 < η < 1, 0 < σ ≤ 1. (15) [sent-131, score-0.104]

42 Bound (15) is almost sharp in the sense that for some C′_0 > 0 given in Zhou (2003), log N(B_1, η) ≥ C′_0 ( (log(1/η))^{n/2} + 1/σ^n ). [sent-133, score-0.104]

43 This enables us to derive efficient error bounds for the algorithm (2) involving Gaussian kernels, by a simple covering number argument without other empirical process techniques or iteration techniques used in Steinwart and Scovel (2007) and Wu et al. [sent-135, score-0.158]

44 … log(2/δ), (16) (17) where C_2 is a constant independent of m, λ, σ or δ. [sent-141, score-0.104]

45 4 Key Bounds We are in a position to present our key bounds for the excess generalization error E^φ(π(f_z)) − E^φ(f_ρ^φ) which will be used to get rates for the excess misclassification error R(sgn(f_z)) − R(f_c). [sent-143, score-1.957]

46 For φ = φ_h, we can take τ = 0, and an improved power τ = q/(q+1) if the Tsybakov noise condition (5) is satisfied (Steinwart and Scovel, 2007; Wu and Zhou, 2005). [sent-148, score-0.11]

47 Theorem 3 Let σ = λ^ζ and λ = m^{−γ} for some 0 < ζ < 1 and 0 < γ < …, and assume (8) holds for some s > 0; then for any 0 < δ < 1, with confidence 1 − δ we have E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ C m^{−θ} log(2/δ), where θ = min{sζγ, γ(1 − nζ), …} and C is a constant independent of m and δ. [sent-150, score-0.731]

48 Proof of Lemma 1 Write the regularized excess generalization error as E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ‖f_z‖²_{H_σ} = { E^φ(π(f_z)) − E_z^φ(π(f_z)) } + { E_z^φ(π(f_z)) + λ‖f_z‖²_{H_σ} − [E_z^φ(f_{σ,λ}) + λ‖f_{σ,λ}‖²_{H_σ}] } + { E_z^φ(f_{σ,λ}) − E^φ(f_{σ,λ}) } + { E^φ(f_{σ,λ}) − E^φ(f_ρ^φ) + λ‖f_{σ,λ}‖²_{H_σ} }. [sent-156, score-3.952]

49 This in connection with the definition of fz tells us that the second term on the right-hand side above is at most zero. [sent-159, score-0.668]

50 By φ φ subtracting and adding E φ ( fρ ) in the first and third terms we see E φ (π( fz )) − E φ ( fρ ) is bounded as in (13). [sent-160, score-0.627]

51 Let us turn to estimate E^φ(π(f_z)) − E^φ(f_ρ^φ) by (13). [sent-162, score-0.627]

52 For any 0 < δ < 1, with confidence 1 − δ/2, the term S_z(f_{σ,λ}) of (13) can be bounded as S_z(f_{σ,λ}) ≤ 2(‖φ‖_{C[−B,B]} + C_1)(log(2/δ)) m^{−1/(2−τ)} + E^φ(f_{σ,λ}) − E^φ(f_ρ^φ). [sent-166, score-0.104]

53 Solving the quadratic equation for ε by setting the above probability bound to be δ/2, we see that with confidence at least 1 − δ/2, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ (4‖φ‖_{C[−B,B]} log(2/δ))/(3m) + √(2mσ²(ξ) log(2/δ))/m. [sent-172, score-0.243]
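
For reference, the Bernstein-type step being solved here has, in standard form, the shape below (a sketch only; the paper's exact constants and norm bounds may differ):

```latex
\Pr\Big\{\tfrac{1}{m}\textstyle\sum_{i=1}^m \xi(z_i)-\mathbb{E}(\xi)\ \ge\ \varepsilon\Big\}
\ \le\ \exp\Big\{-\frac{m\varepsilon^{2}}{2\,\big(\sigma^{2}(\xi)+\tfrac{1}{3}M_{\xi}\varepsilon\big)}\Big\},
\qquad |\xi-\mathbb{E}(\xi)|\le M_{\xi},
```

and setting the right-hand side to δ/2, then solving the resulting quadratic in ε, yields

```latex
\tfrac{1}{m}\textstyle\sum_{i=1}^m \xi(z_i)-\mathbb{E}(\xi)
\ \le\ \frac{2M_{\xi}\log(2/\delta)}{3m}+\sqrt{\frac{2\sigma^{2}(\xi)\log(2/\delta)}{m}}\,.
```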

54 This in connection with Young's inequality implies √(2mσ²(ξ) log(2/δ))/m ≤ √( (2 C_1 (E(ξ))^τ log(2/δ))/m ) ≤ (1 − τ/2) ( (2 C_1 log(2/δ))/m )^{1/(2−τ)} + (τ/2) E(ξ). [sent-174, score-0.333]

55 Therefore, with confidence at least 1 − δ/2, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ (4‖φ‖_{C[−B,B]} log(2/δ))/(3m) + ( (2 C_1 log(2/δ))/m )^{1/(2−τ)} + (τ/2) E(ξ). [sent-175, score-0.208]

56 The sample error term −S_z(π(f_z)) in (13) can be expressed as ∫ ξ_z dρ − (1/m) ∑_{i=1}^m ξ_z(z_i) with ξ_z(z) = φ(y f_z(x)) − φ(y f_ρ^φ(x)). [sent-177, score-1.288]

57 Here we use the specialty of low capacity of the RKHS Hσ and overcome the difficulty by a simple covering number argument over a ball of Hσ where fz lies. [sent-180, score-0.827]

58 The proof follows easily by taking f = 0 in the definition of fz as in De Vito et al. [sent-182, score-0.627]

59 , 2007; Yao, 2008; Ying, 2007) with covering numbers for the ball { f ∈ H_σ : ‖f‖_{H_σ} ≤ √(φ(0)/λ) } of the RKHS H_σ, we find the following bound. [sent-189, score-0.13]

60 For any 0 < δ < 1, with confidence at least 1 − δ, we have E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ 4D(σ, λ) + 40ε*(m, λ, σ, δ/2) + 4(‖φ‖_{C[−B,B]} + C_1)(log(2/δ)) m^{−1/(2−τ)}. [sent-195, score-0.731]

61 Proof Applying Lemma 3, we know that there is a subset V_1 of Z^m with measure at least 1 − δ/2 such that for z ∈ V_1, S_z(f_{σ,λ}) ≤ 2(‖φ‖_{C[−B,B]} + C_1)(log(2/δ)) m^{−1/(2−τ)} + D(σ, λ). [sent-196, score-0.126]

62 Adding the above two bounds and observing that 0 ≤ τ ≤ 1 implies 1/(1 − τ/2) ≤ 2, we know from Lemma 1 that for z ∈ V_1 ∩ V_2, E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ 4D(σ, λ) + 40ε*(m, λ, σ, δ/2) + 4(‖φ‖_{C[−B,B]} + C_1)(log(2/δ)) m^{−1/(2−τ)}. [sent-199, score-0.774]

63 Deriving Learning Rates In this section we apply Theorem 3 to derive learning rates with various loss functions. [sent-208, score-0.11]

64 Proposition 3 If φ = φ_ls and ρ satisfies noise condition (5) for some q ∈ [0, ∞], then for every measurable function f : X → R, we have R(sgn(f)) − R(f_c) ≤ 2 C_q^{−q/(q+2)} ( E^{φ_ls}(f) − E^{φ_ls}(f_ρ) )^{(q+1)/(q+2)}. [sent-210, score-0.326]

65 Proof Denote X_f = {x ∈ X : sgn(f)(x) ≠ f_c(x)}. [sent-211, score-0.249]

66 It is known that R (sgn( f ))− R ( fc ) = See, for example, Equation (9. [sent-212, score-0.249]

67 Let us derive learning rates with the r-norm SVM loss φ = φr (1 < r < ∞) for which we have (Chen et al. [sent-231, score-0.11]

68 Then for any 0 < δ < 1, with confidence 1 − δ, we have R(sgn(f_z)) − R(f_c) ≤ C_{ρ,r} m^{−θ_r} log(2/δ) with θ_r = s/(2(s+2n+2)) if 1 < r ≤ 2, and θ_r = s/(4(s(1−1/r)+n+1)) if 2 < r < ∞. (22) [sent-236, score-0.98]

69 Proof The convexity of φ_r gives the variancing power (Bartlett et al. [sent-237, score-0.112]

70 This bound for the excess generalization error together with comparison relation (12), which applies since φ_r′′(0) = r(r − 1) > 0, yields the desired bound (22) for the excess misclassification error with the constant C_{ρ,r} = c_{φ_r} √C. [sent-241, score-0.45]

71 When φ = φ_eh, a simple computation shows that the function f_ρ^{φ_eh} is given by f_ρ^{φ_eh}(x) = (1/2) log( (1 + f_ρ(x)) / (1 − f_ρ(x)) ) if −(e² − 1)/(e² + 1) ≤ f_ρ(x) ≤ (e² − 1)/(e² + 1); f_ρ^{φ_eh}(x) = 1 if f_ρ(x) > (e² − 1)/(e² + 1); and f_ρ^{φ_eh}(x) = −1 if f_ρ(x) < −(e² − 1)/(e² + 1). (23) [sent-244, score-0.307]
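
Expression (23) can be evaluated directly from the regression function f_ρ; a small sketch (NumPy assumed, name illustrative):

```python
import numpy as np

def f_rho_eh(f_rho):
    """Target function f_rho^{phi_eh} of expression (23), computed
    elementwise from the regression function f_rho (values in [-1, 1])."""
    f_rho = np.asarray(f_rho, dtype=float)
    thr = (np.e ** 2 - 1.0) / (np.e ** 2 + 1.0)      # (e^2 - 1)/(e^2 + 1)
    out = np.empty_like(f_rho)
    mid = np.abs(f_rho) <= thr
    out[mid] = 0.5 * np.log((1.0 + f_rho[mid]) / (1.0 - f_rho[mid]))
    out[f_rho > thr] = 1.0
    out[f_rho < -thr] = -1.0
    return out
```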

72 Then for any 0 < δ < 1, with confidence 1 − δ, we have R(sgn(f_z)) − R(f_c) ≤ C_{ρ,eh} m^{−θ_eh} log(2/δ) with θ_eh = s/(…). [sent-248, score-0.98]

73 When s ≤ 1, the same condition for f_ρ implies assumption (8) for f_ρ^{φ_eh} needed for Theorem 5, as seen from expression (23). [sent-257, score-0.232]

74 It is possible to refine learning rates (22) and (24) by improving comparison relation (12) when Tsybakov noise condition (5) is satisfied. [sent-258, score-0.13]

75 We know from Smale and Zhou (2003) that when φ = φls the approximation error and hence learning rates can essentially be characterized by regularities of the function fρ . [sent-265, score-0.133]

76 However, the index s in regularity assumption (8) for the function f_ρ^φ might vary dramatically, leading to a varying power index θ for the learning rates. [sent-268, score-0.123]

77 The dependence of the function f_ρ^{φ_eh} on f_ρ has the advantage of ignoring any irregularity appearing in the domain where |f_ρ(x)| > (e² − 1)/(e² + 1). [sent-270, score-0.203]

78 This can be seen from the following example where f_ρ has a singularity at 0 while f_ρ^{φ_eh} ≡ 1 is C^∞. [sent-271, score-0.227]

79 So regularity assumption (8) is satisfied for f_ρ^{φ_ls} in the Sobolev space H^s if and only if s < α + 1/2. [sent-274, score-0.284]

80 Then from Theorem 1, we see the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_ls} log(2/δ)) with θ_ls = s/(s+2+2), arbitrarily close to (1+2α)/(9+2α). [sent-275, score-1.006]

81 However, for the exponential-hinge loss φ_eh, we have f_ρ^{φ_eh} ≡ 1, which follows from expression (23) and the definition f_ρ(x) = 1 − (1/5)|x|^α ≥ 1 − 1/5 > (e² − 1)/(e² + 1) on X. [sent-276, score-0.26]

82 Therefore, regularity assumption (8) is satisfied for an arbitrarily large s and Theorem 5 yields the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_eh} log(2/δ)) with θ_eh arbitrarily close to 1/2. [sent-277, score-1.036]

83 Then Theorem 4 yields the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_r} log(2/δ)) with θ_r = s/(4(s(1−1/r)+1+1)), arbitrarily close to 1/(4(1−1/r)). This power index is always less than that of φ_ls or φ_eh. [sent-284, score-1.059]

84 If X is a connected compact C^∞ submanifold of R^n without boundary and its dimension is d ≤ n, then the covering number estimate (15) holds with n replaced by the manifold dimension d. [sent-291, score-0.128]

85 Learning rates in Theorems 1, 4 and 5 can be improved with n replaced by d if approximation error estimates similar to Theorem 2 can be established in the manifold setting. [sent-293, score-0.136]

86 In our analysis we assume that the convex loss φ has a zero, which excludes the logistic loss φ(t) = log(1 + e^{−t}). [sent-303, score-0.114]

87 One might generalize our analysis to get some error bounds for the scheme with loss functions without zero by using a general projection operator π_M with level M > 0 given by π_M(f)(x) = M if f(x) > M, −M if f(x) < −M, and f(x) if −M ≤ f(x) ≤ M. [sent-304, score-0.175]
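
The projection operator π_M is just a coordinatewise truncation, so a one-line sketch suffices (np.clip reproduces the piecewise definition above; with M = 1 this is the projection π used in the error decomposition):

```python
import numpy as np

def project(f_values, M=1.0):
    """General projection operator pi_M with level M > 0:
    pi_M(f)(x) = M if f(x) > M, -M if f(x) < -M, and f(x) otherwise."""
    return np.clip(f_values, -M, M)
```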

88 This bound in connection with (30) and (31) implies (26) with the constant B given by φ ′ B = max Cs f˜ρ φ f˜ρ ′ 2 L∞ (Rn ) , (Cs ) n 2 −2 , L2 (Rn ) (2π) φ ′ sup {|φ′ (ξ)| : |ξ| ≤ (Cs + 1) f˜ρ + φ (2) Without the condition f˜ρ ∈ L∞ (Rn ), we bound fσ,λ f˜σ . [sent-339, score-0.141]

89 With this estimate, under Tsybakov noise condition (5), learning rates are obtained in Steinwart and Scovel (2007). [sent-356, score-0.13]

90 For example, when α > (q+2)/(2q), for an arbitrarily small ε > 0, with confidence 1 − δ, R(sgn(f_z)) − R(f_c) ≤ C_ε (log(4/δ))² (1/m)^{2α(q+1)/(2α(q+2)+3q+4) − ε}. (33) [sent-357, score-0.13]

91 Since no Sobolev smoothness is assumed for f_ρ^{φ_h} = f_c (Wahba, 1990), we need to use the regularizing function f_{σ,λ} defined by (7) and derive by some detailed computations that with confidence 1 − δ, R(sgn(f_z)) − R(f_c) ≤ C_{ρ,h} log(4/δ) (1/m)^{(q+1)αn/((q+2)αn+2(q+1)(n+1))}. [sent-358, score-0.373]

92 This rate is slightly worse than (33), though the estimate for the confidence is slightly better. [sent-359, score-0.98]

93 Role of Tight Bounds for Covering Numbers In this appendix we prove Lemma 2 which shows a special role of the tight bound (15) for covering numbers concerning Gaussian kernels. [sent-362, score-0.185]

94 Proof Observe from (15) that, as a function on (0, +∞), the logarithm of the middle term of (16) is bounded by h(ε) := C_0 ( (log( φ(0)|φ′(−1)| / (√λ ε) ))^{n+1} + 1/σ^{2(n+1)} ) − g(ε), where g is the strictly increasing function on (0, ∞) defined by g(ε) = mε^{2−τ} / (2C_1 + (2/3)φ(−1)ε^{1−τ}). [sent-366, score-0.104]

95 Set B* = φ(0)|φ′(−1)|/(√λ m^Δ) + ( (4C_1 (log(2/δ) + C_0/σ^{2(n+1)}) + 4C_0C_1 (Δ log m)^{n+1}) / m )^{1/(2−τ)} + (4φ(−1)/(3m)) ( log(2/δ) + C_0/σ^{2(n+1)} + C_0 (Δ log m)^{n+1} ). [sent-367, score-0.312]

96 3m δ σ 2 If 3 φ(−1)B 1−τ ≤ 2C1 , then g(B ) ≥ 2 C0 mB 2−τ ≥ log + 2(n+1) +C0 (∆ log m)n+1 . [sent-368, score-0.208]

97 4C1 δ σ 2 If 3 φ(−1)B 1−τ > 2C1 , then g(B ) ≥ mB 2−τ 4 1−τ 3 φ(−1)B = mB 4 3 φ(−1) 2 C0 ≥ log + 2(n+1) +C0 (∆ log m)n+1 . [sent-369, score-0.208]

98 Thus in either case we have g(B*) ≥ log(2/δ) + C_0/σ^{2(n+1)} + C_0(Δ log m)^{n+1}. [sent-370, score-0.208]

99 On the other hand, since B* ≥ φ(0)|φ′(−1)|/(√λ m^Δ), we also see that log( φ(0)|φ′(−1)|/(√λ B*) ) ≤ Δ log m. It follows that h(B*) ≤ C_0(Δ log m)^{n+1} + C_0/σ^{2(n+1)} − g(B*) ≤ C_0(Δ log m)^{n+1} − log(2/δ) − C_0(Δ log m)^{n+1} = log(δ/2). [sent-372, score-0.52]

100 Then we know from the special form (3) of λ and σ that ε*(m, λ, σ, δ/2) ≤ C_2 ( (log(2/δ)/m)^{1/(2−τ)} + (1/m^{1−2γζ(n+1)})^{1/(2−τ)} + ((log m)^{n+1}/m)^{1/(2−τ)} + φ(0)/m ). [sent-380, score-0.126]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fz', 0.627), ('rn', 0.256), ('ls', 0.254), ('fc', 0.249), ('eh', 0.203), ('sgn', 0.164), ('cqt', 0.156), ('excess', 0.156), ('aussians', 0.142), ('hou', 0.133), ('iang', 0.133), ('sz', 0.12), ('tsybakov', 0.108), ('ez', 0.106), ('zhou', 0.104), ('log', 0.104), ('covering', 0.103), ('scovel', 0.096), ('cs', 0.093), ('steinwart', 0.082), ('vito', 0.076), ('lassification', 0.074), ('regularizing', 0.074), ('sobolev', 0.074), ('cq', 0.072), ('wu', 0.065), ('misclassi', 0.064), ('gaussians', 0.059), ('loss', 0.057), ('variancing', 0.057), ('theorem', 0.054), ('dence', 0.053), ('rates', 0.053), ('ying', 0.052), ('smoothness', 0.05), ('smale', 0.049), ('rkhs', 0.048), ('noise', 0.048), ('tight', 0.047), ('tikhonov', 0.043), ('gaussian', 0.043), ('probz', 0.043), ('scheme', 0.038), ('dy', 0.037), ('specialty', 0.036), ('xiang', 0.036), ('regularization', 0.036), ('dx', 0.035), ('kernels', 0.035), ('bound', 0.035), ('capacity', 0.034), ('error', 0.034), ('power', 0.033), ('cucker', 0.033), ('kernel', 0.032), ('reproducing', 0.032), ('zi', 0.031), ('regularity', 0.03), ('young', 0.03), ('condition', 0.029), ('lemma', 0.028), ('hardin', 0.028), ('proposition', 0.028), ('chen', 0.028), ('kong', 0.028), ('ball', 0.027), ('novelty', 0.027), ('fourier', 0.026), ('arbitrarily', 0.026), ('classifying', 0.025), ('operator', 0.025), ('svm', 0.025), ('yao', 0.025), ('hong', 0.025), ('manifold', 0.025), ('singularity', 0.024), ('cityu', 0.024), ('edmunds', 0.024), ('pan', 0.024), ('schaback', 0.024), ('xf', 0.024), ('approximation', 0.024), ('ye', 0.023), ('exponent', 0.023), ('schemes', 0.023), ('nice', 0.023), ('know', 0.022), ('convexity', 0.022), ('bounds', 0.021), ('proved', 0.021), ('evgeniou', 0.021), ('mb', 0.021), ('mathematics', 0.021), ('sup', 0.021), ('connection', 0.021), ('varying', 0.02), ('al', 0.02), ('tells', 0.02), ('index', 0.02), ('classi', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

2 0.21914919 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

Author: Ting Hu, Ding-Xuan Zhou

Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

3 0.065112539 18 jmlr-2009-Consistency and Localizability

Author: Alon Zakai, Ya'acov Ritov

Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification

4 0.06338881 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

Author: Christian Rieger, Barbara Zwicknagl

Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space

5 0.061070304 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang

Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semiinner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-innerproducts, representer theorems

6 0.058052525 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

7 0.056242328 78 jmlr-2009-Refinement of Reproducing Kernels

8 0.052882899 46 jmlr-2009-Learning Halfspaces with Malicious Noise

9 0.048498519 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

10 0.048269659 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

11 0.047482874 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

12 0.04518434 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

13 0.037697248 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

14 0.035420839 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

15 0.035253838 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

16 0.034810118 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

17 0.034494106 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

18 0.032720163 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

19 0.029770151 38 jmlr-2009-Hash Kernels for Structured Data

20 0.028786214 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.085), (2, 0.024), (3, -0.021), (4, -0.26), (5, 0.1), (6, 0.068), (7, -0.016), (8, 0.136), (9, 0.218), (10, 0.015), (11, -0.128), (12, -0.101), (13, -0.025), (14, 0.08), (15, -0.134), (16, -0.082), (17, -0.038), (18, 0.115), (19, 0.028), (20, 0.093), (21, -0.005), (22, 0.11), (23, -0.113), (24, -0.146), (25, 0.009), (26, -0.123), (27, 0.052), (28, 0.116), (29, 0.021), (30, -0.051), (31, 0.022), (32, 0.053), (33, -0.025), (34, 0.194), (35, 0.172), (36, -0.056), (37, 0.097), (38, 0.128), (39, 0.036), (40, -0.089), (41, -0.209), (42, 0.167), (43, -0.04), (44, -0.093), (45, 0.052), (46, 0.026), (47, 0.034), (48, -0.139), (49, 0.137)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95242751 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

2 0.59031111 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

Author: Ting Hu, Ding-Xuan Zhou

Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

3 0.4087514 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

Author: Christian Rieger, Barbara Zwicknagl

Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space

4 0.36527109 46 jmlr-2009-Learning Halfspaces with Malicious Noise

Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio

Abstract: We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε2 / log(n/ε)). (The best previous result was Ω(ε/(n log(n/ε))1/4 ).) • Learning origin-centered halfspaces with respect to any isotropic logconcave distribution on Rn with malicious noise rate η = Ω(ε3 / log2 (n/ε)). This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε3 / log(1/ε)). In the adversarial label noise setting (or agnostic model), labels can be noisy, but not example points themselves. Previous results could handle η = Ω(ε) but had running time exponential in an unspecified function of 1/ε. Our analysis crucially exploits both concentration and anti-concentration properties of isotropic log-concave distributions. Our algorithms combine an iterative outlier removal procedure using Principal Component Analysis together with “smooth” boosting. Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers c 2009 Adam R. Klivans, Philip M. Long and Rocco A. Servedio. K LIVANS , L ONG AND S ERVEDIO

5 0.35510096 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

Author: Tong Zhang

Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity

6 0.32601598 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

7 0.31598246 18 jmlr-2009-Consistency and Localizability

8 0.30510834 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

9 0.26769146 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

10 0.2432785 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

11 0.21528195 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

12 0.20284227 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

13 0.20057383 78 jmlr-2009-Refinement of Reproducing Kernels

14 0.19946413 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

15 0.19678542 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations

16 0.18196484 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

17 0.17528544 13 jmlr-2009-Bounded Kernel-Based Online Learning

18 0.17060424 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

19 0.16945738 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models    (Machine Learning Open Source Software Paper)

20 0.16721936 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.013), (21, 0.01), (38, 0.011), (52, 0.029), (55, 0.053), (58, 0.014), (66, 0.641), (90, 0.034), (96, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99341965 49 jmlr-2009-Learning Permutations with Exponential Weights

Author: David P. Helmbold, Manfred K. Warmuth

Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing

same-paper 2 0.9920606 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

3 0.99063396 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes

Author: Jan Ramon, Siegfried Nijssen

Abstract: Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation schema, and introduce an algorithm that can be used to enumerate any class of graphs with polynomial delay, as long as the class of graphs can be described using a monotonic predicate operating on a dense augmentation schema. In practice this means that this is the first enumeration algorithm that can be applied theoretically efficiently in any frequent subgraph mining algorithm, and that this algorithm generalizes to situations beyond the standard frequent subgraph mining setting. Keywords: graph mining, enumeration, monotonic graph classes

4 0.97303724 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

Author: Roberto Esposito, Daniele P. Radicioni

Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference

5 0.97108716 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

Author: Jun Zhu, Eric P. Xing

Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural maxentropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the wellknown maximum margin Markov networks (M3 N) as a special case, and leads to a model similar to an L1 -regularized M3 N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M3 N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov network for structured input/output problems; and the basic principle can be generalized to learning arbi

6 0.87324774 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

7 0.85920042 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

8 0.8365398 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

9 0.83259284 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

10 0.83177561 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

11 0.82081074 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

12 0.81046623 78 jmlr-2009-Refinement of Reproducing Kernels

13 0.80924863 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

14 0.80777073 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

15 0.80209577 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

16 0.8003335 46 jmlr-2009-Learning Halfspaces with Malicious Noise

17 0.79660535 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

18 0.78997338 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

19 0.78705662 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations

20 0.78620809 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost