jmlr jmlr2009 jmlr2009-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
Reference: text
sentIndex sentText sentNum sentScore
1 Our main goal is to provide fast convergence rates for the excess misclassification error. [sent-5, score-0.209]
2 Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. [sent-6, score-0.178]
3 Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. [sent-7, score-0.314]
4 The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. [sent-8, score-0.308]
5 Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation 1. [sent-10, score-0.211]
6 Introduction In this paper we study binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. [sent-11, score-0.136]
7 The classifier minimizing the misclassification error is called the Bayes rule f_c and is given by f_c(x) = 1 if P(y = 1|x) ≥ P(y = −1|x), and f_c(x) = −1 otherwise. [sent-17, score-0.532]
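A minimal sketch of the Bayes rule as a plug-in classifier, assuming the conditional probability P(y = 1|x) is available as an array (the function name and vectorized form are illustrative, not from the paper):

import numpy as np

def bayes_rule(p_plus):
    # f_c(x) = 1 where P(y=1|x) >= P(y=-1|x), i.e. where P(y=1|x) >= 0.5, else -1.
    p_plus = np.asarray(p_plus, dtype=float)
    return np.where(p_plus >= 0.5, 1.0, -1.0)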
8 The performance of a classifier C can be measured by the excess misclassification error R (C ) − R ( fc ). [sent-18, score-0.439]
9 functions are generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. [sent-21, score-0.136]
10 Examples of classifying loss functions include the least-square loss φ_ls(t) = (1 − t)^2, the hinge loss φ_h(t) = (1 − t)_+ = max{1 − t, 0} for support vector machine (SVM) algorithms, and the r-norm SVM loss with 1 ≤ r < ∞ defined by φ_r(t) = (φ_h(t))^r. [sent-23, score-0.253]
11 With the loss φ and Gaussian kernel K σ , the Tikhonov regularization scheme is defined (Wahba, 1990; Evgeniou et al. [sent-26, score-0.163]
12 , 2000; Cristianini and Shawe-Taylor, 2000) with a sample z = {(x_i, y_i)}_{i=1}^m ∈ Z^m as the solution f_z = f^φ_{z,σ,λ} to the following minimization problem: f_z = arg min_{f ∈ H_σ} { (1/m) ∑_{i=1}^m φ(y_i f(x_i)) + λ ‖f‖²_{H_σ} }. [sent-27, score-1.254]
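A minimal computational sketch of scheme (2), assuming a differentiable convex loss and using the representer theorem to write f_z as a kernel expansion; the kernel-width convention, step size, and function names are illustrative assumptions, not taken from the paper:

import numpy as np

def gaussian_gram(X, sigma):
    # K_sigma(x, x') = exp(-|x - x'|^2 / sigma^2); this width convention is an assumption.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def tikhonov_fit(X, y, dphi, sigma, lam, lr=1e-2, n_iter=5000):
    # By the representer theorem f_z = sum_j c_j K_sigma(x_j, .), so we run plain
    # gradient descent on J(c) = (1/m) sum_i phi(y_i (K c)_i) + lam * c^T K c.
    m = len(y)
    K = gaussian_gram(X, sigma)
    c = np.zeros(m)
    for _ in range(n_iter):
        margins = y * (K @ c)
        grad = K @ (y * dphi(margins)) / m + 2.0 * lam * (K @ c)
        c -= lr * grad
    return c, K

# Example with the least-square loss phi_ls(t) = (1 - t)^2, whose derivative is -2(1 - t).
dphi_ls = lambda t: -2.0 * (1.0 - t)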
13 The purpose of this paper is to estimate the excess misclassification error R (sgn( fz )) − R ( fc ) as m → ∞. [sent-30, score-1.066]
14 Convergence rates will be derived under the choice of the parameters λ = λ(m) = m^{−γ}, σ = σ(m) = λ^ζ = m^{−γζ} (3) for some γ, ζ > 0 and conditions on the distribution ρ and the loss φ. [sent-31, score-0.11]
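The parameter schedule (3) is straightforward to write down; a tiny illustrative helper (the function name is my own):

def schedule(m, gamma, zeta):
    lam = m ** (-gamma)      # lambda(m) = m^{-gamma}
    sigma = lam ** zeta      # sigma(m) = lambda^zeta = m^{-gamma * zeta}
    return lam, sigma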
15 Let us demonstrate our main results by stating learning rates for the least-square loss φ = φls . [sent-35, score-0.11]
16 They are given by means of a Tsybakov noise condition (Tsybakov, 2004) and a function smoothness condition stated in terms of Sobolev spaces. [sent-37, score-0.156]
17 In general, noise condition (5) does not require smoothness of fρ in domains away from the decision boundary. [sent-59, score-0.127]
18 Note that as t → −∞, the hinge loss φ_h for the SVM studied in Steinwart and Scovel (2007) increases slowly: φ_h(t) = O(|t|), while the least-square loss φ_ls in Theorem 1 increases moderately with φ_ls(t) = O(|t|^2). [sent-60, score-0.114]
19 Difficulty arises for the error analysis with a general loss φ when φ(t) increases fast, such as φ = φ_r with very large r or the exponential-hinge loss we introduce in this paper as φ_eh(t) = max{e^{1−t} − 1, 0}, that is, φ_eh(t) = e^{1−t} − 1 if t ≤ 1 and φ_eh(t) = 0 otherwise. [sent-61, score-0.148]
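For concreteness, the classifying losses mentioned above can be written out directly; a small sketch (the vectorized forms and names are my own):

import numpy as np

def phi_ls(t):            # least-square loss
    return (1.0 - t) ** 2

def phi_h(t):             # hinge loss
    return np.maximum(1.0 - t, 0.0)

def phi_r(t, r):          # r-norm SVM loss, 1 <= r < infinity
    return np.maximum(1.0 - t, 0.0) ** r

def phi_eh(t):            # exponential-hinge loss: e^{1-t} - 1 for t <= 1, 0 otherwise
    t = np.asarray(t, dtype=float)
    return np.where(t <= 1.0, np.exp(1.0 - t) - 1.0, 0.0)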
20 In particular, explicit learning rates will be given in Section 4 for the r-norm SVM loss φr (Theorem 4) and the exponential-hinge loss φeh (Theorem 5). [sent-65, score-0.167]
21 Compared with Theorem 1, we shall provide at the end of Section 4 an approximation theory viewpoint on the effect of various loss functions for learning algorithm (2): the exponential-hinge loss has some advantages over φ_ls and φ_r, while the r-norm SVM loss φ_r may have worse performance when r > 2. [sent-66, score-0.195]
22 Two Special Properties of Gaussians and Key Bounds The novelty in our approach for general φ and kernels K σ arises from two special properties of the Gaussian kernels with changing variance σ > 0: nice approximation scheme and low capacity of the RKHS, described in Sections 2. [sent-69, score-0.216]
23 When φ(t) increases fast (as t → −∞), applying the regularizing function fσ,λ in the error analysis (described in Section 2. [sent-78, score-0.108]
24 The first novelty of this paper is to construct a function fσ,λ (which plays the role of a regularizing function in an error decomposition approach discussed in subsection 2. [sent-80, score-0.135]
25 The construction of the explicit approximation scheme for f_{σ,λ} is done under a Sobolev smoothness condition on a measurable function f_ρ^φ minimizing E^φ, that is, for a. [sent-85, score-0.141]
26 This shows the difficulty in choosing f_{σ,λ} and demonstrates the novelty of choosing the function f_{σ,λ} from Theorem 2 for the error analysis with a general loss φ. [sent-96, score-0.118]
27 2 Error Decomposition and Projection Operator The excess misclassification error R(sgn(f)) − R(f_c) for the classifier sgn(f) can be bounded by means of the excess generalization error E^φ(f) − E^φ(f_ρ^φ) according to some comparison theorems (Zhang, 2004; Chen et al. [sent-99, score-0.629]
28 For example, it was proved in Zhang (2004) that for φ = φh and any measurable function f : X → R, we have R (sgn( f )) − R ( fc ) ≤ E φh ( f ) − E φh ( fc ). [sent-102, score-0.519]
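The hinge-loss comparison inequality above can be checked numerically by working conditionally on x, since both excess risks are integrals over x of pointwise quantities; a small sanity-check sketch (the discretization and variable names are my own, not the paper's):

import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)              # P(y = 1 | x) at 1000 hypothetical points x
f = rng.uniform(-3.0, 3.0, size=1000)   # values of an arbitrary measurable f at those points

phi_h = lambda t: np.maximum(1.0 - t, 0.0)
f_c = np.where(p >= 0.5, 1.0, -1.0)

# Pointwise excess misclassification risk and excess hinge risk at each x;
# integrating either against rho_X preserves the inequality, so a pointwise check suffices.
excess_miscl = np.abs(2.0 * p - 1.0) * (np.sign(f) != f_c)
excess_hinge = (p * phi_h(f) + (1.0 - p) * phi_h(-f)) - (p * phi_h(f_c) + (1.0 - p) * phi_h(-f_c))
assert np.all(excess_miscl <= excess_hinge + 1e-12)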
29 (2006) that for some c_φ > 0, R(sgn(f)) − R(f_c) ≤ c_φ (E^φ(f) − E^φ(f_ρ^φ))^{1/2}. [sent-105, score-0.249]
30 (12) For the least square loss and ρ satisfying the Tsybakov noise condition, a comparison theorem improving (12) will be given in Section 4 and will be used to prove Theorem 1. [sent-106, score-0.159]
31 Then we can use (12) with f = π(f_z) to bound the excess misclassification error R(sgn(f_z)) − R(f_c) by means of the excess generalization error E^φ(π(f_z)) − E^φ(f_ρ^φ), which in turn can be estimated by an error decomposition technique (Wu and Zhou, 2006). [sent-113, score-2.579]
32 Lemma 1 Let φ be a classifying loss, fz be defined by (2) and fσ,λ ∈ Hσ . [sent-116, score-0.652]
33 Then E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ D(σ, λ) + S_z(f_{σ,λ}) − S_z(π(f_z)), (13) where the quantity S_z(f) is defined for f ∈ C(X) by S_z(f) = [E_z^φ(f) − E_z^φ(f_ρ^φ)] − [E^φ(f) − E^φ(f_ρ^φ)]. [sent-117, score-1.254]
34 When we use the regularizing function fσ,λ given in Theorem 2, the bound (10) deals with D (σ, λ), the first term of (13). [sent-118, score-0.109]
35 The crucial remaining term Sz (π( fz )) of (13) involves the set of functions { fz }z∈Z m and can be treated by various empirical process techniques such as Rademacher average and entropy integral. [sent-120, score-1.254]
36 Here we use the special property of Gaussians that the RKHS has low capacity; hence the last term of (13) can be estimated efficiently and simply by means of covering numbers. [sent-121, score-0.139]
37 Definition 4 For a subset S of C(X) and η > 0, the covering number N (S, η) is the minimal integer l ∈ N such that there exist l disks with radius η covering S. [sent-124, score-0.206]
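As an illustration of Definition 4, a covering of a finite family of functions sampled on a grid of X can be estimated greedily in the sup norm; this is only a rough upper estimate and the setup is my own, not part of the paper:

import numpy as np

def greedy_covering_estimate(F, eta):
    # Rows of F are function values on a fixed grid of X; distances are sup-norm over the grid.
    F = np.asarray(F, dtype=float)
    uncovered = np.ones(len(F), dtype=bool)
    count = 0
    while uncovered.any():
        center = F[np.argmax(uncovered)]            # pick any still-uncovered function as a disk center
        dist = np.max(np.abs(F - center), axis=1)
        uncovered &= dist > eta                     # everything within eta of the center is now covered
        count += 1
    return count                                    # number of eta-disks used: an upper estimate of N(S, eta)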
38 The covering numbers of the unit ball B_1(C^s(X)) of the space C^s(X) have the asymptotic behavior c'_s (1/η)^{n/s} ≤ log N(B_1(C^s(X)), η) ≤ c''_s (1/η)^{n/s}, (14) where the positive constants c'_s and c''_s are independent of 0 < η < 1. [sent-127, score-0.234]
39 In particular, since a Gaussian kernel K^σ is C^∞, an embedding result from Zhou (2003) tells us that log N(B_1, η) ≤ C''_s (1/η)^{n/s} (1/σ)^{2n}, where s > 0 can be arbitrarily large but the constant C''_s depends on s. [sent-128, score-0.182]
40 A crucial improved bound for the covering number of B_1 was given in Zhou (2002) with (1/η)^{n/s} replaced by (log(1/η))^{n+1} as follows. [sent-130, score-0.138]
41 Proposition 1 There exists a constant C_0 > 0 depending only on X and n such that log N(B_1, η) ≤ C_0 ( (log(1/η))^{n+1} + 1/σ^{2(n+1)} ) for all 0 < η < 1, 0 < σ ≤ 1. [sent-131, score-0.104]
42 Bound (15) is almost sharp in the sense that for some C'_0 > 0 given in Zhou (2003), log N(B_1, η) ≥ C'_0 ( (log(1/η))^{n/2} + 1/σ^n ). [sent-133, score-0.104]
43 This enables us to derive efficient error bounds for the algorithm (2) involving Gaussian kernels, by a simple covering number argument without other empirical process techniques or iteration techniques used in Steinwart and Scovel (2007) and Wu et al. [sent-135, score-0.158]
44 … log(2/δ) … (16) (17) where C_2 is a constant independent of m, λ, σ, or δ. [sent-141, score-0.104]
45 4 Key Bounds We are in a position to present our key bounds for the excess generalization error E^φ(π(f_z)) − E^φ(f_ρ^φ) which will be used to get rates for the excess misclassification error R(sgn(f_z)) − R(f_c). [sent-143, score-1.957]
46 For φ = φ_h, we can take τ = 0, and an improved power τ = q/(q+1) if the Tsybakov noise condition (5) is satisfied (Steinwart and Scovel, 2007; Wu and Zhou, 2005). [sent-148, score-0.11]
47 Theorem 3 Let σ = λ^ζ and λ = m^{−γ} for some 0 < ζ < 1 and 0 < γ < …, and suppose (8) holds for some s > 0; then for any 0 < δ < 1, with confidence 1 − δ we have E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ C m^{−θ} log(2/δ), where θ = min{sζγ, γ(1 − nζ), …} and C is a constant independent of m and δ. [sent-150, score-0.731]
48 1 Proof of Lemma 1 Write the regularized excess generalization error as E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ‖f_z‖²_{H_σ} = [E^φ(π(f_z)) − E_z^φ(π(f_z))] + [E_z^φ(π(f_z)) + λ‖f_z‖²_{H_σ} − E_z^φ(f_{σ,λ}) − λ‖f_{σ,λ}‖²_{H_σ}] + [E_z^φ(f_{σ,λ}) − E^φ(f_{σ,λ})] + [E^φ(f_{σ,λ}) − E^φ(f_ρ^φ) + λ‖f_{σ,λ}‖²_{H_σ}]. [sent-156, score-3.952]
49 This in connection with the definition of fz tells us that the second term on the right-hand side above is at most zero. [sent-159, score-0.668]
50 By subtracting and adding E^φ(f_ρ^φ) in the first and third terms we see E^φ(π(f_z)) − E^φ(f_ρ^φ) is bounded as in (13). [sent-160, score-0.627]
51 Let us turn to estimate E^φ(π(f_z)) − E^φ(f_ρ^φ) by (13). [sent-162, score-0.627]
52 For any 0 < δ < 1, with confidence 1 − δ/2, the term S_z(f_{σ,λ}) of (13) can be bounded as S_z(f_{σ,λ}) ≤ 2(‖φ‖_{C[−B,B]} + C_1) (log(2/δ)) m^{−1/(2−τ)} + E^φ(f_{σ,λ}) − E^φ(f_ρ^φ). [sent-166, score-0.104]
53 Solving the quadratic equation for ε by setting the above probability bound to be δ/2, we see that with confidence at least 1 − δ/2, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ 4‖φ‖_{C[−B,B]} log(2/δ) / (3m) + √(2m σ²(ξ) log(2/δ)) / m. [sent-172, score-0.243]
54 This in connection with Young's inequality implies √(2m σ²(ξ) log(2/δ)) / m ≤ √( 2C_1 (E(ξ))^τ log(2/δ) / m ) ≤ (1 − τ/2) ( 2C_1 log(2/δ) / m )^{1/(2−τ)} + (τ/2) E(ξ). [sent-174, score-0.333]
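For readability, here is the elementary step presumably being used (a sketch for 0 < τ ≤ 1, writing L = log(2/δ) and using the variance bound σ²(ξ) ≤ C_1(E(ξ))^τ coming from the variancing power):

\sqrt{\frac{2C_1 (E(\xi))^{\tau} L}{m}}
  = \Big(\frac{2C_1 L}{m}\Big)^{1/2}\,\big(E(\xi)\big)^{\tau/2}
  \le \Big(1-\frac{\tau}{2}\Big)\Big(\frac{2C_1 L}{m}\Big)^{\frac{1}{2-\tau}} + \frac{\tau}{2}\,E(\xi),

by Young's inequality ab \le a^p/p + b^q/q applied with a = (2C_1 L/m)^{1/2}, b = (E(\xi))^{\tau/2}, p = 2/(2-\tau), q = 2/\tau.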
55 Therefore, with confidence at least 1 − δ/2, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ 4‖φ‖_{C[−B,B]} log(2/δ) / (3m) + ( 2C_1 log(2/δ) / m )^{1/(2−τ)} + E(ξ). [sent-175, score-0.208]
56 The sample error term −S_z(π(f_z)) in (13) can be expressed as ∫_Z ξ_z dρ − (1/m) ∑_{i=1}^m ξ_z(z_i) with ξ_z(z) = φ(y π(f_z)(x)) − φ(y f_ρ^φ(x)). [sent-177, score-1.288]
57 Here we exploit the low capacity of the RKHS H_σ and overcome the difficulty by a simple covering number argument over a ball of H_σ in which f_z lies. [sent-180, score-0.827]
58 The proof follows easily by taking f = 0 in the definition of fz as in De Vito et al. [sent-182, score-0.627]
59 , 2007; Yao, 2008; Ying, 2007) with covering numbers for the ball { f ∈ H_σ : ‖f‖_{H_σ} ≤ √(φ(0)/λ) } of the RKHS H_σ, we find the following bound. [sent-189, score-0.13]
60 For any 0 < δ < 1, with confidence at least 1 − δ, we have E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ 4D(σ, λ) + 40ε*(m, λ, σ, δ/2) + 4(‖φ‖_{C[−B,B]} + C_1) (log(2/δ)) m^{−1/(2−τ)}. [sent-195, score-0.731]
61 Proof Applying Lemma 3, we know that there is a subset V_1 of Z^m with measure at least 1 − δ/2 such that for z ∈ V_1, S_z(f_{σ,λ}) ≤ 2(‖φ‖_{C[−B,B]} + C_1) (log(2/δ)) m^{−1/(2−τ)} + D(σ, λ). [sent-196, score-0.126]
62 Adding the above two bounds and observing that 0 ≤ τ ≤ 1 implies 1/(1 − τ/2) ≤ 2, we know from Lemma 1 that for z ∈ V_1 ∩ V_2, E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ 4D(σ, λ) + 40ε*(m, λ, σ, δ/2) + 4(‖φ‖_{C[−B,B]} + C_1) (log(2/δ)) m^{−1/(2−τ)}. [sent-199, score-0.774]
63 Deriving Learning Rates In this section we apply Theorem 3 to derive learning rates with various loss functions. [sent-208, score-0.11]
64 Proposition 3 If φ = φ_ls and ρ satisfies noise condition (5) for some q ∈ [0, ∞], then for every measurable function f : X → R, we have R(sgn(f)) − R(f_c) ≤ 2 C_q^{−q/(q+2)} ( E^{φ_ls}(f) − E^{φ_ls}(f_ρ) )^{(q+1)/(q+2)}. [sent-210, score-0.326]
65 Proof Denote X_f = {x ∈ X : sgn(f)(x) ≠ f_c(x)}. [sent-211, score-0.249]
66 It is known that R(sgn(f)) − R(f_c) = ∫_{X_f} |f_ρ(x)| dρ_X. See, for example, Equation (9. [sent-212, score-0.249]
67 Let us derive learning rates with the r-norm SVM loss φ = φr (1 < r < ∞) for which we have (Chen et al. [sent-231, score-0.11]
68 Then for any 0 < δ < 1, with confidence 1 − δ, we have R(sgn(f_z)) − R(f_c) ≤ C_{ρ,r} m^{−θ_r} log(2/δ) with θ_r = s/(2(s + 2n + 2)) if 1 < r ≤ 2, and θ_r = s/(4(s(1 − 1/r) + n + 1)) if 2 < r < ∞. [sent-236, score-0.98]
69 (22) Proof The convexity of φr gives the variancing power (Bartlett et al. [sent-237, score-0.112]
70 This bound for the excess generalization error together with comparison relation (12), which applies since φ_r''(0) = r(r − 1) > 0, yields the desired bound (22) for the excess misclassification error with the constant C_{ρ,r} = c_{φ_r}√C. [sent-241, score-0.45]
71 When φ = φ_eh, a simple computation shows that the function f_ρ^{φ_eh} is given by f_ρ^{φ_eh}(x) = (1/2) log( (1 + f_ρ(x)) / (1 − f_ρ(x)) ) if −(e² − 1)/(e² + 1) ≤ f_ρ(x) ≤ (e² − 1)/(e² + 1), f_ρ^{φ_eh}(x) = 1 if f_ρ(x) > (e² − 1)/(e² + 1), and f_ρ^{φ_eh}(x) = −1 if f_ρ(x) < −(e² − 1)/(e² + 1). [sent-244, score-0.307]
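This closed form can be sanity-checked numerically by minimizing the conditional φ_eh-risk directly; a small sketch assuming SciPy is available (the grid of f_ρ values and names are my own):

import numpy as np
from scipy.optimize import minimize_scalar

def phi_eh(t):
    return np.exp(1.0 - t) - 1.0 if t <= 1.0 else 0.0

def conditional_risk(f, p):
    # E[phi_eh(y f) | x] when P(y = 1 | x) = p
    return p * phi_eh(f) + (1.0 - p) * phi_eh(-f)

def f_rho_eh(frho):
    # Closed form (23), with threshold (e^2 - 1)/(e^2 + 1).
    thr = (np.e ** 2 - 1.0) / (np.e ** 2 + 1.0)
    if frho > thr:
        return 1.0
    if frho < -thr:
        return -1.0
    return 0.5 * np.log((1.0 + frho) / (1.0 - frho))

for frho in [-0.9, -0.3, 0.0, 0.3, 0.8, 0.95]:
    p = (1.0 + frho) / 2.0                        # since f_rho(x) = 2 P(y=1|x) - 1
    res = minimize_scalar(lambda f: conditional_risk(f, p), bounds=(-5.0, 5.0), method="bounded")
    print(frho, round(res.x, 4), round(f_rho_eh(frho), 4))   # numerical minimizer vs. (23)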
72 Then for any 0 < δ < 1, with confidence 1 − δ, we have R(sgn(f_z)) − R(f_c) ≤ C_{ρ,eh} m^{−θ_eh} log(2/δ) with θ_eh = s/…. [sent-248, score-0.98]
73 When s ≤ 1, the same condition for f_ρ implies assumption (8) for f_ρ^{φ_eh} needed for Theorem 5, as seen from expression (23). [sent-257, score-0.232]
74 It is possible to refine learning rates (22) and (24) by improving comparison relation (12) when Tsybakov noise condition (5) is satisfied. [sent-258, score-0.13]
75 We know from Smale and Zhou (2003) that when φ = φls the approximation error and hence learning rates can essentially be characterized by regularities of the function fρ . [sent-265, score-0.133]
76 However, the index s in regularity assumption (8) for the function f_ρ^φ might vary dramatically, leading to a varying power index θ for the learning rates. [sent-268, score-0.123]
77 The dependence of the function f_ρ^{φ_eh} on f_ρ has an advantage of ignoring any irregularity appearing in the domain where |f_ρ(x)| > (e² − 1)/(e² + 1). [sent-270, score-0.203]
78 This can be seen from the following example where f_ρ has a singularity at 0 while f_ρ^{φ_eh} ≡ 1 is C^∞. [sent-271, score-0.227]
79 So regularity assumption (8) is satisfied for f_ρ^{φ_ls} in the Sobolev space H^s if and only if s < α + 1/2. [sent-274, score-0.284]
80 Then from Theorem 1, we see the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_ls} log(2/δ)) with θ_ls = s/(s + 2 + 2), arbitrarily close to (1 + 2α)/(9 + 2α). [sent-275, score-1.006]
81 However, for the exponential-hinge loss φ_eh, we have f_ρ^{φ_eh} ≡ 1, which follows from expression (23) and the definition f_ρ(x) = 1 − (1/5)|x|^α ≥ 1 − 1/5 > (e² − 1)/(e² + 1) on X. [sent-276, score-0.26]
82 Therefore, regularity assumption (8) is satisfied for an arbitrarily large s and Theorem 5 yields the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_eh} log(2/δ)) with θ_eh arbitrarily close to 1/2. [sent-277, score-1.036]
83 Then Theorem 4 yields the learning rate R(sgn(f_z)) − R(f_c) = O(m^{−θ_r} log(2/δ)) with θ_r = s/(4(s(1 − 1/r) + 1 + 1)) arbitrarily close to … This power index is always less than that of φ_ls or φ_eh. [sent-284, score-1.059]
84 If X is a connected compact C^∞ submanifold of R^n without boundary and its dimension is d ≤ n, then the covering number estimate (15) holds with n replaced by the manifold dimension d. [sent-291, score-0.128]
85 Learning rates in Theorems 1, 4 and 5 can be improved with n replaced by d if approximation error estimates similar to Theorem 2 can be established in the manifold setting. [sent-293, score-0.136]
86 In our analysis we assume that the convex loss φ has a zero, which excludes the logistic loss φ(t) = log(1 + e^{−t}). [sent-303, score-0.114]
87 One might generalize our analysis to get some error bounds for the scheme with loss functions without zero by using a general projection operator π_M with level M > 0 given by π_M(f)(x) = M if f(x) > M, π_M(f)(x) = −M if f(x) < −M, and π_M(f)(x) = f(x) if −M ≤ f(x) ≤ M. [sent-304, score-0.175]
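The projection operator π_M is just a truncation of function values; a one-line sketch (the function name is my own):

import numpy as np

def project(f_values, M=1.0):
    # pi_M(f)(x): truncate f(x) to the interval [-M, M].
    return np.clip(f_values, -M, M)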
88 This bound in connection with (30) and (31) implies (26) with the constant B given by φ ′ B = max Cs f˜ρ φ f˜ρ ′ 2 L∞ (Rn ) , (Cs ) n 2 −2 , L2 (Rn ) (2π) φ ′ sup {|φ′ (ξ)| : |ξ| ≤ (Cs + 1) f˜ρ + φ (2) Without the condition f˜ρ ∈ L∞ (Rn ), we bound fσ,λ f˜σ . [sent-339, score-0.141]
89 With this estimate, under Tsybakov noise condition (5), learning rates are obtained in Steinwart and Scovel (2007). [sent-356, score-0.13]
90 For example, when α > (q + 2)/(2q), for an arbitrarily small ε > 0, with confidence 1 − δ, R(sgn(f_z)) − R(f_c) ≤ C_ε (log(4/δ))^2 (1/m)^{2α(q+1)/(2α(q+2)+3q+4) − ε}. [sent-357, score-1.006]
91 (33) Since no Sobolev smoothness is assumed for f_ρ^{φ_h} = f_c (Wahba, 1990), we need to use the regularizing function f_{σ,λ} defined by (7) and derive by some detailed computations that with confidence 1 − δ, [sent-358, score-0.373]
92 R(sgn(f_z)) − R(f_c) ≤ C_{ρ,h} log(4/δ) (1/m)^{(q+1)αn/((q+2)αn+2(q+1)(n+1))}. This rate is slightly worse than (33), though the estimate for the confidence is slightly better. [sent-359, score-0.98]
93 Role of Tight Bounds for Covering Numbers In this appendix we prove Lemma 2 which shows a special role of the tight bound (15) for covering numbers concerning Gaussian kernels. [sent-362, score-0.185]
94 Proof Observe from (15) that, as a function on (0, +∞), the logarithm of the middle term of (16) is bounded by h(ε) := C_0 ( (log( φ(0)|φ'(−1)| / (√λ ε) ))^{n+1} + 1/σ^{2(n+1)} ) − g(ε), where g is the strictly increasing function on (0, ∞) defined by g(ε) = mε^{2−τ} / ( 2C_1 + (2/3)φ(−1)ε^{1−τ} ). [sent-366, score-0.104]
95 Set B = ( ( 4C_1(log(2/δ) + C_0/σ^{2(n+1)}) + 4C_0C_1(∆ log m)^{n+1} ) / m )^{1/(2−τ)} + (4φ(−1)/(3m)) ( log(2/δ) + C_0/σ^{2(n+1)} + C_0(∆ log m)^{n+1} ) + φ(0)|φ'(−1)| / (√λ m^∆). [sent-367, score-0.312]
96 3m δ σ 2 If 3 φ(−1)B 1−τ ≤ 2C1 , then g(B ) ≥ 2 C0 mB 2−τ ≥ log + 2(n+1) +C0 (∆ log m)n+1 . [sent-368, score-0.208]
97 4C1 δ σ 2 If 3 φ(−1)B 1−τ > 2C1 , then g(B ) ≥ mB 2−τ 4 1−τ 3 φ(−1)B = mB 4 3 φ(−1) 2 C0 ≥ log + 2(n+1) +C0 (∆ log m)n+1 . [sent-369, score-0.208]
98 δ σ Thus in either case we have C0 2 g(B ) ≥ log + 2(n+1) +C0 (∆ log m)n+1 . [sent-370, score-0.208]
99 On the other hand, since B ≥ φ(0)|φ'(−1)| / (√λ m^∆), we also see that log( φ(0)|φ'(−1)| / (√λ B) ) ≤ ∆ log m. It follows that h(B) ≤ C_0(∆ log m)^{n+1} − log(2/δ) − C_0(∆ log m)^{n+1} = log(δ/2). [sent-372, score-0.52]
100 Then we know from the special form (3) of λ and σ that ε*(m, λ, σ, δ/2) ≤ C_2 ( (log(2/δ)/m)^{1/(2−τ)} + (1/m^{1−2γζ(n+1)})^{1/(2−τ)} + ((log m)^{n+1}/m)^{1/(2−τ)} + φ(0)/m ). [sent-380, score-0.126]
wordName wordTfidf (topN-words)
[('fz', 0.627), ('rn', 0.256), ('ls', 0.254), ('fc', 0.249), ('eh', 0.203), ('sgn', 0.164), ('cqt', 0.156), ('excess', 0.156), ('aussians', 0.142), ('hou', 0.133), ('iang', 0.133), ('sz', 0.12), ('tsybakov', 0.108), ('ez', 0.106), ('zhou', 0.104), ('log', 0.104), ('covering', 0.103), ('scovel', 0.096), ('cs', 0.093), ('steinwart', 0.082), ('vito', 0.076), ('lassification', 0.074), ('regularizing', 0.074), ('sobolev', 0.074), ('cq', 0.072), ('wu', 0.065), ('misclassi', 0.064), ('gaussians', 0.059), ('loss', 0.057), ('variancing', 0.057), ('theorem', 0.054), ('dence', 0.053), ('rates', 0.053), ('ying', 0.052), ('smoothness', 0.05), ('smale', 0.049), ('rkhs', 0.048), ('noise', 0.048), ('tight', 0.047), ('tikhonov', 0.043), ('gaussian', 0.043), ('probz', 0.043), ('scheme', 0.038), ('dy', 0.037), ('specialty', 0.036), ('xiang', 0.036), ('regularization', 0.036), ('dx', 0.035), ('kernels', 0.035), ('bound', 0.035), ('capacity', 0.034), ('error', 0.034), ('power', 0.033), ('cucker', 0.033), ('kernel', 0.032), ('reproducing', 0.032), ('zi', 0.031), ('regularity', 0.03), ('young', 0.03), ('condition', 0.029), ('lemma', 0.028), ('hardin', 0.028), ('proposition', 0.028), ('chen', 0.028), ('kong', 0.028), ('ball', 0.027), ('novelty', 0.027), ('fourier', 0.026), ('arbitrarily', 0.026), ('classifying', 0.025), ('operator', 0.025), ('svm', 0.025), ('yao', 0.025), ('hong', 0.025), ('manifold', 0.025), ('singularity', 0.024), ('cityu', 0.024), ('edmunds', 0.024), ('pan', 0.024), ('schaback', 0.024), ('xf', 0.024), ('approximation', 0.024), ('ye', 0.023), ('exponent', 0.023), ('schemes', 0.023), ('nice', 0.023), ('know', 0.022), ('convexity', 0.022), ('bounds', 0.021), ('proved', 0.021), ('evgeniou', 0.021), ('mb', 0.021), ('mathematics', 0.021), ('sup', 0.021), ('connection', 0.021), ('varying', 0.02), ('al', 0.02), ('tells', 0.02), ('index', 0.02), ('classi', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
2 0.21914919 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
Author: Ting Hu, Ding-Xuan Zhou
Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a H¨ lder space. For regression with least square or insensitive loss, learning rates are given o in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space
3 0.065112539 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not signiÄ?Ĺš cantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be deÄ?Ĺš ned in a non-local manner, such as support vector machines in classiÄ?Ĺš cation and least-squares estimators in regression. Aside from showing that consistency implies a speciÄ?Ĺš c form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classiÄ?Ĺš cation
4 0.06338881 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
Author: Christian Rieger, Barbara Zwicknagl
Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space
5 0.061070304 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang
Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semiinner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-innerproducts, representer theorems
6 0.058052525 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
7 0.056242328 78 jmlr-2009-Refinement of Reproducing Kernels
8 0.052882899 46 jmlr-2009-Learning Halfspaces with Malicious Noise
9 0.048498519 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
10 0.048269659 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
11 0.047482874 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
12 0.04518434 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
13 0.037697248 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
14 0.035420839 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
15 0.035253838 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
16 0.034810118 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
18 0.032720163 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
19 0.029770151 38 jmlr-2009-Hash Kernels for Structured Data
20 0.028786214 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
topicId topicWeight
[(0, 0.182), (1, 0.085), (2, 0.024), (3, -0.021), (4, -0.26), (5, 0.1), (6, 0.068), (7, -0.016), (8, 0.136), (9, 0.218), (10, 0.015), (11, -0.128), (12, -0.101), (13, -0.025), (14, 0.08), (15, -0.134), (16, -0.082), (17, -0.038), (18, 0.115), (19, 0.028), (20, 0.093), (21, -0.005), (22, 0.11), (23, -0.113), (24, -0.146), (25, 0.009), (26, -0.123), (27, 0.052), (28, 0.116), (29, 0.021), (30, -0.051), (31, 0.022), (32, 0.053), (33, -0.025), (34, 0.194), (35, 0.172), (36, -0.056), (37, 0.097), (38, 0.128), (39, 0.036), (40, -0.089), (41, -0.209), (42, 0.167), (43, -0.04), (44, -0.093), (45, 0.052), (46, 0.026), (47, 0.034), (48, -0.139), (49, 0.137)]
simIndex simValue paperId paperTitle
same-paper 1 0.95242751 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
2 0.59031111 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
Author: Ting Hu, Ding-Xuan Zhou
Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a H¨ lder space. For regression with least square or insensitive loss, learning rates are given o in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space
Author: Christian Rieger, Barbara Zwicknagl
Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space
4 0.36527109 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
Abstract: We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε2 / log(n/ε)). (The best previous result was Ω(ε/(n log(n/ε))1/4 ).) • Learning origin-centered halfspaces with respect to any isotropic logconcave distribution on Rn with malicious noise rate η = Ω(ε3 / log2 (n/ε)). This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε3 / log(1/ε)). In the adversarial label noise setting (or agnostic model), labels can be noisy, but not example points themselves. Previous results could handle η = Ω(ε) but had running time exponential in an unspecified function of 1/ε. Our analysis crucially exploits both concentration and anti-concentration properties of isotropic log-concave distributions. Our algorithms combine an iterative outlier removal procedure using Principal Component Analysis together with “smooth” boosting. Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers c 2009 Adam R. Klivans, Philip M. Long and Rocco A. Servedio. K LIVANS , L ONG AND S ERVEDIO
5 0.35510096 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
6 0.32601598 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
7 0.31598246 18 jmlr-2009-Consistency and Localizability
8 0.30510834 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
9 0.26769146 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
11 0.21528195 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
12 0.20284227 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
13 0.20057383 78 jmlr-2009-Refinement of Reproducing Kernels
14 0.19946413 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
15 0.19678542 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
16 0.18196484 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
17 0.17528544 13 jmlr-2009-Bounded Kernel-Based Online Learning
18 0.17060424 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
19 0.16945738 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
20 0.16721936 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
topicId topicWeight
[(11, 0.013), (21, 0.01), (38, 0.011), (52, 0.029), (55, 0.053), (58, 0.014), (66, 0.641), (90, 0.034), (96, 0.068)]
simIndex simValue paperId paperTitle
1 0.99341965 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
same-paper 2 0.9920606 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
3 0.99063396 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
Author: Jan Ramon, Siegfried Nijssen
Abstract: Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation schema, and introduce an algorithm that can be used to enumerate any class of graphs with polynomial delay, as long as the class of graphs can be described using a monotonic predicate operating on a dense augmentation schema. In practice this means that this is the first enumeration algorithm that can be applied theoretically efficiently in any frequent subgraph mining algorithm, and that this algorithm generalizes to situations beyond the standard frequent subgraph mining setting. Keywords: graph mining, enumeration, monotonic graph classes
4 0.97303724 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
5 0.97108716 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
Author: Jun Zhu, Eric P. Xing
Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural maxentropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the wellknown maximum margin Markov networks (M3 N) as a special case, and leads to a model similar to an L1 -regularized M3 N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M3 N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov network for structured input/output problems; and the basic principle can be generalized to learning arbi
6 0.87324774 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
7 0.85920042 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
8 0.8365398 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
9 0.83259284 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
10 0.83177561 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
11 0.82081074 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
12 0.81046623 78 jmlr-2009-Refinement of Reproducing Kernels
13 0.80924863 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
14 0.80777073 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
15 0.80209577 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
16 0.8003335 46 jmlr-2009-Learning Halfspaces with Malicious Noise
17 0.79660535 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
18 0.78997338 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
19 0.78705662 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
20 0.78620809 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost