jmlr jmlr2007 jmlr2007-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see Breiman, 1998), theoretical results do not give a complete explanation of its effectiveness. [sent-12, score-0.199]
2 Later Breiman (1997, 1998, 2004) pointed out that boosting is a gradient descent type algorithm (see also Friedman et al. [sent-14, score-0.199]
3 Experimental results by Drucker and Cortes (1996), Quinlan (1996), Breiman (1998), Bauer and Kohavi (1999) and Dietterich (2000) showed that boosting is a very effective method that often leads to a low test error. [sent-17, score-0.199]
4 It was also noted that boosting continues to decrease test error long after the sample error becomes zero: though it keeps adding more weak classifiers to the linear combination of classifiers, the generalization error, perhaps surprisingly, usually does not increase. [sent-18, score-0.222]
5 However, some of the experiments suggested that there might be problems, since boosting performed worse than bagging in the presence of noise (Dietterich, 2000), and boosting concentrated not only on the “hard” areas, but also on outliers and noise (Bauer and Kohavi, 1999). [sent-19, score-0.44]
6 Upper bounds on the risk of boosted classifiers were obtained, based on the fact that boosting tends to maximize the margin of the training examples (Schapire et al., 1998; Koltchinskii and Panchenko, 2002). [sent-26, score-0.261]
7 But Breiman (1999) pointed out that margin-based bounds do not completely explain the success of boosting methods. [sent-27, score-0.199]
8 Jiang (2004) analyses the unmodified boosting algorithm and proves a process consistency property, under certain assumptions. [sent-36, score-0.279]
9 Process consistency means that there exists a sequence (tn ) such that if AdaBoost with sample size n is stopped after tn iterations, its risk approaches the Bayes risk. [sent-37, score-0.427]
10 Of course we want this probability to be as small as possible and close to the Bayes risk $L^\star = \inf_g L(g) = \mathbb{E}\big(\min\{\eta(X), 1-\eta(X)\}\big)$, where the infimum is taken over all possible (measurable) classifiers and $\eta(\cdot)$ is the conditional probability $\eta(x) = P(Y = 1 \mid X = x)$. [sent-57, score-0.246]
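The short sketch below is my own illustration, not code from the paper: it estimates this Bayes risk by Monte Carlo for a hypothetical toy distribution in which X is uniform on [0, 1] and η(x) = x; every name in the snippet is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # X ~ Uniform(0, 1), a toy choice
eta = x                                     # eta(x) = P(Y = 1 | X = x) = x, a toy choice
bayes_risk = float(np.mean(np.minimum(eta, 1.0 - eta)))
print(bayes_risk)                           # should be close to 1/4 for this toy model
```

For this particular toy model the exact value is E[min(X, 1-X)] = 1/4, which the Monte Carlo estimate approaches.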
11 Then the boosting procedure can be described as follows. [sent-69, score-0.199]
12 $R_{\varphi,n}(f_k) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} R_{\varphi,n}(f_{k-1} + \alpha h) + (1-\gamma)\, R_{\varphi,n}(f_{k-1})$. [sent-78, score-0.358]
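As a purely illustrative reading of this recursion (a sketch of mine, not the paper's algorithm), the snippet below performs one such step for the exponential surrogate loss by brute-force search over a finite base class and a grid of step sizes; a finite grid only approximates the infimum, which is exactly the kind of slack the factor γ allows. All function and variable names are assumptions of the sketch.

```python
import numpy as np

def phi(z):
    # Exponential surrogate loss phi(z) = exp(-z), the loss used by AdaBoost.
    return np.exp(-z)

def empirical_phi_risk(f_vals, y):
    # R_{phi,n}(f) = (1/n) sum_i phi(y_i f(x_i)); f is represented by its values on the sample.
    return float(np.mean(phi(y * f_vals)))

def boosting_step(f_vals, y, base_preds, alphas=np.linspace(-5.0, 5.0, 201)):
    """One greedy step f_k = f_{k-1} + alpha * h over a finite base class.
    base_preds has shape (num_base_classifiers, n) with entries in {-1, +1}."""
    best_risk, best_update = np.inf, None
    for h_vals in base_preds:
        for a in alphas:
            risk = empirical_phi_risk(f_vals + a * h_vals, y)
            if risk < best_risk:
                best_risk, best_update = risk, f_vals + a * h_vals
    return best_update, best_risk
```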
13 We shall also use the convex hull of $\mathcal{H}$ scaled by $\lambda \ge 0$, $\mathcal{F}_\lambda = \{\, f : f = \sum_{i=1}^{n} \lambda_i h_i,\ n \in \mathbb{N} \cup \{0\},\ \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = \lambda,\ h_i \in \mathcal{H} \,\}$, as well as the set of $k$-combinations, $k \in \mathbb{N}$, of functions in $\mathcal{H}$: $\mathcal{F}^k = \{\, f : f = \sum_{i=1}^{k} \lambda_i h_i,\ \lambda_i \in \mathbb{R},\ h_i \in \mathcal{H} \,\}$. [sent-84, score-0.384]
14 We also need to define the norm of an $f \in \mathcal{F}$ as the smallest $\ell_1$-norm over its representations: $\|f\| = \inf\{\, \sum_i |\alpha_i| : f = \sum_i \alpha_i h_i,\ h_i \in \mathcal{H} \,\}$. [sent-85, score-0.376]
15 However boosting in general and AdaBoost in particular may produce functions that cannot be appropriately bounded. [sent-105, score-0.199]
16 Furthermore, the complexity of the clipped class (as measured by its pseudo-dimension—see Pollard, 1984) grows slowly with the stopping time t, so we can show that the ϕ-risk of a clipped function is not much larger than its empirical ϕ-risk. [sent-109, score-0.183]
17 In order to compare the empirical ϕ-risk of the clipped function to that of a suitable reference sequence f¯n , we first use the fact that the empirical ϕ-risk of a clipped function πλ ◦ ft is not much larger than the empirical ϕ-risk of ft . [sent-111, score-0.347]
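A minimal sketch (mine, not the paper's code) of the clipping operator π_λ and of the two facts it is used for here: clipping never changes the sign of the combined classifier, and for the exponential loss it can raise the empirical ϕ-risk by at most ϕ(λ) = e^{-λ}. The random toy data and all names are assumptions of the example.

```python
import numpy as np

def clip_pi(f_vals, lam):
    # Clipping operator pi_lambda: truncate function values to [-lambda, lambda].
    return np.clip(f_vals, -lam, lam)

rng = np.random.default_rng(1)
f_vals = rng.normal(scale=10.0, size=1000)      # toy values of a combined classifier
y = rng.choice([-1.0, 1.0], size=1000)          # toy labels
lam = 2.0

emp_exp_risk = lambda g: float(np.mean(np.exp(-y * g)))

# Clipping never changes the sign, hence never changes the 0-1 classification decisions...
assert np.array_equal(np.sign(clip_pi(f_vals, lam)), np.sign(f_vals))
# ...and for the exponential loss it increases the empirical phi-risk by at most exp(-lam).
assert emp_exp_risk(clip_pi(f_vals, lam)) <= emp_exp_risk(f_vals) + np.exp(-lam) + 1e-12
print(emp_exp_risk(f_vals), emp_exp_risk(clip_pi(f_vals, lam)))
```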
18 Let the distribution $P$ and class $\mathcal{H}$ be such that $\lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} R_\varphi(f) = R_\varphi^\star$, where $R_\varphi^\star = \inf R_\varphi(f)$, the infimum being over all measurable functions. [sent-122, score-0.482]
19 The following set of conditions deals with uniform convergence and convergence of the boosting algorithm. [sent-126, score-0.263]
20 The main theorem (Theorem 1) shows that these, together with Condition 1, suffice for consistency of the boosting procedure. [sent-127, score-0.364]
21 Assume, without loss of generality, that for $\varphi_\lambda = \inf_{x \in [-\lambda,\lambda]} \varphi(x)$ we have $\lim_{\lambda \to \infty} \varphi_\lambda = \inf_x \varphi(x) = 0$. [sent-148, score-0.27]
22 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^\star$ as $n \to \infty$. [sent-150, score-1.298]
23 $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon^1_n(\omega)$ by (3); $\le R_{\varphi,n}(f_{t_n}) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n}$; $\le R_{\varphi,n}(\bar f_n) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n} + \varepsilon^2_n(\omega)$ by (5); $\le R_\varphi(\bar f_n) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n} + \varepsilon^2_n(\omega) + \varepsilon^3_n(\omega)$ by (4). [sent-153, score-1.413]
24 But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore a.s. $L(g(f_{t_n})) \to L^\star$. [sent-162, score-0.942]
25 Hence, the boosting procedure is consistent if stopped after t n steps. [sent-165, score-0.265]
26 The almost sure formulation of Condition 2 does not provide explicit rates of convergence of L(g( ftn )) to L . [sent-166, score-0.503]
27 Uniform Convergence of $t_n$-Combinations. Here we show that Condition 2 (a) is satisfied for a variety of functions ϕ, and in particular for the exponential loss used in AdaBoost. [sent-171, score-0.22]
28 Lemma 3: For any $t \in \mathbb{N}$, if $d_{VC}(\mathcal{H}) \ge 2$, the following holds: $d_P(\mathcal{F}^t) \le 2(t+1)(d_{VC}(\mathcal{H})+1)\log_2[2(t+1)/\ln 2]$, where $d_P(\mathcal{F}^t)$ is the pseudo-dimension of the class $\mathcal{F}^t$. [sent-173, score-0.274]
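To give a feel for how this bound grows with the stopping time, here is a tiny evaluation script (an illustration of mine, not from the paper); the toy VC dimension 2 for one-dimensional decision stumps is an assumption of the example.

```python
import math

def pdim_bound(t, d_vc):
    """Lemma 3 upper bound on the pseudo-dimension of F^t (assumes d_vc >= 2):
    2(t+1)(d_vc+1) log2(2(t+1)/ln 2)."""
    return 2 * (t + 1) * (d_vc + 1) * math.log2(2 * (t + 1) / math.log(2))

# The bound grows only slightly faster than linearly in the number of iterations t.
for t in (10, 100, 1000):
    print(t, round(pdim_bound(t, d_vc=2)))
```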
29 The proof of consistency is based on the following result, which builds on the result by Koltchinskii and Panchenko (2002) and resembles a lemma due to Lugosi and Vayatis (2004, Lemma 2). [sent-174, score-0.212]
30 (Lemma 4.) Let $L_{\varphi,\zeta}$ and $M_{\varphi,\zeta}$ be the Lipschitz constant and the envelope bound of ϕ, defined via suprema over $x \in [-\zeta,\zeta]$. Then for $V = d_{VC}(\mathcal{H})$, $c = 24\int_0^1 \sqrt{\ln(8e/\varepsilon^2)}\, d\varepsilon$ and any $n$, $\zeta > 0$ and $t > 0$, $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$. [sent-176, score-0.548]
31 Also, for any $\delta > 0$, with probability at least $1-\delta$, $\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n} + M_{\varphi,\zeta}\sqrt{\ln(1/\delta)/(2n)}$. [sent-177, score-0.518]
32 For the exponential loss $\varphi(x) = e^{-x}$ we can choose $\zeta = \kappa \ln n$ and $t = n^{1-\varepsilon}$ with $\kappa > 0$, $\varepsilon \in (0,1)$ and $2\kappa - \varepsilon < 0$. [sent-183, score-0.274]
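A small sketch (mine) of what these choices look like numerically; the default values of ε and κ are assumptions of the example, constrained only by 0 < 2κ < ε < 1.

```python
import math

def stopping_parameters(n, eps=0.5, kappa=0.2):
    """Illustrative choice: t_n = n^(1-eps) iterations and clipping level zeta_n = kappa*ln(n),
    with the constraint 0 < 2*kappa < eps < 1 from the text."""
    assert 0 < 2 * kappa < eps < 1
    return int(n ** (1 - eps)), kappa * math.log(n)

for n in (1_000, 100_000, 10_000_000):
    print(n, stopping_parameters(n))
```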
33 Theorem 5: Define the variation of a function ϕ on the interval $[-a,a]$ (for $a > 0$) as $V_{\varphi,a} = \sup_{x \in [-a,a]} \varphi(x) - \inf_{x \in [-a,a]} \varphi(x)$. [sent-187, score-0.328]
34 If a sequence $\{\bar f_n\}_{n=1}^\infty$ satisfies the condition $\bar f_n(x) \in [-\lambda_n, \lambda_n]$ for all $x \in \mathcal{X}$, where $\lambda_n > 0$ is chosen so that $V_{\varphi,\lambda_n} = o\big(\sqrt{n/\ln n}\big)$, then a.s. $R_{\varphi,n}(\bar f_n) - R_\varphi(\bar f_n) \to 0$ as $n \to \infty$. [sent-188, score-0.374]
35 These restrictions are satisfied if $V_{\varphi,\lambda_n}^2 = o(n/\ln n)$, and the statement of the theorem follows. [sent-195, score-0.397]
36 In the case of the AdaBoost loss $\varphi(x) = e^{-x}$ we shall choose $\lambda_n = \kappa \ln n$, where $\kappa \in (0, 1/2)$. [sent-197, score-0.35]
37 Define $f_{k+1} = f_k + \alpha_{k+1} h_{k+1}$ such that $Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1-\gamma) Q(f_k)$ and $Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1})$. [sent-208, score-0.868]
38 Then for any reference function $\bar f$ and the sequence of functions $f_m$ produced by the boosting algorithm, the following bound holds for all $m > 0$ such that $Q(f_m) > Q(\bar f)$. [sent-209, score-0.31]
39 $Q(f_m) \le Q(\bar f) + \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln \frac{\ell_0^2 + c_3 m}{\ell_0^2}\Big)^{-1/2}$, (11) where $\ell_k = \|\bar f - f_k\|$, $c_3 = 2(Q(f_0)-Q(\bar f))/\beta$, $\beta = \inf\{Q''(f;h) : Q(\bar f) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$, $B = \sup\{Q''(f;h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$. [sent-210, score-0.813]
40 Similar bounds have been provided in earlier work under either an assumption of a bounded step size of the boosting algorithm or a positive lower bound on $Q''(f;h)$ for all $f$, $h$. [sent-217, score-0.223]
41 Since we consider boosting algorithms with unrestricted step size, the only option would be to assume a positive lower bound on the second derivative. [sent-218, score-0.223]
42 It is easy to see that the theorem above applies to the AdaBoost algorithm, since there we first choose the direction (base classifier) $h_i$ and then compute the step size $\alpha_i$ as $\alpha_i = \frac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i} = \frac{1}{2}\ln\frac{R(f_i) - R'(f_i; h_i)}{R(f_i) + R'(f_i; h_i)}$. [sent-221, score-0.871]
43 Now we only have to recall that this value of $\alpha_i$ corresponds to exact minimization in the direction $h_i$. [sent-222, score-0.238]
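To make the stopping strategy concrete, here is a self-contained sketch (mine, not the authors' code) of AdaBoost over decision stumps that stops after t_n = n^{1-ε} rounds. Labels are assumed to be in {-1, +1}; the stump learner, the default ε and the numerical guards are choices of the sketch, not prescriptions of the paper, whose analysis applies to any VC class of base classifiers.

```python
import numpy as np

def adaboost_early_stop(X, y, eps=0.5):
    """AdaBoost on decision stumps, stopped after t_n = n^(1-eps) rounds (illustrative only)."""
    n, d = X.shape
    t_n = max(1, int(n ** (1 - eps)))          # early-stopping rule t_n = n^(1-eps)
    w = np.full(n, 1.0 / n)                    # example weights
    stumps, alphas = [], []

    def stump_predict(Xm, j, thr, s):
        # A stump: sign of coordinate j against threshold thr, with orientation s.
        return s * np.sign(Xm[:, j] - thr + 1e-12)

    for _ in range(t_n):
        best = None
        for j in range(d):                     # brute-force search over all stumps
            for thr in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    err = np.sum(w[stump_predict(X, j, thr, s) != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        err = min(max(err, 1e-12), 1.0 - 1e-12)          # guard the logarithm
        alpha = 0.5 * np.log((1.0 - err) / err)          # exact line search for the exponential loss
        w *= np.exp(-alpha * y * stump_predict(X, j, thr, s))
        w /= w.sum()
        stumps.append((j, thr, s))
        alphas.append(alpha)

    def f(Xnew):
        # Real-valued combination f_{t_n}; its sign is the classifier g(f_{t_n}).
        return sum(a * stump_predict(Xnew, j, thr, s)
                   for a, (j, thr, s) in zip(alphas, stumps))

    return (lambda Xnew: np.sign(f(Xnew))), f
```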
44 Then with probability at least $1-\delta_n$, where $\delta_n = \exp\big(-2(R^\star)^2 n^{1-2\kappa}/a^2\big)$, the following holds: $R_n(f_{t_n}) \le R_n(\bar f_n) + 2\sqrt{2}\,\big(1 - R^\star(a-1)/a\big)\Big(\ln\frac{\lambda_n^2 + 2 t_n\,(a/(a-1) - R^\star)/R^\star}{\lambda_n^2}\Big)^{-1/2}$. [sent-233, score-0.274]
45 It is easy to see that the choice of the $\lambda_n$'s in the above theorem ensures that $\sum_{n=1}^{\infty} \delta_n < \infty$; therefore the Borel-Cantelli lemma guarantees that for $t_n \to \infty$ sufficiently fast (for example as $O(n^\alpha)$ for $\alpha \in (0,1)$) a.s. $\limsup_{n\to\infty}\big(R_n(f_{t_n}) - R_n(\bar f_n)\big) \le 0$. [sent-239, score-0.354]
46 If in addition to the conditions of Theorem 8 we require that $R(\bar f_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n$ for some $\varepsilon_n \to 0$, then together with Condition 1 this implies $R(\bar f_n) \to R^\star$ as $n \to \infty$, and Condition 2 (c) follows. [sent-242, score-0.25]
47 Corollary 9: Assume $V = d_{VC}(\mathcal{H}) < \infty$, $\lim_{\lambda\to\infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^\star$ and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0,1)$. [sent-245, score-0.426]
48 Then AdaBoost stopped at step $t_n$ returns a sequence of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^\star$. [sent-246, score-0.816]
49 As was suggested after the proof of Lemma 4 we may choose ζn = κ ln n for 2κ − ε < 0 (which also implies κ < 1/2) to satisfy Condition 2 (a). [sent-249, score-0.325]
50 Recall that the discussion after the proof of Theorem 8 suggests choosing the sequence $\{\bar f_n\}_{n=1}^\infty$ of reference functions such that $\bar f_n \in \mathcal{F}_{\lambda_n}$ and $R(\bar f_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n$ for $\varepsilon_n \to 0$ and $\lambda_n = \kappa \ln n$ with $\kappa \in (0,1/2)$, to ensure that Condition 2 (c) holds. [sent-250, score-0.618]
51 Discussion. We showed that AdaBoost is consistent if stopped sufficiently early, after $t_n$ iterations, for $t_n = O(n^{1-\varepsilon})$ with $\varepsilon \in (0,1)$. [sent-255, score-0.254]
52 We do not know what happens in between $O(n^{1-\varepsilon})$ and $O(n^2 \ln n)$. [sent-258, score-0.274]
53 The AdaBoost algorithm, as well as other versions of the boosting procedure, replaces the 0 − 1 loss with a convex function ϕ to overcome algorithmic difficulties associated with the non-convex optimization problem. [sent-260, score-0.231]
54 We may choose $t_n$, $\lambda_n$, $\zeta_n$ the same as for the exponential loss, or set $\lambda_n = n^{1/4-\vartheta_1}$, $\vartheta_1 \in (0,1/4)$, $\zeta_n = n^{\rho-\vartheta_2}$, $\vartheta_2 \in (0,\rho)$, $\rho = \min(\varepsilon/2, 1/4)$, to get the following analog of Corollary 9. [sent-273, score-0.251]
55 Assume $V = d_{VC}(\mathcal{H}) < \infty$, $\lim_{\lambda\to\infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^\star$ and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0,1)$. [sent-275, score-0.426]
56 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^\star$. [sent-276, score-0.827]
57 For example, for the logit loss $\varphi(x) = \ln(1 + e^{-x})$, Lemma 4 and Theorem 5 work, since $L_{\varphi,\lambda} = 1$ and $M_{\varphi,\lambda} = \ln(1 + e^{\lambda})$; hence choosing $t_n$, $\lambda_n$, $\zeta_n$ as for either the exponential or quadratic losses will work. [sent-278, score-0.243]
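The sketch below (mine, not from the paper) lists the three surrogate losses mentioned in the text and computes crude grid-based stand-ins for a Lipschitz constant and a sup-type bound on an interval [-λ, λ]; the exact definitions of L_{ϕ,λ} and M_{ϕ,λ} follow the paper, so the grid values are only numerical lower estimates of valid constants (for the logit loss they stay below the stated constant L = 1, which is an upper bound).

```python
import numpy as np

losses = {
    "exponential": lambda z: np.exp(-z),
    "logit":       lambda z: np.log1p(np.exp(-z)),
    "quadratic":   lambda z: (1.0 - z) ** 2,
}

def grid_constants(phi, lam, num=100_001):
    z = np.linspace(-lam, lam, num)
    vals = phi(z)
    lipschitz = float(np.max(np.abs(np.diff(vals)) / np.diff(z)))   # stand-in for L_{phi,lam}
    sup_dev = float(np.max(np.abs(vals - phi(0.0))))                # stand-in for M_{phi,lam}
    return lipschitz, sup_dev

for name, phi in losses.items():
    print(name, [round(c, 3) for c in grid_constants(phi, lam=3.0)])
```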
58 Rate of Convergence of $L(g(f_{t_n}))$ to $L^\star$. Here we formulate Condition 2 in a stricter form and prove consistency along with a rate of convergence of the boosting procedure to the Bayes risk. [sent-287, score-0.782]
59 $P\Big(\sup_{f \in \pi_{\zeta_n} \circ \mathcal{F}^{t_n}} |R_\varphi(f) - R_{\varphi,n}(f)| > \varepsilon^1_n\Big) < \delta^1_n$. [sent-292, score-0.31]
60 Theorem 11: Assume ϕ is classification-calibrated and convex, and, for $\varphi_\lambda = \inf_{x\in[-\lambda,\lambda]} \varphi(x)$, assume without loss of generality that $\lim_{\lambda\to\infty} \varphi_\lambda = \inf_x \varphi(x) = 0$. [sent-300, score-0.312]
61 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^\star$ as $n \to \infty$. [sent-302, score-1.298]
62 $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon^1_n$ by (12) (16); $\le R_{\varphi,n}(f_{t_n}) + \varepsilon^1_n + \varphi_{\zeta_n}$; $\le R_{\varphi,n}(\bar f_n) + \varepsilon^1_n + \varphi_{\zeta_n} + \varepsilon^3_n$ by (14) (17); $\le R_\varphi(\bar f_n) + \varepsilon^1_n + \varphi_{\zeta_n} + \varepsilon^3_n + \varepsilon^2_n$ by (13). [sent-304, score-1.413]
63 Now we appeal to the Borel-Cantelli lemma and arrive at $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \to R_\varphi^\star$ a.s. [sent-307, score-0.588]
64 But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore a.s. $L(g(f_{t_n})) \to L^\star$. [sent-313, score-0.942]
65 Hence the boosting procedure is consistent if stopped after t n steps. [sent-316, score-0.265]
66 Then there exists a non-decreasing function ψ, such that $\psi(0) = 0$, and with probability at least $1 - \delta^1_n - \delta^2_n - \delta^3_n$, $L(g(f_{t_n})) - L^\star \le \psi^{-1}\big(\varepsilon^1_n + \varepsilon^2_n + \varepsilon^3_n + \varphi_{\zeta_n} + \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) - R_\varphi^\star\big)$, where $\psi^{-1}$ is the inverse of ψ. [sent-319, score-0.677]
67 By the results of Bartlett et al. (2006), if ϕ is convex we have that $\psi(\theta) = \varphi(0) - \inf\big\{\tfrac{1+\theta}{2}\varphi(\alpha) + \tfrac{1-\theta}{2}\varphi(-\alpha) : \alpha \in \mathbb{R}\big\}$, and for any distribution and any measurable function $f$, $L(g(f)) - L^\star \le \psi^{-1}\big(R_\varphi(f) - R_\varphi^\star\big)$. (19) [sent-321, score-0.244]
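As a small illustration (mine), the snippet below evaluates this ψ-transform numerically for the exponential loss and compares it with the closed form ψ(θ) = 1 - sqrt(1 - θ²), which is the standard value of the transform for that loss; the grid and its range are assumptions of the example.

```python
import numpy as np

# psi(theta) = phi(0) - inf_alpha [ (1+theta)/2 * phi(alpha) + (1-theta)/2 * phi(-alpha) ]
alphas = np.linspace(-20.0, 20.0, 400_001)

def psi_numeric(theta):
    vals = 0.5 * (1 + theta) * np.exp(-alphas) + 0.5 * (1 - theta) * np.exp(alphas)
    return 1.0 - float(vals.min())          # phi(0) = 1 for the exponential loss

for theta in (0.1, 0.5, 0.9):
    print(theta, round(psi_numeric(theta), 6), round(1 - np.sqrt(1 - theta ** 2), 6))
```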
68 On the other hand, $R_\varphi(f) - R_\varphi^\star = \big(R_\varphi(f) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f)\big) + \big(\inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) - R_\varphi^\star\big)$. [sent-322, score-0.412]
69 The proof of Theorem 11 shows that for the function $f_{t_n}$, with probability at least $1 - \delta^1_n - \delta^2_n$, $R_\varphi(f_{t_n}) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) \le \varepsilon^1_n + \varepsilon^2_n + \varepsilon^3_n + \varphi_{\zeta_n}$. [sent-323, score-0.916]
70 (Lemma 4, restated.) Let $L_{\varphi,\zeta}$ and $M_{\varphi,\zeta}$ be defined via suprema over $x \in [-\zeta,\zeta]$ as above. Then for $V = d_{VC}(\mathcal{H})$, $c = 24\int_0^1 \sqrt{\ln(8e/\varepsilon^2)}\, d\varepsilon$ and any $n$, $\zeta > 0$ and $t > 0$, $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$. [sent-333, score-0.67]
71 Also, for any $\delta > 0$, with probability at least $1-\delta$, $\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n} + M_{\varphi,\zeta}\sqrt{\ln(1/\delta)/(2n)}$. [sent-334, score-0.396]
72 Proof: The proof of this lemma is similar to the proof of Lugosi and Vayatis (2004, Lemma 2) in that we begin with symmetrization followed by the application of the “contraction principle”. [sent-335, score-0.183]
73 We use symmetrization to get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 2\,\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i\big(\varphi(-Y_i f(X_i)) - \varphi(0)\big)$, where the $\sigma_i$ are i.i.d. Rademacher random variables. [sent-336, score-0.244]
74 We then apply the contraction principle (Ledoux and Talagrand, pp. 112–113) with the function $\psi(x) = (\varphi(x) - \varphi(0))/L_{\varphi,\zeta}$ to get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 4 L_{\varphi,\zeta}\, \mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} (-\sigma_i Y_i f(X_i)) = 4 L_{\varphi,\zeta}\, \mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i)$. [sent-342, score-0.244]
75 Notice that functions in $\pi_\zeta \circ \mathcal{F}^t$ are bounded, clipped to absolute value at most ζ; therefore we can rescale $\pi_\zeta \circ \mathcal{F}^t$ by $(2\zeta)^{-1}$ and get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) = 2\zeta\, \mathbb{E}\sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i)$. [sent-344, score-0.193]
76 Next, we use Dudley's entropy integral (Dudley, 1999) to bound the right-hand side above: $\mathbb{E}\sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) \le \tfrac{12}{\sqrt{n}} \int_0^\infty \sqrt{\ln N(\varepsilon, (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t, L_2(P_n))}\, d\varepsilon$. [sent-345, score-0.542]
77 And then, since Lemma 3 gives an upper bound on the pseudo-dimension of the class $\mathcal{F}^t$, we have $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) \le c\,\zeta \sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$, with the constant $c$ above being independent of $\mathcal{H}$, $t$ and ζ. [sent-350, score-0.396]
78 For any $i \in \{1,\dots,n\}$, $\sup_{(x_j,y_j)_{j=1}^n,\,(x_i',y_i')} \Big|\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| - \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}'(f)|\Big| \le \frac{M_{\varphi,\zeta}}{n}$, where $R_{\varphi,n}'(f)$ is obtained from $R_{\varphi,n}(f)$ by changing the pair $(x_i, y_i)$ to an independent pair $(x_i', y_i')$. [sent-357, score-0.416]
79 Define $f_{k+1} = f_k + \alpha_{k+1} h_{k+1}$ such that $Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1-\gamma) Q(f_k)$ and $Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1})$. [sent-365, score-0.868]
80 Then for any reference function $\bar f$ and the sequence of functions $f_m$ produced by the boosting algorithm, the following bound holds for all $m > 0$ such that $Q(f_m) > Q(\bar f)$. [sent-366, score-0.31]
81 $Q(f_m) \le Q(\bar f) + \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln \frac{\ell_0^2 + c_3 m}{\ell_0^2}\Big)^{-1/2}$, where $\ell_k = \|\bar f - f_k\|$, $c_3 = 2(Q(f_0)-Q(\bar f))/\beta$, $\beta = \inf\{Q''(f;h) : Q(\bar f) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$, $B = \sup\{Q''(f;h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$. [sent-367, score-0.813]
82 Let $f_m - \bar f = \sum_i \tilde\alpha_i \tilde h_i$, where the $\tilde\alpha_i$ and $\tilde h_i$ correspond to the best representation (with the $\ell_1$-norm of $\tilde\alpha$ equal to the norm $\|f_m - \bar f\|$). [sent-375, score-0.557]
83 Then from (20) and linearity of the derivative we have $\varepsilon_m \le \big|\sum_i \tilde\alpha_i Q'(f_m; \tilde h_i)\big| \le \sup_{h \in \mathcal{H}} |Q'(f_m; h)| \sum_i |\tilde\alpha_i|$, therefore $\sup_{h \in \mathcal{H}} |Q'(f_m; h)| \ge \frac{\varepsilon_m}{\|f_m - \bar f\|} = \frac{\varepsilon_m}{\ell_m}$. (21) [sent-376, score-1.434]
84 Next, $Q(f_m + \alpha h_m) = Q(f_m) + \alpha Q'(f_m; h_m) + \tfrac{\alpha^2}{2} Q''(\tilde f_m; h_m)$, where $\tilde f_m = f_m + \tilde\alpha_m h_m$ for some $\tilde\alpha_m \in [0, \alpha_m]$. [sent-377, score-1.947]
85 By assumption $\tilde f_m$ is on the path from $f_m$ to $f_{m+1}$, and we have assumed exact minimization in the given direction, hence $f_{m+1}$ is the lowest point in the direction $h_m$ starting from $f_m$, so we have the following bounds: $Q(\bar f) < Q(f_{m+1}) \le Q(\tilde f_m) \le Q(f_m) \le Q(f_0)$. [sent-378, score-0.52]
86 Then by the definition of β, which depends on $Q(\bar f)$, we have $Q(f_{m+1}) \ge Q(f_m) + \inf_{\alpha \in \mathbb{R}}\big(\alpha Q'(f_m; h_m) + \tfrac{\alpha^2}{2}\beta\big) = Q(f_m) - \frac{|Q'(f_m; h_m)|^2}{2\beta}$. (22) [sent-379, score-2.02]
87 On the other hand, $Q(f_m + \alpha_m h_m) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_m + \alpha h) + (1-\gamma) Q(f_m) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}}\big(Q(f_m) + \alpha Q'(f_m; h) + \tfrac{\alpha^2}{2}B\big) + (1-\gamma) Q(f_m) = Q(f_m) - \gamma\,\frac{\sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B}$. (23) [sent-380, score-2.867]
88 2B (23) Therefore, combining (22) and (23), we get |Q ( fm ; hm )| ≥ sup |Q ( fm ; h)| h∈H γβ . [sent-381, score-1.029]
89 B (24) Another Taylor expansion, this time around f m+1 (and we again use the fact that f m+1 is the minimum on the path from f m ), gives us 1 Q( fm ) = Q( fm+1 ) + α2 Q ( ˜˜m ; hm ), f 2 m (25) √ where ˜˜m is some (other) function on the path from f m to fm+1 . [sent-382, score-0.52]
90 Therefore, if $|\alpha_m| < \sqrt{\gamma}\,|Q'(f_m; h_m)|/B$, then $Q(f_m) - Q(f_{m+1}) < \frac{\gamma |Q'(f_m; h_m)|^2}{2B}$, but by (23) $Q(f_m) - Q(f_{m+1}) \ge \frac{\gamma \sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B} \ge \frac{\gamma |Q'(f_m; h_m)|^2}{2B}$; therefore we conclude, by combining (24) and (21), that $|\alpha_m| \ge \frac{\sqrt{\gamma}\,|Q'(f_m; h_m)|}{B} \ge \frac{\gamma\sqrt{\beta}\,\sup_{h \in \mathcal{H}} |Q'(f_m; h)|}{B^{3/2}} \ge \frac{\gamma\,\varepsilon_m \sqrt{\beta}}{\ell_m B^{3/2}}$. [sent-383, score-3.628]
91 Since $\int_0^{m+1} \frac{dx}{a + bx} = \frac{1}{b}\ln\frac{a + b(m+1)}{a}$, we then get $Q(f_0) - Q(\bar f) \ge \frac{\gamma^2\beta^2\,\varepsilon_m^2}{4B^3}\cdot\frac{\beta}{2(Q(f_0)-Q(\bar f))}\,\ln\frac{\ell_0^2 + 2(Q(f_0)-Q(\bar f))(m+1)/\beta}{\ell_0^2}$; therefore $\varepsilon_m \le \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln\frac{\ell_0^2 + 2(Q(f_0)-Q(\bar f))(m+1)/\beta}{\ell_0^2}\Big)^{-1/2}$. [sent-386, score-0.822]
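A quick numerical check (illustration only, with made-up parameter values) of the sum-to-integral comparison used in this step, namely that the partial sums of 1/(a + bk) dominate the corresponding integral.

```python
import math

def check(a, b, m):
    # sum_{k=0}^{m} 1/(a + b*k) >= (1/b) * ln((a + b*(m+1)) / a) for a, b > 0.
    lhs = sum(1.0 / (a + b * k) for k in range(m + 1))
    rhs = math.log((a + b * (m + 1)) / a) / b
    return lhs, rhs, lhs >= rhs

for a, b, m in [(1.0, 0.5, 10), (4.0, 2.0, 100), (0.1, 3.0, 1000)]:
    print(check(a, b, m))
```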
92 If f¯ is such that Q( fm ) ≥ Q( f¯) for all m, then we do not need to do anything else. [sent-389, score-0.387]
93 However, if there exists $m$ such that $Q(f_m) < Q(\bar f)$ and $Q(f_{m-1}) \ge Q(\bar f)$, then the above proof is not valid for index $m-1$. [sent-390, score-0.825]
94 It also implies that the fastest decrease rate of the function τ : λ → inf f ∈Fλ R( f ) is O(e−λ ). [sent-403, score-0.206]
95 Then inf f ∈Fλn R( f ) ≥ C(ln n)−κ , where C depends on H and P , but does not depend on n. [sent-409, score-0.206]
96 Due to convexity, for any $z < \lambda$ we have $\varphi(z) > \varphi(\lambda)$; therefore $\varphi(\pi_\lambda(x)) = \varphi(\lambda) \le \varphi(\lambda) + \varphi(x) = \inf_{z \in [-\lambda,\lambda]} \varphi(z) + \varphi(x)$. [sent-425, score-0.206]
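A small numerical check (mine) of the resulting clipping inequality for the exponential loss, ϕ(π_λ(x)) ≤ ϕ_λ + ϕ(x) with ϕ_λ = e^{-λ}; the grid and the value of λ are assumptions of the example.

```python
import numpy as np

phi = lambda z: np.exp(-z)
lam = 2.0
x = np.linspace(-10.0, 10.0, 2001)
clipped = np.clip(x, -lam, lam)              # pi_lambda applied pointwise
assert np.all(phi(clipped) <= phi(lam) + phi(x) + 1e-12)
print("clipping inequality holds on the grid")
```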
97 An empirical comparison of voting classification algorithms: Bagging, boosting and variants. [sent-444, score-0.199]
98 On the rate of convergence of regularized boosting classifiers. [sent-451, score-0.231]
99 On weak base hypotheses and their implications for boosting regression and classification. [sent-522, score-0.247]
100 How boosting the margin can also boost classifier complexity. [sent-567, score-0.221]
wordName wordTfidf (topN-words)
[('ftn', 0.471), ('fm', 0.387), ('adaboost', 0.275), ('ln', 0.274), ('inf', 0.206), ('boosting', 0.199), ('tn', 0.188), ('raskin', 0.173), ('fk', 0.152), ('bartlett', 0.148), ('oost', 0.133), ('hm', 0.133), ('sup', 0.122), ('onsistent', 0.119), ('dvc', 0.11), ('rn', 0.106), ('da', 0.089), ('theorem', 0.085), ('hi', 0.085), ('lemma', 0.081), ('consistency', 0.08), ('dp', 0.071), ('clipped', 0.071), ('bickel', 0.071), ('stopped', 0.066), ('jiang', 0.061), ('ft', 0.059), ('suitably', 0.057), ('yoav', 0.054), ('fn', 0.054), ('mannor', 0.053), ('sequence', 0.053), ('proof', 0.051), ('leo', 0.051), ('vayatis', 0.051), ('lugosi', 0.049), ('freund', 0.049), ('suph', 0.047), ('condition', 0.047), ('shall', 0.044), ('bagging', 0.042), ('calibrated', 0.042), ('stopping', 0.041), ('breiman', 0.041), ('peter', 0.041), ('hk', 0.04), ('risk', 0.04), ('surely', 0.038), ('schapire', 0.038), ('statement', 0.038), ('berkeley', 0.038), ('measurable', 0.038), ('corollary', 0.038), ('appeal', 0.036), ('hoeffding', 0.035), ('iterations', 0.034), ('reference', 0.034), ('fi', 0.034), ('contraction', 0.033), ('koltchinskii', 0.033), ('pollard', 0.033), ('ers', 0.032), ('loss', 0.032), ('convergence', 0.032), ('robert', 0.032), ('lim', 0.032), ('annals', 0.031), ('michel', 0.031), ('reyzin', 0.031), ('shie', 0.031), ('classi', 0.031), ('analog', 0.031), ('bauer', 0.03), ('mason', 0.03), ('derivative', 0.029), ('arcing', 0.028), ('anthony', 0.028), ('tong', 0.027), ('drucker', 0.027), ('infx', 0.027), ('ledoux', 0.027), ('wenxin', 0.027), ('inequality', 0.027), ('mum', 0.025), ('yi', 0.025), ('base', 0.025), ('xi', 0.025), ('bor', 0.024), ('ron', 0.024), ('panchenko', 0.024), ('appendix', 0.024), ('er', 0.024), ('bound', 0.024), ('satis', 0.023), ('lipschitz', 0.023), ('quadratic', 0.023), ('weak', 0.023), ('trivially', 0.022), ('gn', 0.022), ('margin', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
2 0.17451113 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák
Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is √ proportional to 1/ n and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of m n time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem c 2007 Andr´ s Gy¨ rgy, Tam´ s Linder, G´ bor Lugosi and Gy¨ rgy Ottucs´ k. a o a a o a ¨ ´ G Y ORGY, L INDER , L UGOSI AND OTTUCS AK
3 0.16641811 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
4 0.12515453 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
5 0.12492726 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Nicolás García-Pedrajas, César García-Osorio, Colin Fyfe
Abstract: In this paper we propose a novel approach for ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual classifiers. The proposed approach combines the philosophy of boosting, putting more effort on difficult instances, with the basis of the random subspace method. Our main contribution is that instead of using a random subspace, we construct a projection taking into account the instances which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear projections are created by a neural network trained using only incorrectly classified instances. The feature subspace induced by the hidden layer of this network is used as the input space to a new classifier. The method is compared with bagging and boosting techniques, showing an improved performance on a large set of 44 problems from the UCI Machine Learning Repository. An additional study showed that the proposed approach is less sensitive to noise in the data than boosting methods. Keywords: classifier ensembles, boosting, neural networks, nonlinear projections
8 0.090011455 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.08998125 70 jmlr-2007-Ranking the Best Instances
10 0.088988505 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
11 0.083812289 90 jmlr-2007-Value Regularization and Fenchel Duality
12 0.073179118 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.069418952 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
14 0.060729083 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
15 0.058869801 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
16 0.05109255 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
17 0.049667366 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
19 0.039882269 71 jmlr-2007-Refinable Kernels
20 0.039200909 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
topicId topicWeight
[(0, 0.291), (1, -0.28), (2, -0.129), (3, -0.185), (4, 0.223), (5, 0.138), (6, -0.147), (7, 0.163), (8, -0.031), (9, -0.031), (10, 0.094), (11, -0.163), (12, -0.095), (13, -0.146), (14, -0.095), (15, -0.215), (16, 0.1), (17, -0.083), (18, 0.237), (19, 0.064), (20, 0.012), (21, -0.084), (22, -0.072), (23, -0.056), (24, -0.005), (25, 0.055), (26, -0.05), (27, 0.009), (28, 0.089), (29, -0.063), (30, -0.082), (31, 0.004), (32, -0.019), (33, 0.02), (34, 0.004), (35, 0.067), (36, -0.044), (37, 0.048), (38, -0.001), (39, -0.053), (40, 0.034), (41, 0.038), (42, -0.0), (43, -0.032), (44, 0.015), (45, 0.014), (46, -0.02), (47, 0.004), (48, 0.005), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.96649683 9 jmlr-2007-AdaBoost is Consistent
2 0.63333899 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
3 0.52474815 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
4 0.51564175 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
7 0.41002908 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
8 0.4091475 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
9 0.36850235 70 jmlr-2007-Ranking the Best Instances
10 0.32588521 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
11 0.31962499 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
12 0.30061159 90 jmlr-2007-Value Regularization and Fenchel Duality
13 0.28336304 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
14 0.22325444 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
15 0.21511793 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
16 0.21422234 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
17 0.20514506 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.17158312 71 jmlr-2007-Refinable Kernels
20 0.16967089 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
topicId topicWeight
[(12, 0.014), (15, 0.014), (28, 0.118), (40, 0.029), (60, 0.53), (80, 0.012), (85, 0.036), (98, 0.124)]
simIndex simValue paperId paperTitle
Author: Nikolas List, Hans Ulrich Simon
Abstract: We present a general decomposition algorithm that is uniformly applicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within ε of optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can be performed in polynomial time. If we restrict our considerations to instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for most variants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex Quadratic Optimization whereas the results by Hush and Scovel are specialized to SVM-optimization. Second, we achieve a higher rate of convergence even for the special case of SVM-optimization (despite the generality of our approach). Third, our analysis is technically simpler. We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions). Keywords: convex quadratic optimization, decomposition algorithms, support vector machines
same-paper 2 0.92605805 9 jmlr-2007-AdaBoost is Consistent
3 0.90235025 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the na¨ve Bayes learning scheme with ı the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed na¨ve Bayes only for post-processing the rule sets, the presented approach ı employs the na¨ve Bayes criterion to guide its search directly. The proposed technique is impleı mented in the N FOIL and T FOIL systems, which employ standard na¨ve Bayes and tree augmented ı na¨ve Bayes models respectively. We show that these integrated approaches to probabilistic model ı and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, na¨ve Bayes, statistical relational learning, inductive logic programming ı
4 0.89864486 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.61491901 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
6 0.6141693 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.59137166 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.59032434 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.56824738 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
12 0.54435652 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.50036567 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
14 0.49752054 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
15 0.4949379 44 jmlr-2007-Large Margin Semi-supervised Learning
16 0.49291277 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
17 0.48341072 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.48300341 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.47777101 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
20 0.46998334 47 jmlr-2007-Learning Horn Expressions with LOGAN-H