jmlr jmlr2007 jmlr2007-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see Breiman, 1998), theoretical results do not give a complete explanation of its effectiveness. [sent-12, score-0.199]
2 Later Breiman (1997, 1998, 2004) pointed out that boosting is a gradient descent type algorithm (see also Friedman et al. [sent-14, score-0.199]
3 Experimental results by Drucker and Cortes (1996), Quinlan (1996), Breiman (1998), Bauer and Kohavi (1999) and Dietterich (2000) showed that boosting is a very effective method that often leads to a low test error. [sent-17, score-0.199]
4 It was also noted that boosting continues to decrease test error long after the sample error becomes zero: though it keeps adding more weak classifiers to the linear combination of classifiers, the generalization error, perhaps surprisingly, usually does not increase. [sent-18, score-0.222]
5 However, some of the experiments suggested that there might be problems, since boosting performed worse than bagging in the presence of noise (Dietterich, 2000), and boosting concentrated not only on the “hard” areas, but also on outliers and noise (Bauer and Kohavi, 1999). [sent-19, score-0.44]
6 Upper bounds on the risk of boosted classifiers were obtained, based on the fact that boosting tends to maximize the margin of the training examples (Schapire et al., 1998; Koltchinskii and Panchenko, 2002). [sent-26, score-0.261]
7 But Breiman (1999) pointed out that margin-based bounds do not completely explain the success of boosting methods. [sent-27, score-0.199]
8 Jiang (2004) analyses the unmodified boosting algorithm and proves a process consistency property, under certain assumptions. [sent-36, score-0.279]
9 Process consistency means that there exists a sequence (tn ) such that if AdaBoost with sample size n is stopped after tn iterations, its risk approaches the Bayes risk. [sent-37, score-0.427]
10 Of course we want this probability to be as small as possible and close to the Bayes risk $L^\star = \inf_g L(g) = \mathbb{E}\big(\min\{\eta(X), 1-\eta(X)\}\big)$, where the infimum is taken over all possible (measurable) classifiers and $\eta(\cdot)$ is the conditional probability $\eta(x) = P(Y = 1 \mid X = x)$. [sent-57, score-0.246]
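The short sketch below is my own illustration, not code from the paper: it estimates this Bayes risk by Monte Carlo for a hypothetical toy distribution in which X is uniform on [0, 1] and η(x) = x; every name in the snippet is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # X ~ Uniform(0, 1), a toy choice
eta = x                                     # eta(x) = P(Y = 1 | X = x) = x, a toy choice
bayes_risk = float(np.mean(np.minimum(eta, 1.0 - eta)))
print(bayes_risk)                           # should be close to 1/4 for this toy model
```

For this particular toy model the exact value is E[min(X, 1-X)] = 1/4, which the Monte Carlo estimate approaches.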
11 Then the boosting procedure can be described as follows. [sent-69, score-0.199]
12 $R_{\varphi,n}(f_k) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} R_{\varphi,n}(f_{k-1} + \alpha h) + (1-\gamma)\, R_{\varphi,n}(f_{k-1})$. [sent-78, score-0.358]
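As a purely illustrative reading of this recursion (a sketch of mine, not the paper's algorithm), the snippet below performs one such step for the exponential surrogate loss by brute-force search over a finite base class and a grid of step sizes; a finite grid only approximates the infimum, which is exactly the kind of slack the factor γ allows. All function and variable names are assumptions of the sketch.

```python
import numpy as np

def phi(z):
    # Exponential surrogate loss phi(z) = exp(-z), the loss used by AdaBoost.
    return np.exp(-z)

def empirical_phi_risk(f_vals, y):
    # R_{phi,n}(f) = (1/n) sum_i phi(y_i f(x_i)); f is represented by its values on the sample.
    return float(np.mean(phi(y * f_vals)))

def boosting_step(f_vals, y, base_preds, alphas=np.linspace(-5.0, 5.0, 201)):
    """One greedy step f_k = f_{k-1} + alpha * h over a finite base class.
    base_preds has shape (num_base_classifiers, n) with entries in {-1, +1}."""
    best_risk, best_update = np.inf, None
    for h_vals in base_preds:
        for a in alphas:
            risk = empirical_phi_risk(f_vals + a * h_vals, y)
            if risk < best_risk:
                best_risk, best_update = risk, f_vals + a * h_vals
    return best_update, best_risk
```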
13 We shall also use the convex hull of $\mathcal{H}$ scaled by $\lambda \ge 0$, $\mathcal{F}_\lambda = \{\, f : f = \sum_{i=1}^{n} \lambda_i h_i,\ n \in \mathbb{N} \cup \{0\},\ \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = \lambda,\ h_i \in \mathcal{H} \,\}$, as well as the set of $k$-combinations, $k \in \mathbb{N}$, of functions in $\mathcal{H}$: $\mathcal{F}^k = \{\, f : f = \sum_{i=1}^{k} \lambda_i h_i,\ \lambda_i \in \mathbb{R},\ h_i \in \mathcal{H} \,\}$. [sent-84, score-0.384]
14 We also need to define the norm of an $f \in \mathcal{F}$ as the smallest $\ell_1$-norm over its representations: $\|f\| = \inf\{\, \sum_i |\alpha_i| : f = \sum_i \alpha_i h_i,\ h_i \in \mathcal{H} \,\}$. [sent-85, score-0.376]
15 However boosting in general and AdaBoost in particular may produce functions that cannot be appropriately bounded. [sent-105, score-0.199]
16 Furthermore, the complexity of the clipped class (as measured by its pseudo-dimension—see Pollard, 1984) grows slowly with the stopping time t, so we can show that the ϕ-risk of a clipped function is not much larger than its empirical ϕ-risk. [sent-109, score-0.183]
17 In order to compare the empirical ϕ-risk of the clipped function to that of a suitable reference sequence f¯n , we first use the fact that the empirical ϕ-risk of a clipped function πλ ◦ ft is not much larger than the empirical ϕ-risk of ft . [sent-111, score-0.347]
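A minimal sketch (mine, not the paper's code) of the clipping operator π_λ and of the two facts it is used for here: clipping never changes the sign of the combined classifier, and for the exponential loss it can raise the empirical ϕ-risk by at most ϕ(λ) = e^{-λ}. The random toy data and all names are assumptions of the example.

```python
import numpy as np

def clip_pi(f_vals, lam):
    # Clipping operator pi_lambda: truncate function values to [-lambda, lambda].
    return np.clip(f_vals, -lam, lam)

rng = np.random.default_rng(1)
f_vals = rng.normal(scale=10.0, size=1000)      # toy values of a combined classifier
y = rng.choice([-1.0, 1.0], size=1000)          # toy labels
lam = 2.0

emp_exp_risk = lambda g: float(np.mean(np.exp(-y * g)))

# Clipping never changes the sign, hence never changes the 0-1 classification decisions...
assert np.array_equal(np.sign(clip_pi(f_vals, lam)), np.sign(f_vals))
# ...and for the exponential loss it increases the empirical phi-risk by at most exp(-lam).
assert emp_exp_risk(clip_pi(f_vals, lam)) <= emp_exp_risk(f_vals) + np.exp(-lam) + 1e-12
print(emp_exp_risk(f_vals), emp_exp_risk(clip_pi(f_vals, lam)))
```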
18 Let the distribution $P$ and class $\mathcal{H}$ be such that $\lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} R_\varphi(f) = R_\varphi^\star$, where $R_\varphi^\star = \inf R_\varphi(f)$, the infimum being over all measurable functions. [sent-122, score-0.482]
19 The following set of conditions deals with uniform convergence and convergence of the boosting algorithm. [sent-126, score-0.263]
20 The main theorem (Theorem 1) shows that these, together with Condition 1, suffice for consistency of the boosting procedure. [sent-127, score-0.364]
21 Assume, without loss of generality, that for $\varphi_\lambda = \inf_{x \in [-\lambda,\lambda]} \varphi(x)$ we have $\lim_{\lambda \to \infty} \varphi_\lambda = \inf_x \varphi(x) = 0$. [sent-148, score-0.27]
22 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^\star$ as $n \to \infty$. [sent-150, score-1.298]
23 $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon^1_n(\omega)$ by (3); $\le R_{\varphi,n}(f_{t_n}) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n}$; $\le R_{\varphi,n}(\bar f_n) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n} + \varepsilon^2_n(\omega)$ by (5); $\le R_\varphi(\bar f_n) + \varepsilon^1_n(\omega) + \varphi_{\zeta_n} + \varepsilon^2_n(\omega) + \varepsilon^3_n(\omega)$ by (4). [sent-153, score-1.413]
24 But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore a.s. $L(g(f_{t_n})) \to L^\star$. [sent-162, score-0.942]
25 Hence, the boosting procedure is consistent if stopped after t n steps. [sent-165, score-0.265]
26 The almost sure formulation of Condition 2 does not provide explicit rates of convergence of L(g( ftn )) to L . [sent-166, score-0.503]
27 Uniform Convergence of $t_n$-Combinations. Here we show that Condition 2 (a) is satisfied for a variety of functions ϕ, and in particular for the exponential loss used in AdaBoost. [sent-171, score-0.22]
28 Lemma 3: For any $t \in \mathbb{N}$, if $d_{VC}(\mathcal{H}) \ge 2$, the following holds: $d_P(\mathcal{F}^t) \le 2(t+1)(d_{VC}(\mathcal{H})+1)\log_2[2(t+1)/\ln 2]$, where $d_P(\mathcal{F}^t)$ is the pseudo-dimension of the class $\mathcal{F}^t$. [sent-173, score-0.274]
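To give a feel for how this bound grows with the stopping time, here is a tiny evaluation script (an illustration of mine, not from the paper); the toy VC dimension 2 for one-dimensional decision stumps is an assumption of the example.

```python
import math

def pdim_bound(t, d_vc):
    """Lemma 3 upper bound on the pseudo-dimension of F^t (assumes d_vc >= 2):
    2(t+1)(d_vc+1) log2(2(t+1)/ln 2)."""
    return 2 * (t + 1) * (d_vc + 1) * math.log2(2 * (t + 1) / math.log(2))

# The bound grows only slightly faster than linearly in the number of iterations t.
for t in (10, 100, 1000):
    print(t, round(pdim_bound(t, d_vc=2)))
```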
29 The proof of consistency is based on the following result, which builds on the result by Koltchinskii and Panchenko (2002) and resembles a lemma due to Lugosi and Vayatis (2004, Lemma 2). [sent-174, score-0.212]
30 (Lemma 4.) Let $L_{\varphi,\zeta}$ and $M_{\varphi,\zeta}$ be the Lipschitz constant and the envelope bound of ϕ, defined via suprema over $x \in [-\zeta,\zeta]$. Then for $V = d_{VC}(\mathcal{H})$, $c = 24\int_0^1 \sqrt{\ln(8e/\varepsilon^2)}\, d\varepsilon$ and any $n$, $\zeta > 0$ and $t > 0$, $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$. [sent-176, score-0.548]
31 Also, for any $\delta > 0$, with probability at least $1-\delta$, $\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n} + M_{\varphi,\zeta}\sqrt{\ln(1/\delta)/(2n)}$. [sent-177, score-0.518]
32 For the exponential loss $\varphi(x) = e^{-x}$ we can choose $\zeta = \kappa \ln n$ and $t = n^{1-\varepsilon}$ with $\kappa > 0$, $\varepsilon \in (0,1)$ and $2\kappa - \varepsilon < 0$. [sent-183, score-0.274]
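A small sketch (mine) of what these choices look like numerically; the default values of ε and κ are assumptions of the example, constrained only by 0 < 2κ < ε < 1.

```python
import math

def stopping_parameters(n, eps=0.5, kappa=0.2):
    """Illustrative choice: t_n = n^(1-eps) iterations and clipping level zeta_n = kappa*ln(n),
    with the constraint 0 < 2*kappa < eps < 1 from the text."""
    assert 0 < 2 * kappa < eps < 1
    return int(n ** (1 - eps)), kappa * math.log(n)

for n in (1_000, 100_000, 10_000_000):
    print(n, stopping_parameters(n))
```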
33 Theorem 5: Define the variation of a function ϕ on the interval $[-a,a]$ (for $a > 0$) as $V_{\varphi,a} = \sup_{x \in [-a,a]} \varphi(x) - \inf_{x \in [-a,a]} \varphi(x)$. [sent-187, score-0.328]
34 If a sequence $\{\bar f_n\}_{n=1}^\infty$ satisfies the condition $\bar f_n(x) \in [-\lambda_n, \lambda_n]$ for all $x \in \mathcal{X}$, where $\lambda_n > 0$ is chosen so that $V_{\varphi,\lambda_n} = o\big(\sqrt{n/\ln n}\big)$, then a.s. $R_{\varphi,n}(\bar f_n) - R_\varphi(\bar f_n) \to 0$ as $n \to \infty$. [sent-188, score-0.374]
35 These restrictions are satisfied if $V_{\varphi,\lambda_n}^2 = o(n/\ln n)$, and the statement of the theorem follows. [sent-195, score-0.397]
36 In the case of the AdaBoost loss $\varphi(x) = e^{-x}$ we shall choose $\lambda_n = \kappa \ln n$, where $\kappa \in (0, 1/2)$. [sent-197, score-0.35]
37 Define $f_{k+1} = f_k + \alpha_{k+1} h_{k+1}$ such that $Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1-\gamma) Q(f_k)$ and $Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1})$. [sent-208, score-0.868]
38 Then for any reference function $\bar f$ and the sequence of functions $f_m$ produced by the boosting algorithm, the following bound holds for all $m > 0$ such that $Q(f_m) > Q(\bar f)$. [sent-209, score-0.31]
39 $Q(f_m) \le Q(\bar f) + \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln \frac{\ell_0^2 + c_3 m}{\ell_0^2}\Big)^{-1/2}$, (11) where $\ell_k = \|\bar f - f_k\|$, $c_3 = 2(Q(f_0)-Q(\bar f))/\beta$, $\beta = \inf\{Q''(f;h) : Q(\bar f) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$, $B = \sup\{Q''(f;h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$. [sent-210, score-0.813]
40 Similar bounds have been provided in earlier work under either an assumption of a bounded step size of the boosting algorithm or a positive lower bound on $Q''(f;h)$ for all $f$, $h$. [sent-217, score-0.223]
41 Since we consider boosting algorithms with unrestricted step size, the only option would be to assume a positive lower bound on the second derivative. [sent-218, score-0.223]
42 It is easy to see that the theorem above applies to the AdaBoost algorithm, since there we first choose the direction (base classifier) $h_i$ and then compute the step size $\alpha_i$ as $\alpha_i = \frac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i} = \frac{1}{2}\ln\frac{R(f_i) - R'(f_i; h_i)}{R(f_i) + R'(f_i; h_i)}$. [sent-221, score-0.871]
43 Now we only have to recall that this value of $\alpha_i$ corresponds to exact minimization in the direction $h_i$. [sent-222, score-0.238]
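To make the stopping strategy concrete, here is a self-contained sketch (mine, not the authors' code) of AdaBoost over decision stumps that stops after t_n = n^{1-ε} rounds. Labels are assumed to be in {-1, +1}; the stump learner, the default ε and the numerical guards are choices of the sketch, not prescriptions of the paper, whose analysis applies to any VC class of base classifiers.

```python
import numpy as np

def adaboost_early_stop(X, y, eps=0.5):
    """AdaBoost on decision stumps, stopped after t_n = n^(1-eps) rounds (illustrative only)."""
    n, d = X.shape
    t_n = max(1, int(n ** (1 - eps)))          # early-stopping rule t_n = n^(1-eps)
    w = np.full(n, 1.0 / n)                    # example weights
    stumps, alphas = [], []

    def stump_predict(Xm, j, thr, s):
        # A stump: sign of coordinate j against threshold thr, with orientation s.
        return s * np.sign(Xm[:, j] - thr + 1e-12)

    for _ in range(t_n):
        best = None
        for j in range(d):                     # brute-force search over all stumps
            for thr in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    err = np.sum(w[stump_predict(X, j, thr, s) != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        err = min(max(err, 1e-12), 1.0 - 1e-12)          # guard the logarithm
        alpha = 0.5 * np.log((1.0 - err) / err)          # exact line search for the exponential loss
        w *= np.exp(-alpha * y * stump_predict(X, j, thr, s))
        w /= w.sum()
        stumps.append((j, thr, s))
        alphas.append(alpha)

    def f(Xnew):
        # Real-valued combination f_{t_n}; its sign is the classifier g(f_{t_n}).
        return sum(a * stump_predict(Xnew, j, thr, s)
                   for a, (j, thr, s) in zip(alphas, stumps))

    return (lambda Xnew: np.sign(f(Xnew))), f
```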
44 Then with probability at least $1-\delta_n$, where $\delta_n = \exp\big(-2(R^\star)^2 n^{1-2\kappa}/a^2\big)$, the following holds: $R_n(f_{t_n}) \le R_n(\bar f_n) + 2\sqrt{2}\,\big(1 - R^\star(a-1)/a\big)\Big(\ln\frac{\lambda_n^2 + 2 t_n\,(a/(a-1) - R^\star)/R^\star}{\lambda_n^2}\Big)^{-1/2}$. [sent-233, score-0.274]
45 It is easy to see that the choice of the $\lambda_n$'s in the above theorem ensures that $\sum_{n=1}^{\infty} \delta_n < \infty$; therefore the Borel-Cantelli lemma guarantees that for $t_n \to \infty$ sufficiently fast (for example as $O(n^\alpha)$ for $\alpha \in (0,1)$) a.s. $\limsup_{n\to\infty}\big(R_n(f_{t_n}) - R_n(\bar f_n)\big) \le 0$. [sent-239, score-0.354]
46 If in addition to the conditions of Theorem 8 we require that $R(\bar f_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n$ for some $\varepsilon_n \to 0$, then together with Condition 1 this implies $R(\bar f_n) \to R^\star$ as $n \to \infty$, and Condition 2 (c) follows. [sent-242, score-0.25]
47 Corollary 9: Assume $V = d_{VC}(\mathcal{H}) < \infty$, $\lim_{\lambda\to\infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^\star$ and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0,1)$. [sent-245, score-0.426]
48 Then AdaBoost stopped at step $t_n$ returns a sequence of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^\star$. [sent-246, score-0.816]
49 As was suggested after the proof of Lemma 4 we may choose ζn = κ ln n for 2κ − ε < 0 (which also implies κ < 1/2) to satisfy Condition 2 (a). [sent-249, score-0.325]
50 Recall that the discussion after the proof of Theorem 8 suggests choosing the sequence $\{\bar f_n\}_{n=1}^\infty$ of reference functions such that $\bar f_n \in \mathcal{F}_{\lambda_n}$ and $R(\bar f_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n$ for $\varepsilon_n \to 0$ and $\lambda_n = \kappa \ln n$ with $\kappa \in (0,1/2)$, to ensure that Condition 2 (c) holds. [sent-250, score-0.618]
51 Discussion. We showed that AdaBoost is consistent if stopped sufficiently early, after $t_n$ iterations, for $t_n = O(n^{1-\varepsilon})$ with $\varepsilon \in (0,1)$. [sent-255, score-0.254]
52 We do not know what happens in between $O(n^{1-\varepsilon})$ and $O(n^2 \ln n)$. [sent-258, score-0.274]
53 The AdaBoost algorithm, as well as other versions of the boosting procedure, replaces the 0 − 1 loss with a convex function ϕ to overcome algorithmic difficulties associated with the non-convex optimization problem. [sent-260, score-0.231]
54 We may choose $t_n$, $\lambda_n$, $\zeta_n$ the same as for the exponential loss, or set $\lambda_n = n^{1/4-\vartheta_1}$, $\vartheta_1 \in (0,1/4)$, $\zeta_n = n^{\rho-\vartheta_2}$, $\vartheta_2 \in (0,\rho)$, $\rho = \min(\varepsilon/2, 1/4)$, to get the following analog of Corollary 9. [sent-273, score-0.251]
55 Assume $V = d_{VC}(\mathcal{H}) < \infty$, $\lim_{\lambda\to\infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^\star$ and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0,1)$. [sent-275, score-0.426]
56 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^\star$. [sent-276, score-0.827]
57 For example, for the logit loss $\varphi(x) = \ln(1 + e^{-x})$, Lemma 4 and Theorem 5 work, since $L_{\varphi,\lambda} = 1$ and $M_{\varphi,\lambda} = \ln(1 + e^{\lambda})$; hence choosing $t_n$, $\lambda_n$, $\zeta_n$ as for either the exponential or quadratic losses will work. [sent-278, score-0.243]
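The sketch below (mine, not from the paper) lists the three surrogate losses mentioned in the text and computes crude grid-based stand-ins for a Lipschitz constant and a sup-type bound on an interval [-λ, λ]; the exact definitions of L_{ϕ,λ} and M_{ϕ,λ} follow the paper, so the grid values are only numerical lower estimates of valid constants (for the logit loss they stay below the stated constant L = 1, which is an upper bound).

```python
import numpy as np

losses = {
    "exponential": lambda z: np.exp(-z),
    "logit":       lambda z: np.log1p(np.exp(-z)),
    "quadratic":   lambda z: (1.0 - z) ** 2,
}

def grid_constants(phi, lam, num=100_001):
    z = np.linspace(-lam, lam, num)
    vals = phi(z)
    lipschitz = float(np.max(np.abs(np.diff(vals)) / np.diff(z)))   # stand-in for L_{phi,lam}
    sup_dev = float(np.max(np.abs(vals - phi(0.0))))                # stand-in for M_{phi,lam}
    return lipschitz, sup_dev

for name, phi in losses.items():
    print(name, [round(c, 3) for c in grid_constants(phi, lam=3.0)])
```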
58 Rate of Convergence of $L(g(f_{t_n}))$ to $L^\star$. Here we formulate Condition 2 in a stricter form and prove consistency along with a rate of convergence of the boosting procedure to the Bayes risk. [sent-287, score-0.782]
59 $P\Big(\sup_{f \in \pi_{\zeta_n} \circ \mathcal{F}^{t_n}} |R_\varphi(f) - R_{\varphi,n}(f)| > \varepsilon^1_n\Big) < \delta^1_n$. [sent-292, score-0.31]
60 Theorem 11: Assume ϕ is classification-calibrated and convex, and, for $\varphi_\lambda = \inf_{x\in[-\lambda,\lambda]} \varphi(x)$, assume without loss of generality that $\lim_{\lambda\to\infty} \varphi_\lambda = \inf_x \varphi(x) = 0$. [sent-300, score-0.312]
61 Then the boosting procedure stopped at step $t_n$ returns a sequence of classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^\star$ as $n \to \infty$. [sent-302, score-1.298]
62 $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon^1_n$ by (12) (16); $\le R_{\varphi,n}(f_{t_n}) + \varepsilon^1_n + \varphi_{\zeta_n}$; $\le R_{\varphi,n}(\bar f_n) + \varepsilon^1_n + \varphi_{\zeta_n} + \varepsilon^3_n$ by (14) (17); $\le R_\varphi(\bar f_n) + \varepsilon^1_n + \varphi_{\zeta_n} + \varepsilon^3_n + \varepsilon^2_n$ by (13). [sent-304, score-1.413]
63 Now we appeal to the Borel-Cantelli lemma and arrive at $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \to R_\varphi^\star$ a.s. [sent-307, score-0.588]
64 But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore a.s. $L(g(f_{t_n})) \to L^\star$. [sent-313, score-0.942]
65 Hence the boosting procedure is consistent if stopped after t n steps. [sent-316, score-0.265]
66 Then there exists a non-decreasing function ψ, such that $\psi(0) = 0$, and with probability at least $1 - \delta^1_n - \delta^2_n - \delta^3_n$, $L(g(f_{t_n})) - L^\star \le \psi^{-1}\big(\varepsilon^1_n + \varepsilon^2_n + \varepsilon^3_n + \varphi_{\zeta_n} + \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) - R_\varphi^\star\big)$, where $\psi^{-1}$ is the inverse of ψ. [sent-319, score-0.677]
67 By the results of Bartlett et al. (2006), if ϕ is convex we have that $\psi(\theta) = \varphi(0) - \inf\big\{\tfrac{1+\theta}{2}\varphi(\alpha) + \tfrac{1-\theta}{2}\varphi(-\alpha) : \alpha \in \mathbb{R}\big\}$, and for any distribution and any measurable function $f$, $L(g(f)) - L^\star \le \psi^{-1}\big(R_\varphi(f) - R_\varphi^\star\big)$. (19) [sent-321, score-0.244]
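As a small illustration (mine), the snippet below evaluates this ψ-transform numerically for the exponential loss and compares it with the closed form ψ(θ) = 1 - sqrt(1 - θ²), which is the standard value of the transform for that loss; the grid and its range are assumptions of the example.

```python
import numpy as np

# psi(theta) = phi(0) - inf_alpha [ (1+theta)/2 * phi(alpha) + (1-theta)/2 * phi(-alpha) ]
alphas = np.linspace(-20.0, 20.0, 400_001)

def psi_numeric(theta):
    vals = 0.5 * (1 + theta) * np.exp(-alphas) + 0.5 * (1 - theta) * np.exp(alphas)
    return 1.0 - float(vals.min())          # phi(0) = 1 for the exponential loss

for theta in (0.1, 0.5, 0.9):
    print(theta, round(psi_numeric(theta), 6), round(1 - np.sqrt(1 - theta ** 2), 6))
```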
68 On the other hand, $R_\varphi(f) - R_\varphi^\star = \big(R_\varphi(f) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f)\big) + \big(\inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) - R_\varphi^\star\big)$. [sent-322, score-0.412]
69 The proof of Theorem 11 shows that for the function $f_{t_n}$, with probability at least $1 - \delta^1_n - \delta^2_n$, $R_\varphi(f_{t_n}) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) \le \varepsilon^1_n + \varepsilon^2_n + \varepsilon^3_n + \varphi_{\zeta_n}$. [sent-323, score-0.916]
70 (Lemma 4, restated.) Let $L_{\varphi,\zeta}$ and $M_{\varphi,\zeta}$ be defined via suprema over $x \in [-\zeta,\zeta]$ as above. Then for $V = d_{VC}(\mathcal{H})$, $c = 24\int_0^1 \sqrt{\ln(8e/\varepsilon^2)}\, d\varepsilon$ and any $n$, $\zeta > 0$ and $t > 0$, $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$. [sent-333, score-0.67]
71 Also, for any $\delta > 0$, with probability at least $1-\delta$, $\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta}\sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n} + M_{\varphi,\zeta}\sqrt{\ln(1/\delta)/(2n)}$. [sent-334, score-0.396]
72 Proof: The proof of this lemma is similar to the proof of Lugosi and Vayatis (2004, Lemma 2) in that we begin with symmetrization followed by the application of the “contraction principle”. [sent-335, score-0.183]
73 We use symmetrization to get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 2\,\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i\big(\varphi(-Y_i f(X_i)) - \varphi(0)\big)$, where the $\sigma_i$ are i.i.d. Rademacher random variables. [sent-336, score-0.244]
74 We then apply the contraction principle (Ledoux and Talagrand, pp. 112–113) with the function $\psi(x) = (\varphi(x) - \varphi(0))/L_{\varphi,\zeta}$ to get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 4 L_{\varphi,\zeta}\, \mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} (-\sigma_i Y_i f(X_i)) = 4 L_{\varphi,\zeta}\, \mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i)$. [sent-342, score-0.244]
75 Notice that functions in $\pi_\zeta \circ \mathcal{F}^t$ are bounded, clipped to absolute value at most ζ; therefore we can rescale $\pi_\zeta \circ \mathcal{F}^t$ by $(2\zeta)^{-1}$ and get $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) = 2\zeta\, \mathbb{E}\sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i)$. [sent-344, score-0.193]
76 Next, we use Dudley's entropy integral (Dudley, 1999) to bound the right-hand side above: $\mathbb{E}\sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) \le \tfrac{12}{\sqrt{n}} \int_0^\infty \sqrt{\ln N(\varepsilon, (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t, L_2(P_n))}\, d\varepsilon$. [sent-345, score-0.542]
77 And then, since Lemma 3 gives an upper bound on the pseudo-dimension of the class $\mathcal{F}^t$, we have $\mathbb{E}\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \tfrac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i) \le c\,\zeta \sqrt{(V+1)(t+1)\log_2[2(t+1)/\ln 2]\,/\,n}$, with the constant $c$ above being independent of $\mathcal{H}$, $t$ and ζ. [sent-350, score-0.396]
78 For any $i \in \{1,\dots,n\}$, $\sup_{(x_j,y_j)_{j=1}^n,\,(x_i',y_i')} \Big|\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| - \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}'(f)|\Big| \le \frac{M_{\varphi,\zeta}}{n}$, where $R_{\varphi,n}'(f)$ is obtained from $R_{\varphi,n}(f)$ by changing the pair $(x_i, y_i)$ to an independent pair $(x_i', y_i')$. [sent-357, score-0.416]
79 Define $f_{k+1} = f_k + \alpha_{k+1} h_{k+1}$ such that $Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1-\gamma) Q(f_k)$ and $Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1})$. [sent-365, score-0.868]
80 Then for any reference function $\bar f$ and the sequence of functions $f_m$ produced by the boosting algorithm, the following bound holds for all $m > 0$ such that $Q(f_m) > Q(\bar f)$. [sent-366, score-0.31]
81 $Q(f_m) \le Q(\bar f) + \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln \frac{\ell_0^2 + c_3 m}{\ell_0^2}\Big)^{-1/2}$, where $\ell_k = \|\bar f - f_k\|$, $c_3 = 2(Q(f_0)-Q(\bar f))/\beta$, $\beta = \inf\{Q''(f;h) : Q(\bar f) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$, $B = \sup\{Q''(f;h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$. [sent-367, score-0.813]
82 Let $f_m - \bar f = \sum_i \tilde\alpha_i \tilde h_i$, where the $\tilde\alpha_i$ and $\tilde h_i$ correspond to the best representation (with the $\ell_1$-norm of $\tilde\alpha$ equal to the norm $\|f_m - \bar f\|$). [sent-375, score-0.557]
83 Then from (20) and linearity of the derivative we have $\varepsilon_m \le \big|\sum_i \tilde\alpha_i Q'(f_m; \tilde h_i)\big| \le \sup_{h \in \mathcal{H}} |Q'(f_m; h)| \sum_i |\tilde\alpha_i|$, therefore $\sup_{h \in \mathcal{H}} |Q'(f_m; h)| \ge \frac{\varepsilon_m}{\|f_m - \bar f\|} = \frac{\varepsilon_m}{\ell_m}$. (21) [sent-376, score-1.434]
84 Next, $Q(f_m + \alpha h_m) = Q(f_m) + \alpha Q'(f_m; h_m) + \tfrac{\alpha^2}{2} Q''(\tilde f_m; h_m)$, where $\tilde f_m = f_m + \tilde\alpha_m h_m$ for some $\tilde\alpha_m \in [0, \alpha_m]$. [sent-377, score-1.947]
85 By assumption $\tilde f_m$ is on the path from $f_m$ to $f_{m+1}$, and we have assumed exact minimization in the given direction, hence $f_{m+1}$ is the lowest point in the direction $h_m$ starting from $f_m$, so we have the following bounds: $Q(\bar f) < Q(f_{m+1}) \le Q(\tilde f_m) \le Q(f_m) \le Q(f_0)$. [sent-378, score-0.52]
86 Then by the definition of β, which depends on $Q(\bar f)$, we have $Q(f_{m+1}) \ge Q(f_m) + \inf_{\alpha \in \mathbb{R}}\big(\alpha Q'(f_m; h_m) + \tfrac{\alpha^2}{2}\beta\big) = Q(f_m) - \frac{|Q'(f_m; h_m)|^2}{2\beta}$. (22) [sent-379, score-2.02]
87 On the other hand, $Q(f_m + \alpha_m h_m) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_m + \alpha h) + (1-\gamma) Q(f_m) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}}\big(Q(f_m) + \alpha Q'(f_m; h) + \tfrac{\alpha^2}{2}B\big) + (1-\gamma) Q(f_m) = Q(f_m) - \gamma\,\frac{\sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B}$. (23) [sent-380, score-2.867]
88 2B (23) Therefore, combining (22) and (23), we get |Q ( fm ; hm )| ≥ sup |Q ( fm ; h)| h∈H γβ . [sent-381, score-1.029]
89 B (24) Another Taylor expansion, this time around f m+1 (and we again use the fact that f m+1 is the minimum on the path from f m ), gives us 1 Q( fm ) = Q( fm+1 ) + α2 Q ( ˜˜m ; hm ), f 2 m (25) √ where ˜˜m is some (other) function on the path from f m to fm+1 . [sent-382, score-0.52]
90 Therefore, if $|\alpha_m| < \sqrt{\gamma}\,|Q'(f_m; h_m)|/B$, then $Q(f_m) - Q(f_{m+1}) < \frac{\gamma |Q'(f_m; h_m)|^2}{2B}$, but by (23) $Q(f_m) - Q(f_{m+1}) \ge \frac{\gamma \sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B} \ge \frac{\gamma |Q'(f_m; h_m)|^2}{2B}$; therefore we conclude, by combining (24) and (21), that $|\alpha_m| \ge \frac{\sqrt{\gamma}\,|Q'(f_m; h_m)|}{B} \ge \frac{\gamma\sqrt{\beta}\,\sup_{h \in \mathcal{H}} |Q'(f_m; h)|}{B^{3/2}} \ge \frac{\gamma\,\varepsilon_m \sqrt{\beta}}{\ell_m B^{3/2}}$. [sent-383, score-3.628]
91 Since $\int_0^{m+1} \frac{dx}{a + bx} = \frac{1}{b}\ln\frac{a + b(m+1)}{a}$, we then get $Q(f_0) - Q(\bar f) \ge \frac{\gamma^2\beta^2\,\varepsilon_m^2}{4B^3}\cdot\frac{\beta}{2(Q(f_0)-Q(\bar f))}\,\ln\frac{\ell_0^2 + 2(Q(f_0)-Q(\bar f))(m+1)/\beta}{\ell_0^2}$; therefore $\varepsilon_m \le \Big(\frac{8B^3 (Q(f_0)-Q(\bar f))^2}{\gamma^2 \beta^3}\Big)^{1/2} \Big(\ln\frac{\ell_0^2 + 2(Q(f_0)-Q(\bar f))(m+1)/\beta}{\ell_0^2}\Big)^{-1/2}$. [sent-386, score-0.822]
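A quick numerical check (illustration only, with made-up parameter values) of the sum-to-integral comparison used in this step, namely that the partial sums of 1/(a + bk) dominate the corresponding integral.

```python
import math

def check(a, b, m):
    # sum_{k=0}^{m} 1/(a + b*k) >= (1/b) * ln((a + b*(m+1)) / a) for a, b > 0.
    lhs = sum(1.0 / (a + b * k) for k in range(m + 1))
    rhs = math.log((a + b * (m + 1)) / a) / b
    return lhs, rhs, lhs >= rhs

for a, b, m in [(1.0, 0.5, 10), (4.0, 2.0, 100), (0.1, 3.0, 1000)]:
    print(check(a, b, m))
```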
92 If f¯ is such that Q( fm ) ≥ Q( f¯) for all m, then we do not need to do anything else. [sent-389, score-0.387]
93 However, if there exists $m$ such that $Q(f_m) < Q(\bar f)$ and $Q(f_{m-1}) \ge Q(\bar f)$, then the above proof is not valid for index $m-1$. [sent-390, score-0.825]
94 It also implies that the fastest decrease rate of the function τ : λ → inf f ∈Fλ R( f ) is O(e−λ ). [sent-403, score-0.206]
95 Then inf f ∈Fλn R( f ) ≥ C(ln n)−κ , where C depends on H and P , but does not depend on n. [sent-409, score-0.206]
96 Due to convexity, for any $z < \lambda$ we have $\varphi(z) > \varphi(\lambda)$; therefore $\varphi(\pi_\lambda(x)) = \varphi(\lambda) \le \varphi(\lambda) + \varphi(x) = \inf_{z \in [-\lambda,\lambda]} \varphi(z) + \varphi(x)$. [sent-425, score-0.206]
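A small numerical check (mine) of the resulting clipping inequality for the exponential loss, ϕ(π_λ(x)) ≤ ϕ_λ + ϕ(x) with ϕ_λ = e^{-λ}; the grid and the value of λ are assumptions of the example.

```python
import numpy as np

phi = lambda z: np.exp(-z)
lam = 2.0
x = np.linspace(-10.0, 10.0, 2001)
clipped = np.clip(x, -lam, lam)              # pi_lambda applied pointwise
assert np.all(phi(clipped) <= phi(lam) + phi(x) + 1e-12)
print("clipping inequality holds on the grid")
```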
97 An empirical comparison of voting classification algorithms: Bagging, boosting and variants. [sent-444, score-0.199]
98 On the rate of convergence of regularized boosting classifiers. [sent-451, score-0.231]
99 On weak base hypotheses and their implications for boosting regression and classification. [sent-522, score-0.247]
100 How boosting the margin can also boost classifier complexity. [sent-567, score-0.221]
wordName wordTfidf (topN-words)
[('ftn', 0.471), ('fm', 0.387), ('adaboost', 0.275), ('ln', 0.274), ('inf', 0.206), ('boosting', 0.199), ('tn', 0.188), ('raskin', 0.173), ('fk', 0.152), ('bartlett', 0.148), ('oost', 0.133), ('hm', 0.133), ('sup', 0.122), ('onsistent', 0.119), ('dvc', 0.11), ('rn', 0.106), ('da', 0.089), ('theorem', 0.085), ('hi', 0.085), ('lemma', 0.081), ('consistency', 0.08), ('dp', 0.071), ('clipped', 0.071), ('bickel', 0.071), ('stopped', 0.066), ('jiang', 0.061), ('ft', 0.059), ('suitably', 0.057), ('yoav', 0.054), ('fn', 0.054), ('mannor', 0.053), ('sequence', 0.053), ('proof', 0.051), ('leo', 0.051), ('vayatis', 0.051), ('lugosi', 0.049), ('freund', 0.049), ('suph', 0.047), ('condition', 0.047), ('shall', 0.044), ('bagging', 0.042), ('calibrated', 0.042), ('stopping', 0.041), ('breiman', 0.041), ('peter', 0.041), ('hk', 0.04), ('risk', 0.04), ('surely', 0.038), ('schapire', 0.038), ('statement', 0.038), ('berkeley', 0.038), ('measurable', 0.038), ('corollary', 0.038), ('appeal', 0.036), ('hoeffding', 0.035), ('iterations', 0.034), ('reference', 0.034), ('fi', 0.034), ('contraction', 0.033), ('koltchinskii', 0.033), ('pollard', 0.033), ('ers', 0.032), ('loss', 0.032), ('convergence', 0.032), ('robert', 0.032), ('lim', 0.032), ('annals', 0.031), ('michel', 0.031), ('reyzin', 0.031), ('shie', 0.031), ('classi', 0.031), ('analog', 0.031), ('bauer', 0.03), ('mason', 0.03), ('derivative', 0.029), ('arcing', 0.028), ('anthony', 0.028), ('tong', 0.027), ('drucker', 0.027), ('infx', 0.027), ('ledoux', 0.027), ('wenxin', 0.027), ('inequality', 0.027), ('mum', 0.025), ('yi', 0.025), ('base', 0.025), ('xi', 0.025), ('bor', 0.024), ('ron', 0.024), ('panchenko', 0.024), ('appendix', 0.024), ('er', 0.024), ('bound', 0.024), ('satis', 0.023), ('lipschitz', 0.023), ('quadratic', 0.023), ('weak', 0.023), ('trivially', 0.022), ('gn', 0.022), ('margin', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
2 0.17451113 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák
Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is √ proportional to 1/ n and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of m n time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem c 2007 Andr´ s Gy¨ rgy, Tam´ s Linder, G´ bor Lugosi and Gy¨ rgy Ottucs´ k. a o a a o a ¨ ´ G Y ORGY, L INDER , L UGOSI AND OTTUCS AK
3 0.16641811 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
4 0.12515453 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
5 0.12492726 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Nicolás García-Pedrajas, César García-Osorio, Colin Fyfe
Abstract: In this paper we propose a novel approach for ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual classifiers. The proposed approach combines the philosophy of boosting, putting more effort on difficult instances, with the basis of the random subspace method. Our main contribution is that instead of using a random subspace, we construct a projection taking into account the instances which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear projections are created by a neural network trained using only incorrectly classified instances. The feature subspace induced by the hidden layer of this network is used as the input space to a new classifier. The method is compared with bagging and boosting techniques, showing an improved performance on a large set of 44 problems from the UCI Machine Learning Repository. An additional study showed that the proposed approach is less sensitive to noise in the data than boosting methods. Keywords: classifier ensembles, boosting, neural networks, nonlinear projections
8 0.090011455 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.08998125 70 jmlr-2007-Ranking the Best Instances
10 0.088988505 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
11 0.083812289 90 jmlr-2007-Value Regularization and Fenchel Duality
12 0.073179118 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.069418952 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
14 0.060729083 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
15 0.058869801 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
16 0.05109255 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
17 0.049667366 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
19 0.039882269 71 jmlr-2007-Refinable Kernels
20 0.039200909 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
topicId topicWeight
[(0, 0.291), (1, -0.28), (2, -0.129), (3, -0.185), (4, 0.223), (5, 0.138), (6, -0.147), (7, 0.163), (8, -0.031), (9, -0.031), (10, 0.094), (11, -0.163), (12, -0.095), (13, -0.146), (14, -0.095), (15, -0.215), (16, 0.1), (17, -0.083), (18, 0.237), (19, 0.064), (20, 0.012), (21, -0.084), (22, -0.072), (23, -0.056), (24, -0.005), (25, 0.055), (26, -0.05), (27, 0.009), (28, 0.089), (29, -0.063), (30, -0.082), (31, 0.004), (32, -0.019), (33, 0.02), (34, 0.004), (35, 0.067), (36, -0.044), (37, 0.048), (38, -0.001), (39, -0.053), (40, 0.034), (41, 0.038), (42, -0.0), (43, -0.032), (44, 0.015), (45, 0.014), (46, -0.02), (47, 0.004), (48, 0.005), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.96649683 9 jmlr-2007-AdaBoost is Consistent
2 0.63333899 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
3 0.52474815 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
4 0.51564175 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk
7 0.41002908 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
8 0.4091475 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
9 0.36850235 70 jmlr-2007-Ranking the Best Instances
10 0.32588521 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
11 0.31962499 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
12 0.30061159 90 jmlr-2007-Value Regularization and Fenchel Duality
13 0.28336304 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
14 0.22325444 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
15 0.21511793 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
16 0.21422234 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
17 0.20514506 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.17158312 71 jmlr-2007-Refinable Kernels
20 0.16967089 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
topicId topicWeight
[(12, 0.014), (15, 0.014), (28, 0.118), (40, 0.029), (60, 0.53), (80, 0.012), (85, 0.036), (98, 0.124)]
simIndex simValue paperId paperTitle
Author: Nikolas List, Hans Ulrich Simon
Abstract: We present a general decomposition algorithm that is uniformly applicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within ε of optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can be performed in polynomial time. If we restrict our considerations to instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for most variants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex Quadratic Optimization whereas the results by Hush and Scovel are specialized to SVM-optimization. Second, we achieve a higher rate of convergence even for the special case of SVM-optimization (despite the generality of our approach). Third, our analysis is technically simpler. We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions). Keywords: convex quadratic optimization, decomposition algorithms, support vector machines
same-paper 2 0.92605805 9 jmlr-2007-AdaBoost is Consistent
3 0.90235025 43 jmlr-2007-Integrating Naïve Bayes and FOIL
Author: Niels Landwehr, Kristian Kersting, Luc De Raedt
Abstract: A novel relational learning approach that tightly integrates the na¨ve Bayes learning scheme with ı the inductive logic programming rule-learner FOIL is presented. In contrast to previous combinations that have employed na¨ve Bayes only for post-processing the rule sets, the presented approach ı employs the na¨ve Bayes criterion to guide its search directly. The proposed technique is impleı mented in the N FOIL and T FOIL systems, which employ standard na¨ve Bayes and tree augmented ı na¨ve Bayes models respectively. We show that these integrated approaches to probabilistic model ı and rule learning outperform post-processing approaches. They also yield significantly more accurate models than simple rule learning and are competitive with more sophisticated ILP systems. Keywords: rule learning, na¨ve Bayes, statistical relational learning, inductive logic programming ı
4 0.89864486 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.61491901 71 jmlr-2007-Refinable Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases
6 0.6141693 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.59137166 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.59032434 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.56824738 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
12 0.54435652 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
13 0.50036567 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
14 0.49752054 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
15 0.4949379 44 jmlr-2007-Large Margin Semi-supervised Learning
16 0.49291277 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
17 0.48341072 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
18 0.48300341 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.47777101 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
20 0.46998334 47 jmlr-2007-Learning Horn Expressions with LOGAN-H