jmlr jmlr2006 jmlr2006-82 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai
Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. [sent-11, score-0.166]
2 The AdaBoost classifier is given by $\mathrm{sgn}\big(\sum_{m=1}^M \lambda_m h_m(x)\big)$, where $\lambda_m \in \mathbb{R}$, $h_m \in \mathcal{H}$ are found sequentially by the following algorithm: 0. [sent-25, score-0.325]
3 Set $\lambda_m = \frac{1}{2}\log\frac{\sum_{h_m(X_i)=Y_i} c_i}{\sum_{h_m(X_i)\neq Y_i} c_i}$ [sent-29, score-0.208]
4 $= \frac{1}{2}\log\frac{\sum_{i=1}^n c_i + \sum_{i=1}^n c_i h_m(X_i)Y_i}{\sum_{i=1}^n c_i - \sum_{i=1}^n c_i h_m(X_i)Y_i}$. [sent-30, score-0.208]
5 Set $c_i \leftarrow c_i \exp\{-\lambda_m h_m(X_i)Y_i\}$ and $m \leftarrow m+1$; if $m \le M$, return to step 1. [sent-33, score-0.175]
6 Bühlmann and Yu (2003) considered L2 boosting starting from very smooth functions. [sent-47, score-0.182]
7 The structure of, and the differences between, the population and sample versions of the optimization problem have been explored in various ways by Jiang (2003), Zhang and Yu (2003), Bühlmann (2003), Bartlett, Jordan, and McAuliffe (2003), and Bickel and Ritov (2003). [sent-53, score-0.077]
8 To relate our work and that of Bühlmann (2003), Bühlmann and Yu (2003), Lugosi and Vayatis (2004), Zhang (2004), Zhang and Yu (2003), and Bartlett, Jordan, and McAuliffe (2003) to the minimax results of Mammen and Tsybakov (1999), Baraud (2001), and Tsybakov (2001). [sent-57, score-0.16]
9 In Section 2 we will discuss the population version of the basic boosting algorithms and show how their convergence and that of more general greedy algorithms can be derived from a generalization of Theorem 3 of Mallat and Zhang (1993) with a simple proof. [sent-58, score-0.166]
10 In Section 5 we address the issue of bounding the time to perfect separation for the different boosting algorithms (including the standard AdaBoost). [sent-61, score-0.117]
11 Let $\mathcal H = \mathcal H' \cup (-\mathcal H')$, where $\mathcal H'$ is a subset of $\mathcal H$ whose members are linearly independent, with linear span $\mathcal F^\infty = \{\sum_{j=1}^k \lambda_j h_j : \lambda_j \in \mathbb R,\ h_j \in \mathcal H,\ 1\le j\le k,\ 1\le k<\infty\}$. [sent-72, score-0.158]
12 by, $f_{m+1} = f_m + \lambda_m h_m$, $\lambda_m\in\mathbb R$, $h_m\in\mathcal H$, and $w(f_m+\lambda_m h_m) \le \alpha\min_{\lambda\in\mathbb R,\,h\in\mathcal H} w(f_m+\lambda h) + (1-\alpha)\,w(f_m)$. [sent-81, score-3.784]
13 (1) Generalize Algorithm I to Algorithm II: like Algorithm I, but replace (1) by $w(f_m+\lambda_m h_m) + \gamma\lambda_m^2 \le \alpha\min_{\lambda\in\mathbb R,\,h\in\mathcal H}\big(w(f_m+\lambda h)+\gamma\lambda^2\big) + (1-\alpha)\,w(f_m)$. [sent-82, score-2.52]
14 The standard boosting algorithms theoretically correspond to α = 1, although in practice, since numerical minimization is used, α may equal 1 only approximately. [sent-85, score-0.117]
15 Our generalization makes for a simple proof and covers the possibility that the minimum of w( fm + λh) over H and R is not assumed, or multiply assumed. [sent-86, score-0.835]
16 For any c1 and c2 such that ω0 < c1 < c2 < ∞, 0 < inf {w′′ ( f , h) : c1 < w( f ) < c2 , h ∈ H } ≤ sup {w′′ ( f , h) : w( f ) < c2 , h ∈ H } < ∞. [sent-91, score-0.129]
17 Theorem 1 Under Assumption GS1, any sequence of functions generated according to Algorithm I satisfies: w( fm ) ≤ ω0 + cm and if cm > 0: w( fm ) − w( fm+1 ) ≥ ξ(w( fm )) > 0 where the sequence cm → 0 and the function ξ(·) depend only on α, the initial points of the iterates, and H . [sent-94, score-2.648]
18 The same conclusion holds under Condition GS2 for any sequence fm generated according to algorithm II. [sent-95, score-0.848]
19 That is, given $F_0\in\mathcal H$ fixed, we define for $m\ge 0$: $\lambda_m(h) = \arg\min_{\lambda\in\mathbb R} E\,W\big(Y(F_m(X)+\lambda h(X))\big)$, $h_m = \arg\min_{h\in\mathcal H} E\,W\big(Y(F_m(X)+\lambda_m(h)h(X))\big)$, and $F_{m+1} = F_m + \lambda_m(h_m)h_m$. [sent-107, score-0.215]
20 W ′′ (F) is bounded above and below on {F : c1 < W (F) < c2 } for all c1 , c2 such that inf EW (F) < c1 < c2 < EW (F0 ). [sent-126, score-0.074]
21 The functions commonly appearing in boosting, such as $W_1(t)=e^{-t}$, $W_2(t)=-2t+t^2$, $W_3(t)=\log(1+e^{-2t})$, satisfy condition P4 if P1 also holds. [sent-136, score-0.133]
22 Consistency of the Boosting Algorithm. In this section we study the Bayes consistency properties of the sample versions of the boosting algorithms we considered in Section 2. [sent-148, score-0.133]
23 In particular, we shall (i) show that under mild additional conditions, there will exist a random sequence $m_n\to\infty$ such that $\hat F_{m_n}\xrightarrow{P} F_\infty$, where $\hat F_m$ is defined below as the $m$th sample iterate, and moreover, that such a sequence can be determined using the data. [sent-149, score-0.178]
24 with $\bar\vartheta_0=\vartheta_0$ and satisfying: $\bar\vartheta_{m+1}\in\Pi_m(\bar\vartheta_m)$ and $K(\bar\vartheta_{m+1}) \le \alpha\inf_{\vartheta\in\Pi_m(\bar\vartheta_m)} K(\vartheta) + (1-\alpha)K(\bar\vartheta_m)$. [sent-171, score-0.074]
25 The resemblance to Gauss-Southwell Algorithm I and the boosting procedures is not accidental. [sent-172, score-0.117]
26 If $\{\vartheta_m\}\in S(\vartheta_0,\alpha)$ with any initial $\vartheta_0$, then $K(\vartheta_m)-K(\vartheta_{m+1}) \ge \xi\big(K(\vartheta_m)-K(\vartheta_\infty)\big)$, and $K(\bar\vartheta_m)-K(\vartheta_\infty) \le c_m$ where $c_m\to 0$ uniformly over $S(\vartheta_0,\alpha)$. [sent-174, score-0.078]
27 Let $M_n\to\infty$ be some sequence, and let $m_n = \arg\min_{m\le M_n} K(\vartheta_{m,n})$. [sent-189, score-0.159]
28 Suppose otherwise: $\inf_{n\in N,\,m\le M_n} K(\hat\vartheta_{m,n}) - K(\vartheta_\infty) \ge c_1 > 0$, (3) where $N$ is unbounded with positive probability. [sent-192, score-0.074]
29 Let $\hat m_n = \arg\max\big\{m'\le M_n : \forall m\le m',\ \varepsilon_{m-1,n}+2\varepsilon_{m,n} < (\alpha'-\alpha)\xi(c_1)\ \&\ \hat\vartheta_{m,n}\in A_m\big\}$. [sent-197, score-0.159]
30 Therefore, since $\hat m_n\to\infty$, $K(\hat\vartheta_{\hat m_n,n})\to K(\vartheta_\infty)$, contradicting (3). [sent-204, score-0.138]
31 In fact we have proved that sequences mn can be chosen in the following way involving K. [sent-205, score-0.138]
32 Then $m_n^* = \arg\min\{\hat K_n(\hat\vartheta_{m,n}) : 1\le m\le M_n\}$ yields an appropriate $\hat\vartheta_{m_n}$ sequence. [sent-211, score-0.159]
33 Test Bed Stopping. Again we face the issue of data-dependent and in some way optimal selection of $m_n$. [sent-235, score-0.138]
34 Let $\hat\vartheta_m : \mathcal X\to\mathbb R$, $1\le m\le m_n$, be data-dependent functions which depend only on $(X_1,Y_1),\dots$ [sent-245, score-0.138]
35 Condition C very simply asks that the test sample size $B_n$ be large only: (i) in terms of $r_n$, the minimax rate of convergence; (ii) in terms of the logarithm of the number of procedures being studied. [sent-265, score-0.096]
36 Hence, the boosting algorithm perfectly separates the data after at most $2\kappa\big(n\sum_{j=1}^k|\alpha_j|\big)^2$ steps. [sent-303, score-0.117]
37 Proof Let, for $i$ such that $Y_i F_m(x_i)<0$, $f_m(\lambda;h) = n^{-1}\sum_{s=1}^n W\big(Y_s(F_m(x_s)+\lambda h(x_s))\big)$, and $f_m'(0;h) = d f_m(\lambda;h)/d\lambda\,\big|_{\lambda=0}$. [sent-304, score-2.505]
38 Then $\sum_{j=1}^k\alpha_j f_m'(0;h_j) = n^{-1}\sum_{j=1}^k\alpha_j\sum_{s=1}^n W'\big(Y_s F_m(x_s)\big)Y_s h_j(x_s) = n^{-1}W'\big(Y_i F_m(x_i)\big)$. [sent-310, score-0.835]
39 Hence $\inf_{h\in\mathcal H} f_m'(0;h) \le \frac{1}{n\sum_{j=1}^k\alpha_j}\min_i W'\big(Y_i F_m(x_i)\big) \le \frac{W'(0)}{n\sum_{j=1}^k\alpha_j} = \frac{-1}{n\sum_{j=1}^k\alpha_j}$, (4) since $Y_i F_m(x_i)<0$ for at least one $i$. [sent-311, score-0.924]
40 Denote by $\bar\lambda$, $\bar h$ the minimizing pair; by Taylor expansion, $f_m(\bar\lambda;\bar h) = f_m(0;\bar h) + \bar\lambda f_m'(0;\bar h) + \frac{\bar\lambda^2}{2n}\sum_{s=1}^n W''\big(Y_s F_m(x_s)+\lambda(\bar\lambda)\bar h(x_s)\big) = f_m(0;\bar h) + \inf_\lambda\Big[\lambda f_m'(0;\bar h) + \frac{\lambda^2}{2n}\sum_{s=1}^n W''\big(Y_s F_m(x_s)+\lambda(\lambda)\bar h(x_s)\big)\Big]$, where $\lambda(\lambda)$ lies between 0 and $\lambda$. [sent-316, score-4.249]
41 Combining (4) and (5) and the minimizing property of $\bar h$, $f_m(\bar\lambda;\bar h) \le f_m(0;\bar h) - \frac{f_m'(0;\bar h)^2}{2\kappa} \le f_m(0;\bar h) - \frac{1}{2\kappa\big(n\sum_{j=1}^k\alpha_j\big)^2}$. [sent-318, score-3.34]
42 We note that Bühlmann and Yu (2003) have introduced a version of L2 boosting which already achieves minimax rates adaptively for Sobolev classes on $\mathbb R$. [sent-325, score-0.234]
43 However, their construction is in a different spirit than that of most boosting papers. [sent-326, score-0.117]
44 They start out with H consisting of one extremely smooth and complex function and show that boosting reduces bias (roughness of the function) while necessarily increasing variance. [sent-327, score-0.139]
45 Early stopping is still necessary and they show it can achieve minimax rates. [sent-328, score-0.091]
46 It follows, using a result of Yang (1999) that our rule is adaptive minimax for classification loss for some of the classes we have mentioned as well. [sent-329, score-0.074]
47 They went further by obtaining near minimax results on suitable sets. [sent-338, score-0.074]
48 We also limit our results to L2 boosting, although we believe this limitation is primarily due to the lack of minimax theorems for prediction when losses other than L2 are considered. [sent-339, score-0.074]
49 We show in Theorem 8 our variant of L2 boosting achieves minimax rates for estimating E(Y |X) in a wide class of situations. [sent-341, score-0.191]
50 (2003), that the rules we propose are also minimax for 0–1 loss in suitable spaces. [sent-345, score-0.074]
51 $\|f\|_\infty = \sup_{x,y}|f(x,y)|$; $\|f\|^2 = \int f^2\,d\mu$; $F_P(X) \equiv E_P(Y|X)$; $\hat F_m(X) = \arg\min\{\|t(X)-Y\|_n^2 : t\in\mathcal F^{(m)}\}$; $F_m(X) = \arg\min\{\|t(X)-Y\|_P^2 : t\in\mathcal F^{(m)}\}$; $E_X \equiv$ conditional expectation given $X_1,\dots$ [sent-369, score-0.097]
52 Let $r_n(P) = \inf\{E_P\|\hat F_m - F_P\|^2 : 1\le m\le M_n\}$, and $r_n \equiv \sup_{P\in\mathcal P} r_n(P)$. [sent-402, score-0.121]
53 Thus, $r_n$ is the minimax regret for an oracle knowing $P$ but restricted to $\{\hat F_m\}$. [sent-403, score-0.163]
54 If $\Delta_{m,n} = O(D_m/n)$, then $\sup_P E_P\|\hat F(X)-F_P(X)\|^2 \asymp r_n$. [sent-405, score-0.077]
55 3 Discussion. 1) If $X\in\mathbb R$ and $\mathcal H^{(m)}$ consists of stumps with the discontinuity at a dyadic rational $j/2^m$, then $\mathcal F^{(m)}$ is the linear space of Haar wavelets of order $m$. [sent-431, score-0.085]
56 The resulting minimax risk, $\min_{\hat F}\max\big\{E_P\|\hat F(X)-E_P(Y|X)\|^2 : E_P(Y|X)\in\mathcal F\big\}$, is always of order $n^{-1+\varepsilon}\,\Omega(n)$ where $\Omega(n)$ is typically constant and $0<\varepsilon<1$. [sent-442, score-0.089]
57 Tsybakov implicitly defines a class of FP for which he is able to specify classification minimax rates. [sent-448, score-0.074]
58 $\dots,x_{d-1})$, $P[|F_P(x)|\le t]\le Ct$ for all $0\le t\le 1$, $b\in\Sigma(\ell,L)\}$. Tsybakov, following Mammen and Tsybakov (1999), shows that the classification minimax regret rate for $\mathcal P$ (Theorem 2 of Tsybakov (2001) for $K=2$) is $\frac{2\ell}{3\ell+(d-1)}$. [sent-459, score-0.109]
59 On the other hand, if we assume that $Y = F_P(x)+\varepsilon$ where $\varepsilon$ is independent of $X$, bounded, and $E(\varepsilon)=0$, then the L2 minimax regret rate is $2\ell/(2\ell+(d-1))$ – see Birgé and Massart (1999), Sections 4. [sent-460, score-0.109]
60 Our Theorem 9 now yields a classification minimax regret rate of $\frac{2}{3}\cdot\frac{2\ell}{2\ell+(d-1)} = \frac{2\ell}{3\ell+\frac{3}{2}(d-1)}$, which is slightly worse than what can be achieved using Tsybakov’s not as readily computable procedures. [sent-463, score-0.134]
61 However, note that as $\ell\to\infty$, so that $F_P$ and the boundary become arbitrarily smooth, L2 boosting approaches the best possible rate for $\mathcal P_\ell$ of $\frac{2}{3}$. [sent-464, score-0.117]
62 The boosting algorithm is a Gauss-Southwell minimization of a classification loss function (which typically dominates the 0-1 misclassification loss). [sent-474, score-0.117]
63 We show that the output of the boosting algorithm follows the theoretical path as if it were applied to the true distribution of the population. [sent-475, score-0.117]
64 However, there are no simple rate results other than those of Bühlmann and Yu (2003), which we discuss, for the convergence of the boosting classifier to the Bayes classifier. [sent-477, score-0.175]
65 We showed that rate results can be obtained when the boosting algorithm is modified to a cautious version, in which at each step the boosting is done only over a small set of permitted directions. [sent-478, score-0.248]
66 Proof of Theorem 1: Let $w_0 = \inf_{f\in\mathcal F^\infty} w(f)$. [sent-481, score-0.074]
67 Let $f_k^* = \sum_m \alpha_{km}h_{km}$, $h_{k,m}\in\mathcal H$, $\sum_m|\alpha_{km}|<\infty$, $k=0,1,2,\dots$ [sent-482, score-0.142]
68 be any member of $\mathcal F^\infty$ such that (i) $f_0^* = f_0$; (ii) $w(f_k^*)\searrow w_0$ is a strictly decreasing sequence; (iii) the following condition is satisfied: $w(f_k^*) \ge \alpha w_0 + (1-\alpha)\,w(f_{k-1}^*) + (1-\alpha)(\nu_{k-1}-\nu_k)$, (11) where $\nu_k\searrow 0$ is a strictly decreasing real sequence. [sent-485, score-0.3]
69 The construction of the sequence { fk } is possible since, by assumption, F ∞ is dense in the image of w(·). [sent-486, score-0.155]
70 That is, we can start with the sequence $\{w(f_k^*)\}$, and then look for suitable $\{f_k^*\}$. [sent-487, score-0.297]
71 Select now $f_k^*$ such that $w_0 + c(1-\eta)\gamma_k \le w(f_k^*) \le w_0 + c(1+\eta)\gamma_k$. [sent-491, score-0.284]
72 (with $f_1^*$ selected such that $w(f_1^*) \le w(f_0)$). Our argument rests on the following. Lemma 12 There is a sequence $m_k\to\infty$ such that $w(f_m)\le w(f_k^*)+\nu_k$ for $m\ge m_k$, $k=1,2,\dots$ [sent-493, score-1.284]
73 , and $m_k\le\zeta_k(m_{k-1})<\infty$, where $\zeta_k(\cdot)$ is a monotone non-decreasing function which depends only on the sequences $\{\nu_k\}$ and $\{f_k^*\}$. [sent-496, score-0.289]
74 and $\tau_k = 2\|f_0-f_k^*\|^2$, $\rho_k = \frac{16\,(w(f_0)-w_0)}{\alpha\beta_k}$. [sent-501, score-0.142]
75 (14) Having defined $m_k$ we establish (12) as part of our induction hypothesis for $m_{k-1}<m\le m_k$. [sent-502, score-0.308]
76 Having established the induction for m ≤ mk−1 we define mk as follows. [sent-505, score-0.161]
77 Write now the RHS of (12) as $g(m_{k-1})$, where $g(\nu) \equiv \max\Big\{\nu_k,\ \Big[\frac{512B\,(w(f_{k-1}^*)-w_0)}{\alpha^2\beta_k\,(m-\nu+1)\log\big(1+\frac{\alpha\beta_k(\tau_k+\rho_k\nu)}{8(w(f_{k-1}^*)-w_0)}\big)}\Big]^{1/2}\Big\}$. We can now pick $\zeta_k(\nu) \equiv \max\big\{\nu+1,\ \min\{m : g(\nu)\le\nu_k\}\big\}$, and define $m_k = \zeta_k(m_{k-1})$. [sent-506, score-0.164]
78 Note that $\{\beta_k\}$, $\{\tau_k\}$, $\{\rho_k\}$, and $B$ depend only on the sequences $\{f_k^*\}$ and $\{\nu_k\}$. [sent-507, score-0.142]
79 By induction (12) holds for $m\le m_{k-1}$, and may hold for some $m > m_{k-1}$. [sent-511, score-0.161]
80 We have established (12) save for m such that, inf w( fm + λhm ) > w0 + νk and εk,m ≥ 0. [sent-516, score-0.909]
81 Note first that by convexity, $|w'(f_m; f_m-f_k^*)| \ge w(f_m)-w(f_k^*) \equiv \varepsilon_{k,m}$. [sent-518, score-2.789]
82 (17) We obtain from (17) and the linearity of the derivative that, if $f_m - f_k^* = \sum\gamma_i h_i \in\mathcal F^\infty$, then $\varepsilon_{k,m} \le \sum(-\gamma_i)\,w'(f_m;h_i) \le \sup_{h\in\mathcal H}|w'(f_m;h)|\sum|\gamma_i|$. [sent-519, score-2.702]
83 Hence $\sup_{h\in\mathcal H}|w'(f_m;h)| \ge \varepsilon_{k,m}\,/\,\|f_m-f_k^*\|$. [sent-520, score-1.867]
84 (18) Now, if $f_{m+1} = f_m+\lambda_m h_m$ then $w(f_m+\lambda_m h_m) = w(f_m) + \lambda_m w'(f_m;h_m) + \frac12\lambda_m^2 w''(\tilde f_m;h_m)$, where $\tilde f_m = f_m+\lambda h_m$ for some $\lambda\in[0,\lambda_m]$. [sent-521, score-3.912]
85 By convexity, for $0\le\lambda\le\lambda_m$, $w(f_m+\lambda h_m) = w\big((1-\frac{\lambda}{\lambda_m})f_m + \frac{\lambda}{\lambda_m}f_{m+1}\big) \le \max\{w(f_m),w(f_{m+1})\} = w(f_m) \le w(f_1)$. [sent-523, score-3.34]
86 But then we conclude from (19) that $w(f_m+\lambda_m h_m) \ge w(f_m) + \inf_{\lambda\in\mathbb R}\big[\lambda w'(f_m;h_m) + \frac12\lambda^2\beta_k\big] = w(f_m) - \frac{|w'(f_m;h_m)|^2}{2\beta_k}$. (20) [sent-525, score-2.865]
87 = w( fm ) − 2βk (20) Note that w( fm +λh) = w( fm )+λw′ ( fm , h)+λ2 w′′ ( fm +λ′ h, h)/2 for some λ′ ∈ [0, λ], and if w( fm + λh) is close to infλ,h w( fm + λ, h) then by convexity, w( fm + λ′ h) ≤ w( fm ) ≤ w( f0 ). [sent-526, score-7.515]
88 From the upper bound on $w''$ we obtain: $w(f_m+\lambda_m h_m) \le \alpha\inf_{\lambda\in\mathbb R,\,h\in\mathcal H} w(f_m+\lambda h) + (1-\alpha)w(f_m)$, by definition, $\le \alpha\inf_{\lambda\in\mathbb R,\,h\in\mathcal H}\big(w(f_m)+\lambda w'(f_m;h)+\frac12\lambda^2 B\big) + (1-\alpha)w(f_m) = w(f_m) - \frac{\alpha\sup_{h\in\mathcal H}|w'(f_m;h)|^2}{2B}$, (21) by minimizing over $\lambda$. [sent-527, score-6.971]
89 Hence combining (20) and (21) we obtain $|w'(f_m;h_m)| \ge \sqrt{\alpha\beta_k/B}\,\sup_{h\in\mathcal H}|w'(f_m;h)|$. (22) By (21) for the LHS and convexity for the RHS: $\frac{\alpha\sup_{h\in\mathcal H}|w'(f_m;h)|^2}{2B} \le w(f_m)-w(f_{m+1}) \le -\lambda_m w'(f_m;h_m)$. Hence $|\lambda_m| \ge \frac{\alpha\sup_{h\in\mathcal H}|w'(f_m;h)|}{2B}$. [sent-528, score-5.351]
90 Similarly, $(\lambda_m-\lambda_m^0)^2 \le \frac{2\big(w(f_{m+1})-w(f_m+\lambda_m^0 h_m)\big)}{\beta_k} \le \frac{2(1-\alpha)}{\alpha\beta_k}\big(w(f_m)-w(f_{m+1})\big)$. (26) Combining (25) and (26): $\lambda_m^2 \le \frac{8}{\alpha\beta_k}\big(w(f_m)-w(f_{m+1})\big)$. [sent-533, score-2.648]
91 Proof of Theorem 1: Since the lemma established the existence of monotone $\zeta_k$'s, it follows from the definition of these functions that $w(f_m) \le w(f_{k(m)}^*)$ where $k(m) = \sup\{k : \zeta^{(k)}(f_0)\le m\}$ and $\zeta^{(k)} = \zeta_k\circ\cdots\circ\zeta_1$ is the $k$th iterate of the $\zeta$'s. [sent-539, score-0.849]
92 From (26) and (23), if $\varepsilon_{k,m}\ge 0$, $w(f_m)-w(f_{m+1}) \ge \frac{\alpha\beta_k}{2}\lambda_m^2 \ge \frac{\alpha\beta_k}{2}\Big(\frac{\alpha\,\varepsilon_{k,m}}{2B\,l_{k,m}}\Big)^2$. (33) Bound $l_{k,m}$ similarly to (30) by $l_{k,m} \le l_{k,1} + m^{1/2}\big(\sum_{i=1}^m\lambda_i^2\big)^{1/2} \le l_{k,1} + \sqrt{\frac{8m\,(w(f_0)-w_0)}{\alpha\beta_k}}$. [sent-542, score-0.835]
93 In particular, $m\le m^*\big(w(f_m)\big)$ for any $m$ and any realization of the algorithm. [sent-545, score-0.835]
94 We obtain therefore, by plugging (34) into (33), using $m^*$ as a bound on $m$ and the identity $(a+b)^2\le 2a^2+2b^2$, that: $w(f_m)-w(f_{m+1}) \ge \frac{\alpha^3\beta_k\,\big(w(f_m)-w(f_k^*)\big)^2}{16B^2\big(l_{k,1}^2 + 8m^*\big(w(f_m)\big)\,(w(f_0)-w_0)/(\alpha\beta_k)\big)}$, as long as $\varepsilon_{k,m}\ge 0$. [sent-546, score-2.647]
95 Taking the maximum of the RHS over the permitted range yields a candidate for the $\xi$ function: $\xi(w) \equiv \sup_{k:\,w(f_k^*)\le w}\ \frac{\alpha^3\beta_k\,\big(w-w(f_k^*)\big)^2}{16B^2\big(l_{k,1}^2 + 8m^*(w)\,(w(f_0)-w_0)/(\alpha\beta_k)\big)}$. [sent-547, score-0.156]
96 This proves the theorem under GS1. [sent-548, score-0.222]
97 Proof of Lemmas 10 and 11 and Theorem 8. Proof of Lemma 10: Since by (R2), $\lambda_{\max}(G_m(P)) = \sup_{\|x\|=1} x'G_m(P)x = \sup_{\|x\|=1}\sum\sum x_i x_j\int f_{m,i}f_{m,j}\,dP = \sup_{\|x\|=1}\int\big(\sum x_i f_{m,i}\big)^2\,dP \le \varepsilon^{-1}\sup_{\|x\|=1}\int\big(\sum x_i f_{m,i}\big)^2\,d\mu = \varepsilon^{-1}$, (35) and $\lambda_{\max}(G_m(P)) \ge \varepsilon$ similarly. [sent-553, score-0.265]
98 (37) If $\mathcal H$ is a VC class, we can conclude from (35)–(37) that $P[\gamma(\hat G_m)\ge C_1] \le C_2\exp\{-C_3 n/(L^2 D_m)\}$, (38) since by R1(i), $\|f_m\|_\infty \le C_\infty D_m^{1/2}$. [sent-559, score-0.835]
99 Consistency for L2 boosting and matching pursuit with trees and tree-type basis functions. [sent-676, score-0.131]
100 Additive logistic regression: a statistical view of boosting (with discussion). [sent-704, score-0.117]
wordName wordTfidf (topN-words)
[('fm', 0.835), ('mk', 0.147), ('hm', 0.143), ('fk', 0.142), ('mn', 0.138), ('fp', 0.131), ('ep', 0.125), ('boosting', 0.117), ('ew', 0.108), ('akai', 0.1), ('ickel', 0.1), ('itov', 0.1), ('dm', 0.093), ('eneralized', 0.093), ('heory', 0.093), ('gm', 0.088), ('ome', 0.079), ('oosting', 0.079), ('minimax', 0.074), ('inf', 0.074), ('baraud', 0.064), ('xs', 0.063), ('tsybakov', 0.055), ('sup', 0.055), ('lgorithms', 0.05), ('kn', 0.049), ('hlmann', 0.043), ('yu', 0.043), ('yi', 0.041), ('bickel', 0.04), ('cm', 0.039), ('sgn', 0.039), ('ex', 0.038), ('massart', 0.036), ('birg', 0.036), ('stumps', 0.036), ('sieve', 0.036), ('regret', 0.035), ('bn', 0.034), ('population', 0.034), ('oracle', 0.032), ('bartlett', 0.032), ('ritov', 0.03), ('rhs', 0.03), ('mcauliffe', 0.028), ('wavelets', 0.027), ('lk', 0.026), ('theorem', 0.025), ('adaboost', 0.025), ('bed', 0.024), ('xb', 0.024), ('rn', 0.022), ('smooth', 0.022), ('dyadic', 0.022), ('mammen', 0.022), ('barron', 0.022), ('fmk', 0.021), ('suph', 0.021), ('wlog', 0.021), ('bayes', 0.021), ('arg', 0.021), ('xn', 0.021), ('dp', 0.02), ('misclassi', 0.02), ('zhang', 0.02), ('pn', 0.019), ('fj', 0.018), ('jerusalem', 0.018), ('classi', 0.018), ('fb', 0.017), ('gy', 0.017), ('stopping', 0.017), ('log', 0.017), ('consistency', 0.016), ('gram', 0.016), ('golden', 0.016), ('ci', 0.016), ('cn', 0.016), ('condition', 0.016), ('convergence', 0.015), ('min', 0.015), ('mallat', 0.015), ('fy', 0.015), ('sobolev', 0.015), ('vc', 0.015), ('xi', 0.015), ('span', 0.015), ('baron', 0.014), ('permitted', 0.014), ('sgnf', 0.014), ('lemma', 0.014), ('induction', 0.014), ('pursuit', 0.014), ('mth', 0.014), ('xd', 0.014), ('lugosi', 0.014), ('friedman', 0.013), ('sequence', 0.013), ('wavelet', 0.013), ('suppose', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai
Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate
2 0.27502608 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
Author: Guillaume Lecué
Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence
3 0.13582039 83 jmlr-2006-Sparse Boosting
Author: Peter Bühlmann, Bin Yu
Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression
4 0.065127388 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
5 0.058732349 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
6 0.049908709 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
7 0.046377562 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
8 0.044611137 45 jmlr-2006-Learning Coordinate Covariances via Gradients
9 0.04220365 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
10 0.038588859 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
11 0.034043472 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
12 0.03324547 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
13 0.032481473 48 jmlr-2006-Learning Minimum Volume Sets
14 0.032102045 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
15 0.029404297 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
16 0.026217027 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
17 0.023994481 75 jmlr-2006-Policy Gradient in Continuous Time
18 0.023454076 16 jmlr-2006-Bounds for Linear Multi-Task Learning
19 0.022330714 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
20 0.022174468 93 jmlr-2006-Universal Kernels
topicId topicWeight
[(0, 0.146), (1, -0.038), (2, -0.019), (3, -0.305), (4, -0.104), (5, 0.311), (6, -0.166), (7, 0.105), (8, 0.051), (9, -0.119), (10, -0.27), (11, 0.233), (12, -0.117), (13, 0.245), (14, 0.073), (15, 0.174), (16, 0.109), (17, 0.101), (18, 0.129), (19, -0.006), (20, 0.072), (21, 0.119), (22, -0.097), (23, 0.145), (24, 0.007), (25, 0.085), (26, -0.122), (27, -0.008), (28, 0.078), (29, 0.035), (30, 0.007), (31, -0.001), (32, 0.006), (33, -0.008), (34, 0.034), (35, -0.063), (36, 0.048), (37, 0.052), (38, 0.02), (39, -0.009), (40, 0.033), (41, -0.01), (42, -0.002), (43, -0.012), (44, -0.028), (45, 0.017), (46, -0.031), (47, -0.016), (48, 0.019), (49, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.98000765 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai
Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate
2 0.79526591 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
Author: Guillaume Lecué
Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence
3 0.4701317 83 jmlr-2006-Sparse Boosting
Author: Peter Bühlmann, Bin Yu
Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression
4 0.25011599 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
5 0.18910372 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
6 0.13755803 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
7 0.13191238 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
8 0.12292309 45 jmlr-2006-Learning Coordinate Covariances via Gradients
9 0.12014002 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
10 0.11973274 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
11 0.11734261 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
12 0.11073014 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.10838693 48 jmlr-2006-Learning Minimum Volume Sets
14 0.10412652 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
15 0.10165296 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
16 0.099354997 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
17 0.097914256 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity (Special Topic on Machine Learning and Optimization)
18 0.095289543 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
19 0.093600444 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
20 0.091921248 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
topicId topicWeight
[(8, 0.014), (11, 0.337), (14, 0.042), (36, 0.07), (45, 0.029), (50, 0.085), (57, 0.016), (63, 0.016), (76, 0.033), (78, 0.017), (81, 0.085), (84, 0.033), (90, 0.033), (91, 0.015), (96, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.78496981 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai
Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate
2 0.75620568 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
Author: Tomáš Šingliar, Miloš Hauskrecht
Abstract: We develop a new component analysis framework, the Noisy-Or Component Analyzer (NOCA), that targets high-dimensional binary data. NOCA is a probabilistic latent variable model that assumes the expression of observed high-dimensional binary data is driven by a small number of hidden binary sources combined via noisy-or units. The component analysis procedure is equivalent to learning of NOCA parameters. Since the classical EM formulation of the NOCA learning problem is intractable, we develop its variational approximation. We test the NOCA framework on two problems: (1) a synthetic image-decomposition problem and (2) a co-citation data analysis problem for thousands of CiteSeer documents. We demonstrate good performance of the new model on both problems. In addition, we contrast the model to two mixture-based latent-factor models: the probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Differing assumptions underlying these models cause them to discover different types of structure in co-citation data, thus illustrating the benefit of NOCA in building our understanding of highdimensional data sets. Keywords: component analysis, vector quantization, variational learning, link analysis
3 0.39697021 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
4 0.39097834 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
5 0.39085665 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
Author: Francis R. Bach, David Heckerman, Eric Horvitz
Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification
6 0.38589612 83 jmlr-2006-Sparse Boosting
7 0.38455284 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
9 0.37998915 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
10 0.37572092 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
11 0.36996761 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
12 0.36901569 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
13 0.36507455 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
14 0.36504716 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
15 0.36393538 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
16 0.35880166 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
17 0.35832304 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
18 0.35608587 48 jmlr-2006-Learning Minimum Volume Sets
19 0.35345027 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
20 0.34908149 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation