jmlr jmlr2013 jmlr2013-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
Reference: text
sentIndex sentText sentNum sentScore
1 Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. [sent-10, score-0.446]
2 We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. [sent-13, score-0.446]
3 In other words, the exponential loss of AdaBoost will be at most ε more than that of any other parameter vector λ of ℓ1-norm bounded by B in a number of rounds that is bounded by a polynomial in log N, m, B and 1/ε. [sent-48, score-0.446]
4 Within the proof of the second convergence rate, we provide a lemma (called the decomposition lemma) that shows that the training set can be split into two sets of examples: the “finite margin set” and the “zero loss set.” [sent-57, score-0.621]
5 If we consider the exponential loss where the sum is only over the finite margin set (rather than over all training examples), it is minimized by a finite λ. [sent-60, score-0.44]
6 Freund and Schapire (1997) and Schapire and Singer (1999) showed that the exponential loss is at most e^{−2tγ²} after t rounds, so AdaBoost rapidly converges to the minimum possible loss under this assumption. [sent-71, score-0.386]
7 For t = 1, . . . , T : • Train weak learner using distribution Dt; that is, find weak hypothesis ht ∈ H whose correlation rt ≜ E_{i∼Dt}[yi ht(xi)] has maximum magnitude |rt|. [sent-109, score-0.559]
8 Then the exponential loss can be written more compactly as: L(λ) = (1/m) ∑_{i=1}^{m} e^{−(Mλ)i}, where (Mλ)i, the ith coordinate of the vector Mλ, is the (unnormalized) margin achieved by vector λ on training example i. [sent-118, score-0.485]
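To make the coordinate-descent view of these updates concrete, here is a minimal Python sketch (an illustration written for this summary, not code from the paper). It assumes the hypothesis space is given as a precomputed m × N matrix M with entries M[i, j] = yi hj(xi) ∈ {−1, +1}, and that no weak hypothesis is perfect (|rt| < 1).

```python
import numpy as np

def adaboost_exponential_loss(M, T):
    """AdaBoost as coordinate descent on L(lam) = (1/m) * sum_i exp(-(M @ lam)_i).

    M is an m x N array with M[i, j] = y_i * h_j(x_i) in {-1, +1}.
    Each round selects the column with the largest |correlation| under the
    current example weights and takes the usual AdaBoost step; the sketch
    assumes |r_t| < 1 so the step size stays finite.
    """
    m, N = M.shape
    lam = np.zeros(N)
    losses = []
    for _ in range(T):
        w = np.exp(-(M @ lam))
        D = w / w.sum()                # distribution D_t over examples
        r = M.T @ D                    # correlations E_{i~D_t}[y_i h_j(x_i)]
        j = int(np.argmax(np.abs(r)))  # weak hypothesis with max |r_t|
        alpha = 0.5 * np.log((1 + r[j]) / (1 - r[j]))
        lam[j] += alpha                # coordinate step on column j_t
        losses.append(np.exp(-(M @ lam)).mean())
    return lam, losses
```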
9 First Convergence Rate: Convergence To Any Target Loss In this section, we bound the number of rounds of AdaBoost required to get within ε of the loss attained by a parameter vector λ∗ as a function of ε and the ℓ1-norm ‖λ∗‖1. [sent-125, score-0.392]
10 The vector λ∗ serves as a reference based on which we define the target loss L(λ∗ ), and we will show that its ℓ1 -norm measures the difficulty of attaining the target loss in a specific sense. [sent-126, score-0.415]
11 One key parameter is the suboptimality Rt of AdaBoost’s solution measured via the logarithm of the exponential loss: Rt ≜ ln L(λt) − ln L(λ∗). [sent-142, score-0.81]
12 Lemma 2 If for some constants c1, c2, where c2 ≥ 1, the edge satisfies δt ≥ B^{−c1} R_{t−1}^{c2} in each round, then AdaBoost achieves at most L(λ∗) + ε loss after 2B^{2c1}(ε ln 2)^{1−2c2} rounds. [sent-155, score-0.662]
13 From the definition of Rt and (5) we have ∆Rt = ln L(λt−1) − ln L(λt) ≥ −(1/2) ln(1 − δt²). [sent-157, score-0.664]
14 Let T = ⌈2B^{2c1}(ε ln 2)^{1−2c2}⌉ be the bound on the number of rounds in the lemma. [sent-159, score-0.566]
15 Combining these bounds over R1, . . . , RT, and using c2 ≥ 1, we get R_T^{1−2c2} ≥ R_0^{1−2c2} + (c2 − 1/2) B^{−2c1} T ≥ (1/2) B^{−2c1} T ≥ (ε ln 2)^{1−2c2} =⇒ RT ≤ ε ln 2. [sent-168, score-0.664]
16 Otherwise, L(λT) ≤ L(λ∗)e^{ε ln 2} ≤ L(λ∗)(1 + ε) ≤ L(λ∗) + ε, where the second inequality uses e^x ≤ 1 + (1/ln 2)x for x ∈ [0, ln 2]. [sent-170, score-1.019]
17 Lemma 2 provides the desired bound on the number of rounds: 2(ε ln 2)^{1−2·3} B^{2·3} < 13B^6 ε^{−5}. [sent-202, score-0.376]
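The constant 13 in this bound is just the numerical value of 2(ln 2)^{−5}; a quick worked check:

```latex
2(\varepsilon \ln 2)^{1-2\cdot 3} B^{2\cdot 3}
  = \frac{2}{(\ln 2)^{5}}\, B^{6}\, \varepsilon^{-5}
  \approx 12.5\, B^{6}\, \varepsilon^{-5}
  < 13\, B^{6}\, \varepsilon^{-5}.
```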
18 ϒ(x) = −ln(1 − x²) / ln((1 + x)/(1 − x)). It is known (Rätsch and Warmuth, 2005; Rudin et al. [sent-216, score-0.394]
19 AdaBoost.S is the same as AdaBoost, except that at the end of each round, the current combination of weak hypotheses is scaled back, that is, multiplied by a scalar in [0, 1] if doing so will reduce the exponential loss further. [sent-243, score-0.432]
20 However, after choosing the new coordinate jt on each round to form a new combination λ̃t, the result is multiplied by the value st in [0, 1] that causes the greatest decrease in the exponential loss: st = argmin_s L(sλ̃t), and then we assign λt = st λ̃t. [sent-245, score-0.789]
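A minimal sketch of this scaling step (my own illustration; a grid search stands in for the exact one-dimensional argmin, which is a convex problem in s):

```python
import numpy as np

def rescale_step(M, lam_tilde, grid=1001):
    """AdaBoost.S-style scaling: pick s in [0, 1] minimizing L(s * lam_tilde).

    L(s * lam) = (1/m) * sum_i exp(-s * (M @ lam)_i) is convex in s, so a fine
    grid (or any 1-D convex minimizer) is enough for this sketch.
    """
    margins = M @ lam_tilde
    s_values = np.linspace(0.0, 1.0, grid)
    losses = np.exp(-np.outer(s_values, margins)).mean(axis=1)
    s_best = s_values[np.argmin(losses)]
    return s_best * lam_tilde
```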
21 Then the number of rounds required to achieve a target loss L∗ is at least inf{‖λ‖1 : L(λ) ≤ L∗}/(2 ln m). [sent-282, score-0.78]
22 Unless both margins lie in [− ln m, ln m], one of them will be smaller than − ln m. [sent-285, score-1.202]
23 But then the exponential loss L(λt) = (1/m) ∑_j e^{−(Mλt)j} in that round will exceed 1, a contradiction since the losses are non-increasing through rounds, and the loss at the start was 1. [sent-286, score-0.521]
24 Thus, assigning one of these examples the index i, we have that the absolute margin |(Mλt)i| is bounded by ln m in any round t. [sent-287, score-0.702]
25 Letting M(i) denote the ith row of M, the step length αt in round t therefore satisfies |αt| = |Mi,jt αt| = |⟨M(i), αt e_{jt}⟩| = |(Mλt)i − (Mλt−1)i| ≤ |(Mλt)i| + |(Mλt−1)i| ≤ 2 ln m, and the statement of the lemma directly follows. [sent-288, score-0.669]
26 Therefore, by Lemma 9, the minimum number of rounds required for reaching loss at most 2/m + ε is at least ((2^{m−2} − 1)/(2 ln m)) ln(1/ε). [sent-306, score-0.68]
27 For a small constant ε (say 0.01), if λ∗ is some vector with loss L∗ + ε/(2m), AdaBoost takes at least Ω(2^m / ln m) steps to get within ε/(2m) of the loss achieved by λ∗. [sent-309, score-0.648]
28 Continuing this way, if the margin of example i is at least ln(1/ε), then λi ≥ ln(1/ε) + λi+1 + · · · . [sent-331, score-0.544]
29 Therefore, by Lemma 9, the number of rounds required to achieve loss at most 1/2 + ε is at least ν^{−1} ln(1/(4ε))/ln(m). [sent-373, score-0.371]
30 Therefore, any solution λ = (λ1 , λ2 ) achieving at most 1/2 + ε loss overall must achieve a margin of at least ln(1/(4ε)) on both the third and fourth examples. [sent-379, score-0.481]
31 Inspecting these two rows shows that λ2 − λ1 + λ1ν ≥ ln(1/(4ε)) and λ1 − λ2 + λ2ν ≥ ln(1/(4ε)). [sent-380, score-0.664]
32 Adding the two inequalities we find ν(λ1 + λ2) ≥ 2 ln(1/(4ε)) =⇒ λ1 + λ2 ≥ 2ν^{−1} ln(1/(4ε)). [sent-381, score-0.664]
33 Note that if ν = 0, then an optimal solution is found in zero rounds of boosting and has optimal loss 1. [sent-383, score-0.548]
34 In fact, by Theorem 12, the number of rounds required to achieve any fixed loss below 1 grows as Ω(1/ν), which is arbitrarily large when ν is infinitesimal. [sent-385, score-0.402]
35 In the next section we investigate the optimal dependence of the rate on the parameter ε and show that Ω(1/ε) rounds are necessary in the worst case. [sent-391, score-0.351]
36 Additionally, we show that this dependence on ε is optimal in Lemma 31 of the Appendix, where Ω(1/ε) rounds are shown to be necessary for converging to within ε of the optimal loss on a certain data set. [sent-398, score-0.443]
37 In this situation, Freund and Schapire (1997) and Schapire and Singer (1999) show that the optimal loss is zero, that no solution with finite size can achieve this loss, but AdaBoost achieves at most ε loss within O(ln(1/ε)) rounds. [sent-406, score-0.46]
38 There exists some positive constant γ > 0, and some vector η† with unit ℓ1-norm ‖η†‖1 = 1, that attains at least γ margin on each example in Z, and exactly zero margin on each example in F: ∀i ∈ Z : (Mη†)i ≥ γ, and ∀i ∈ F : (Mη†)i = 0. [sent-416, score-0.468]
39 There is a constant µmax < ∞, such that for any combination η with bounded loss on the finite-margin set, specifically obeying ∑_{i∈F} e^{−(Mη)i} ≤ m, the margin (Mη)i for any example i in F lies in the bounded interval [− ln m, µmax]. [sent-422, score-0.736]
40 The decomposition lemma immediately implies that the vector η ∗ + ∞ · η † , which denotes η ∗ + cη † in the limit c → ∞, is an optimal solution, achieving zero loss on the zero-loss set, but only finite margins (and hence positive losses) on the finite-margin set (thereby justifying the names). [sent-424, score-0.646]
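Item 1 above is an existence statement; for a given candidate split (Z, F) of the examples it can be checked by a small linear program. The following sketch is my own illustration, not code from the paper: it maximizes the margin γ subject to zero margins on F, and uses box constraints |ηj| ≤ 1 in place of the unit ℓ1-norm, which only changes the scale of η†.

```python
import numpy as np
from scipy.optimize import linprog

def find_eta_dagger(M, Z, F):
    """Search for eta with (M @ eta)_i >= gamma on Z and (M @ eta)_i = 0 on F.

    Z and F are non-empty index arrays partitioning the rows of M.  Returns a
    candidate eta_dagger and the achieved gamma; gamma > 0 certifies Item 1
    for this particular split.
    """
    m, N = M.shape
    c = np.zeros(N + 1)
    c[-1] = -1.0                                        # maximize gamma
    A_ub = np.hstack([-M[Z, :], np.ones((len(Z), 1))])  # gamma - (M eta)_i <= 0 on Z
    b_ub = np.zeros(len(Z))
    A_eq = np.hstack([M[F, :], np.zeros((len(F), 1))])  # (M eta)_i = 0 on F
    b_eq = np.zeros(len(F))
    bounds = [(-1.0, 1.0)] * N + [(0.0, 1.0)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:N], res.x[-1]
```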
41 This data set also serves as a lower-bound example in Lemma 31, where we show that 2/(9ε) rounds are necessary for AdaBoost to achieve loss at most (2/3) + ε, where the infimum of the loss is 2/3. [sent-431, score-0.529]
42 In the rest of the section, by “loss” we mean the unnormalized loss ℓλ(X) = mL(λ), and show that in C/ε rounds AdaBoost converges to within ε of the optimal unnormalized loss infλ ℓλ(X), henceforth denoted by K. [sent-440, score-0.584]
43 This implies that the optimal loss K is at least 1 (since we are in the nonseparable case, and any solution will get non-positive margin on some example in F), a fact that we will use later in the proofs. [sent-457, score-0.443]
44 The last inequality holds because θ ≤ m, which follows since the total loss K + θ at any stage of boosting is less than the initial loss m. [sent-512, score-0.415]
45 Suppose η(1), η(2), . . . is a sequence of combinations of weak classifiers such that Z is its zero loss set, and X \ Z its finite margin set, that is, (15) holds with respect to the entire sequence itself. [sent-556, score-0.545]
46 Then there is a combination η† of weak classifiers that achieves positive margin on every example in Z, and zero margin on every example in its complement X \ Z, that is: (Mη†)i > 0 if i ∈ Z, and (Mη†)i = 0 if i ∈ X \ Z. [sent-557, score-0.633]
47 First, we find a unit vector η′ that we will show has a positive margin on a non-empty subset S of Z and zero margins on X \ Z. [sent-567, score-0.462]
48 Since translating a vector along the null space of M, ker M = {x : Mx = 0}, has no effect on the margins produced by the vector, assume without loss of generality that the η(t) ’s are orthogonal to ker M. [sent-568, score-0.604]
49 Then firstly η′ achieves non-negative margin on every example; otherwise, by continuity, for some extremely large t the margin of η(t)/‖η(t)‖ on that example is also negative and bounded away from zero, and therefore η(t)’s loss is more than m, which is a contradiction to admissibility. [sent-572, score-0.63]
50 Secondly, the margin of η′ on each example in X \ Z is zero; otherwise, by continuity, for arbitrarily large t the margin of η(t)/‖η(t)‖ on an example in X \ Z is positive and bounded away from zero, and hence that example attains arbitrarily small loss in the sequence, a contradiction to (15). [sent-573, score-0.665]
51 Therefore η ′ must achieve positive margin on some non-empty subset S of Z, and zero margins on every other example. [sent-575, score-0.485]
52 The set X ′ is smaller than the set X, and thus the inductive hypothesis holds for X ′ , meaning that there exists some η ′′ that achieves positive margins on every element in Z ′ , and zero margins on every element in X ′ \ Z ′ = X \ Z. [sent-579, score-0.531]
53 Therefore, by setting η † = η ′ + cη ′′ for a suitable c, we can achieve a positive margin on every element in S ∪ Z ′ = Z, and zero margins on every element in X \ Z, completing the proof. [sent-580, score-0.485]
54 Now, applying Lemma 20 to the sequence η(t) yields some convex combination η † having margin at least γ > 0 (for some γ) on Z and zero margin on its complement, proving Item 1 of the decomposition lemma. [sent-584, score-0.547]
55 Lemma 22 There is a constant µmax < ∞, such that for any combination η that achieves bounded loss on the finite-margin set, ℓη (F) ≤ m, the margin (Mη)i for any example i in F lies in the bounded interval [− ln m, µmax ] . [sent-600, score-0.784]
56 Proof Since the loss ℓη(F) is at most m, no margin may be less than − ln m. [sent-601, score-0.702]
57 Suppose arbitrarily large margins are producible by bounded loss vectors, that is, arbitrarily large elements are present in the set {(Mη)i : ℓη(F) ≤ m, 1 ≤ i ≤ m}. [sent-603, score-0.426]
58 Then for some fixed example x ∈ F there exists a sequence of combinations of weak classifiers whose t-th element achieves margin more than t on x but has loss at most m on F. [sent-604, score-0.547]
59 Applying Lemma 19 we can find a subsequence λ(t) whose tail achieves vanishingly small loss on some non-empty subset S of F containing x, and bounded margins in F \ S. [sent-605, score-0.454]
60 Applying Lemma 20 to λ(t) we get some convex combination λ† which has positive margins on S and zero margin on F \ S. [sent-606, score-0.496]
61 Then η ∗ + ∞ · λ† achieves the same loss on every example in F \ S as the optimal solution η ∗ , but zero loss for examples in S. [sent-608, score-0.513]
62 By Item 1 of the decomposition lemma, the zero loss set in this data set is empty since there cannot be an η † with both positive and zero margins and no negative margins. [sent-650, score-0.474]
63 Next, we choose a combination with at most m loss that nevertheless achieves Ω(2^m/m) positive margin on some example. [sent-652, score-0.473]
64 For m > 10, the choice of ε guarantees 1/(2m) = ε ≥ (ln m)/(2^{m−1} − 1), so that the loss on the example corresponding to the bottom-most row is e^{−ε(2^{m−1}−1)} ≤ e^{−ln m} = 1/m. [sent-666, score-0.49]
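The requirement 1/(2m) ≥ (ln m)/(2^{m−1} − 1) for m > 10 is easy to verify numerically; a quick check (my own illustration):

```python
import math

# Verify 2**(m-1) - 1 >= 2*m*ln(m), which rearranges to 1/(2m) >= ln(m)/(2**(m-1) - 1).
for m in range(11, 31):
    assert (2 ** (m - 1) - 1) >= 2 * m * math.log(m)
```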
65 Then each of the quantities λ_min^{−1}, γ^{−1} and µmax is at most 2^{O(m ln m)}. [sent-673, score-0.353]
66 Lemma 25 and Corollary 23 together imply a convergence rate of 2^{2^{O(m ln m)}}/ε to the optimal loss for integer matrices. [sent-675, score-0.631]
67 In the next section we will see how to obtain poly(2^{m ln m}, ε^{−1}) bounds, although with a worse dependence on ε. [sent-678, score-0.371]
68 When the feature matrix has only −1, 0, +1 entries, we may bound ‖η∗‖1 ≤ 2^{O(m ln m)}. [sent-681, score-0.439]
69 Proof Note that every entry of MF η∗ lies in the range [− ln m, µmax = 2^{O(m ln m)}], and hence ‖MF η∗‖ ≤ 2^{O(m ln m)}. [sent-682, score-0.996]
70 Next, we may choose η∗ orthogonal to the null space of MF; then ‖η∗‖ ≤ λ_min^{−1} ‖MF η∗‖ ≤ √m · 2^{O(m ln m)} = 2^{O(m ln m)}. [sent-683, score-0.353]
71 We first show how the finite rate bound of Theorem 1 along with the decomposition lemma yields a new rate of convergence to the optimal loss. [sent-689, score-0.42]
72 Lemma 27 When the feature matrix has −1, 0, +1 entries, for any ε > 0, there is some solution with ℓ1-norm at most 2^{O(m ln m)} ln(1/ε) that achieves within ε of the optimal loss. [sent-693, score-0.516]
73 Then η∗ − cη† produces non-negative margins on Z, since Mη∗ − cMη† ≥ 0, and it attains the optimal margins on the finite margin set F, since Mη† = 0 on F. [sent-696, score-0.652]
74 Therefore, the vector λ∗ = η∗ + (ln(1/ε)γ^{−1} − c)η† achieves at least ln(1/ε) margin on every example in Z, and optimal margins on the finite margin set F. [sent-697, score-0.652]
75 We may now invoke Theorem 1 to obtain a 2^{O(m ln m)} ln^6(1/ε) ε^{−5} rate of convergence to the optimal solution. [sent-701, score-0.473]
76 In that way we may obtain a poly(λ_min^{−1}, γ^{−1}, µmax) ε^{−3} = 2^{O(m ln m)} ε^{−3} rate bound. [sent-703, score-0.4]
77 Finally, note that if Conjecture 6 is true, then Lemma 27 provides a bound for B in Conjecture 6, implying a 2^{O(m ln m)} ln(1/ε) ε^{−1} rate bound for converging to the optimal loss, which is nearly optimal in both m and ε. [sent-705, score-0.544]
78 Conjecture 28 For feature matrices with −1, 0, +1 entries, AdaBoost converges to within ε of the optimal loss within 2^{O(m ln m)} ε^{−(1+o(1))} rounds. [sent-707, score-0.55]
79 Then, the number of rounds required by such a procedure to achieve a target loss φ∗ is at least inf{‖λ‖1 : L(λ) ≤ φ∗}/(2 + 2 ln m). [sent-718, score-0.78]
80 Proof It suffices to upper-bound the step size |αt| in any round t by at most 2 + 2 ln m; as long as the step size is smaller than or equal to 2 + 2 ln m, it will take at least inf{‖λ‖1 : L(λ) ≤ φ∗}/(2 + 2 ln m) steps for ‖λ‖1 to be at least inf{‖λ‖1 : L(λ) ≤ φ∗}. [sent-719, score-1.181]
81 To see this, recall that (3) shows that a step αt causes the loss to change by a factor of f(αt) given by: f(αt) = ((1 + rt)/2) e^{−αt} + ((1 − rt)/2) e^{αt}, where rt denotes the correlation in direction jt in which the step is taken. [sent-720, score-1.022]
82 Since f (αt ) = 1 at αt = 0 and αt = ln ((1 + rt )/(1 − rt )), the desired maximum magnitude is the latter value. [sent-724, score-0.872]
83 Therefore, |αt| ≤ |ln((1 + rt)/(1 − rt))| = ln((1 + |rt|)/(1 − |rt|)) = ln((1 + δt)/(1 − δt)). [sent-725, score-1.536]
84 Now the step length can be bounded as |αt| ≤ ln((1 + δt)/(1 − δt)) = 2 ln(1 + δt) − ln(1 − δt²) ≤ 2δt + 2 ln m ≤ 2 + 2 ln m. [sent-730, score-0.996]
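A quick numerical sanity check of the factor f(αt) used in this argument (my own illustration): f equals 1 at αt = 0 and at αt = ln((1 + rt)/(1 − rt)), and is strictly below 1 in between, which is what forces any non-increasing step to lie in that interval.

```python
import numpy as np

def loss_factor(alpha, r):
    """f(alpha) from the text: multiplicative change of the exponential loss."""
    return 0.5 * (1 + r) * np.exp(-alpha) + 0.5 * (1 - r) * np.exp(alpha)

for r in [0.1, 0.5, 0.9]:
    right = np.log((1 + r) / (1 - r))
    assert np.isclose(loss_factor(0.0, r), 1.0)
    assert np.isclose(loss_factor(right, r), 1.0)
    interior = np.linspace(1e-3, right - 1e-3, 100)
    assert np.all(loss_factor(interior, r) < 1.0)   # loss strictly decreases inside
```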
85 We end by showing a new lower bound for the convergence rate to an arbitrary target loss studied in Section 3. [sent-731, score-0.374]
86 Proof The decomposition lemma implies that λ∗ = η∗ + ln(2/ε)η†, with ℓ1-norm O(ln(1/ε)), achieves loss at most K + ε/2 (recall K is the optimal loss). [sent-739, score-0.401]
87 We showed upper and lower bounds for convergence rates both to an arbitrary target loss achieved by some finite combination of the weak hypotheses and to the infimum loss, which may not be realizable. [sent-744, score-0.544]
88 For the first convergence rate, we showed that a strong relationship exists between the size of the minimum vector achieving a target loss and the number of rounds of coordinate descent required to achieve that loss. [sent-745, score-0.581]
89 Proof Note that the optimal loss is 2/3, and we are bounding the number of rounds necessary to achieve loss at most (2/3) + ε for ε < 1/3. [sent-766, score-0.534]
90 Let wta, wtb, wtc denote the normalized losses (adding up to 1), or weights, on examples a, b, c at the beginning of round t; ht the weak hypothesis chosen in round t; and δt the edge in round t. [sent-768, score-0.774]
91 Assume without loss of generality that 1/2 = wta > wtb > wtc ; note this implies wtb = 1 − (wta + wtc ) = 1/2 − wtc . [sent-779, score-0.549]
92 Therefore, the loss after T rounds is at least (2/3)√(1 + 1/T) ≥ (2/3)(1 + 1/(3T)) = 2/3 + 2/(9T), where the inequality holds for T ≥ 1. [sent-789, score-0.371]
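The last inequality above is an instance of the elementary bound √(1 + x) ≥ 1 + x/3 for 0 ≤ x ≤ 3, applied with x = 1/T ≤ 1; squaring both sides gives a quick check:

```latex
\left(1+\tfrac{x}{3}\right)^{2}
  = 1 + \tfrac{2x}{3} + \tfrac{x^{2}}{9}
  \le 1 + \tfrac{2x}{3} + \tfrac{x}{3}
  = 1 + x
  \qquad \text{for } 0 \le x \le 3 .
```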
93 Since the initial loss is 1 = (2/3) + 1/3, for any ε < 1/3 the number of rounds needed to achieve loss (2/3) + ε is at least 2/(9ε). [sent-790, score-0.371]
94 To show (17), we first rearrange the condition (16) on ut, ut+1 to obtain ut+1 ≤ ut(1 − c0 ut^p) =⇒ ut/ut+1 ≥ 1/(1 − c0 ut^p) ≥ 1/exp(−c0 ut^p) = exp(c0 ut^p), where the last inequality again uses the exponential inequality. [sent-806, score-0.769]
95 Notice that in dividing by ut+1 and (1 − c0 ut^p), the inequality does not flip, since both terms are positive: ut, ut+1 are positive according to the conditions of the lemma, and (1 − c0 ut^p) is positive because of the inequality on the left side of the implication above. [sent-807, score-0.432]
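Recurrences of this form are what drive the polynomial rates in this paper: if ut+1 ≤ ut(1 − c0 ut^p), then ut^{−p} grows by at least c0·p per round, so ut = O(t^{−1/p}). The following numerical check is my own illustration of that standard consequence (the paper's exact statement (17) may be phrased differently).

```python
import numpy as np

def simulate_recurrence(u0, c0, p, T):
    """Iterate the worst case u_{t+1} = u_t * (1 - c0 * u_t**p) and check that
    u_t**(-p) increases by at least c0 * p per round."""
    u = u0
    for t in range(1, T + 1):
        u = u * (1.0 - c0 * u ** p)
        assert u ** (-p) >= u0 ** (-p) + c0 * p * t - 1e-9
    return u

# Example: the sequence decays roughly like t**(-1/p).
print(simulate_recurrence(u0=0.9, c0=0.1, p=2.0, T=1000))
```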
96 Indeed, by expanding out ‖xk + Bx−k‖² as an inner product with itself, we have ‖xk + Bx−k‖² = ‖xk‖² + ‖Bx−k‖² + 2⟨xk, Bx−k⟩ ≥ ‖xk‖² + 2‖x−k‖² ≥ ‖x‖², where the first inequality follows since x ∈ (ker A)⊥ implies ⟨xk, Bx−k⟩ = ⟨x−k, x−k⟩. [sent-847, score-0.461]
97 Proof Item 3 of the decomposition lemma states that whenever the loss ℓx(F) of a vector is bounded by m, then the largest margin max_{i∈F}(MF x)i is at most µmax. [sent-894, score-0.537]
98 This implies that there is no vector x such that MF x ≥ 0 and at least one of the margins (MF x)i is positive; otherwise, an arbitrarily large multiple of x would still have loss at most m, but margin exceeding the constant µmax . [sent-895, score-0.607]
99 Since the loss of λ on F is at most m, each margin on F is at least − ln m, and therefore max_{i∈F}(−A^T λ)i ≤ ln m. [sent-910, score-1.034]
100 Hence, the margin of example i can be bounded as (Mλ)i = ⟨y, −A^T λ⟩ ≤ (ln m)‖y‖1. [sent-911, score-0.544]
wordName wordTfidf (topN-words)
[('adaboost', 0.398), ('ln', 0.332), ('mf', 0.319), ('rt', 0.27), ('margin', 0.212), ('margins', 0.206), ('rounds', 0.19), ('st', 0.164), ('loss', 0.158), ('oost', 0.15), ('rudin', 0.146), ('utp', 0.14), ('chapire', 0.128), ('ukherjee', 0.128), ('lemma', 0.124), ('schapire', 0.12), ('ker', 0.12), ('onvergence', 0.107), ('round', 0.105), ('poly', 0.103), ('ate', 0.1), ('hypotheses', 0.087), ('item', 0.085), ('weak', 0.083), ('wtc', 0.08), ('boosting', 0.076), ('da', 0.072), ('exponential', 0.07), ('rate', 0.068), ('edge', 0.067), ('ut', 0.064), ('tsch', 0.062), ('bx', 0.057), ('jt', 0.054), ('dt', 0.054), ('conjecture', 0.054), ('examples', 0.053), ('xk', 0.053), ('wta', 0.051), ('wtb', 0.05), ('achieves', 0.048), ('ht', 0.048), ('rows', 0.047), ('coordinate', 0.045), ('convergence', 0.045), ('solution', 0.045), ('bound', 0.044), ('decomposition', 0.043), ('achieving', 0.043), ('subsequence', 0.042), ('descent', 0.04), ('progress', 0.04), ('entries', 0.04), ('inf', 0.04), ('proof', 0.039), ('dependence', 0.039), ('mx', 0.038), ('target', 0.037), ('adj', 0.034), ('combination', 0.034), ('max', 0.033), ('freund', 0.033), ('ax', 0.033), ('feature', 0.032), ('matrix', 0.031), ('suboptimality', 0.031), ('arbitrarily', 0.031), ('theorem', 0.03), ('losses', 0.03), ('norm', 0.029), ('bounds', 0.029), ('rescaled', 0.028), ('optimal', 0.028), ('polynomial', 0.028), ('corollary', 0.028), ('hypothesis', 0.027), ('mi', 0.027), ('worst', 0.026), ('prove', 0.026), ('rated', 0.026), ('onoda', 0.026), ('ak', 0.026), ('columns', 0.025), ('unnormalized', 0.025), ('reference', 0.025), ('doubly', 0.024), ('zero', 0.023), ('inequality', 0.023), ('summation', 0.023), ('sequence', 0.023), ('luo', 0.023), ('achieve', 0.023), ('combinations', 0.023), ('singer', 0.023), ('princeton', 0.022), ('showing', 0.022), ('ces', 0.022), ('det', 0.022), ('min', 0.021), ('positive', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
2 0.24911797 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
3 0.15733138 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
4 0.12812057 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
5 0.12310771 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.11856709 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
7 0.11222787 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
8 0.085559323 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
9 0.073711008 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
10 0.071636237 76 jmlr-2013-Nonparametric Sparsity and Regularization
11 0.06238018 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
12 0.0620181 68 jmlr-2013-Machine Learning with Operational Costs
13 0.059547722 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
14 0.053834636 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.051935025 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
16 0.049035087 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
17 0.048072398 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
18 0.046883773 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
19 0.046748653 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
20 0.046106987 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
topicId topicWeight
[(0, -0.265), (1, 0.168), (2, 0.112), (3, 0.206), (4, -0.075), (5, -0.214), (6, 0.239), (7, -0.169), (8, -0.172), (9, -0.053), (10, 0.2), (11, 0.203), (12, -0.038), (13, -0.099), (14, 0.009), (15, -0.062), (16, -0.014), (17, 0.065), (18, 0.002), (19, 0.007), (20, 0.133), (21, 0.043), (22, 0.033), (23, -0.013), (24, -0.026), (25, 0.06), (26, 0.113), (27, -0.075), (28, 0.004), (29, -0.048), (30, -0.029), (31, -0.006), (32, 0.028), (33, -0.027), (34, 0.002), (35, 0.079), (36, 0.017), (37, 0.012), (38, 0.014), (39, -0.005), (40, -0.069), (41, 0.024), (42, -0.035), (43, 0.011), (44, -0.023), (45, 0.024), (46, 0.055), (47, -0.064), (48, -0.008), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.95701069 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
2 0.82679832 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
3 0.68358666 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
4 0.54875302 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
5 0.54367161 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
6 0.44050172 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
7 0.4128809 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
8 0.39932275 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
9 0.36070454 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
10 0.34782222 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
11 0.34504557 68 jmlr-2013-Machine Learning with Operational Costs
12 0.31853557 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
13 0.31819016 73 jmlr-2013-Multicategory Large-Margin Unified Machines
14 0.31600609 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.3139284 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
16 0.30793992 76 jmlr-2013-Nonparametric Sparsity and Regularization
17 0.30744642 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
18 0.29993039 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
19 0.29861641 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
20 0.29197153 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
topicId topicWeight
[(0, 0.021), (5, 0.3), (6, 0.036), (9, 0.01), (10, 0.06), (14, 0.021), (17, 0.218), (20, 0.013), (23, 0.027), (68, 0.021), (70, 0.033), (75, 0.031), (85, 0.079), (87, 0.022), (89, 0.014), (93, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.88479435 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
2 0.81100601 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
3 0.80866611 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
4 0.8035804 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
5 0.80215806 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
6 0.79213631 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.79062301 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
8 0.78751922 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
9 0.78611147 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
10 0.78574342 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
11 0.78366119 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
12 0.78316379 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
13 0.77859128 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
14 0.77658498 73 jmlr-2013-Multicategory Large-Margin Unified Machines
16 0.77542049 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
17 0.77397513 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
18 0.76819134 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
19 0.76226133 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
20 0.76120561 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion