jmlr2005-29: knowledge-graph by maker-knowledge-mining

29 jmlr-2005-Efficient Margin Maximizing with Boosting


Source: pdf

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. [sent-8, score-0.272]

2 The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. [sent-12, score-0.326]

3 The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. [sent-13, score-0.311]

4 This base hypothesis is used to update the distribution on the examples: The algorithm increases the weights of those examples that were wrongly classified by the base hypothesis. [sent-21, score-0.302]

5 At the end of each stage the base hypothesis is added to the linear combination, and the sign of this linear combination forms the current hypothesis of the boosting algorithm. [sent-22, score-0.447]

6 It is “adaptive” in that the linear coefficient of the base hypothesis depends on the weighted error of the base hypothesis at the time when the base hypothesis was added to the linear combination. [sent-46, score-0.634]

7 The signed linear combination can be viewed as a homogeneous hyperplane in a feature space, where each base hypothesis represents one feature or dimension. [sent-52, score-0.283]

8 , 2001) that improve with the margin of the classifier, which is defined as the size of the minimum margin of the examples. [sent-60, score-0.41]

9 In this paper we present an algorithm that produces a final hypothesis with margin at least ρ∗ − ν, where ρ∗ is the unknown maximum margin achievable by any convex combination of base hypotheses and ν a precision parameter. [sent-69, score-0.864]

10 (2001); Rätsch and Warmuth (2002)): When the parameter ρ of AdaBoostρ is set to ρ∗ − ν, then after 2 ln N / ν² iterations, where N is the number of examples, the margin of the produced linear combination is guaranteed to be at least ρ∗ − ν. [sent-72, score-0.365]

11 In a preliminary conference paper (Rätsch and Warmuth, 2002) we used AdaBoostρ iteratively in a binary-search-like fashion: log2(2/ν) calls to AdaBoostρ are guaranteed to produce a margin of at least ρ∗ − ν. [sent-74, score-0.353]

12 The new algorithm AdaBoost∗ν uses the parameter ν and a current estimate of the achievable margin in the computation of the linear coefficients of the base learners. [sent-79, score-0.326]

13 Except for the algorithm presented in the previous conference paper, this is the first result on the fast convergence of a boosting algorithm to the maximum margin solution that works for all ρ∗ ∈ [−1, 1]. [sent-80, score-0.295]

14 Using previous results one can only show that AdaBoost asymptotically converges to a final hypothesis with margin at least ρ∗/2 if ρ∗ > 0 and if subtle conditions on the chosen base hypotheses are satisfied (cf. [sent-81, score-0.548]

15 Recently other versions of AdaBoost have been published that are guaranteed to produce a linear combination of margin at least ρ∗ − ν after Ω(ν^−3) iterations (Rudin et al. [sent-83, score-0.301]

16 The original AdaBoost was designed to find a final hypothesis of margin at least zero. [sent-88, score-0.327]

17 Our algorithm is also useful when the base hypotheses given to the Boosting algorithm are strong in the sense that they already separate the data and have margin greater than zero, but less than one. [sent-94, score-0.426]

18 The big advantage of this algorithm is an absolute bound on the number of iterations: after 2 ln N / ν² iterations, AdaBoost∗ν is guaranteed to produce a hypothesis with margin at least ρ∗ − ν. [sent-97, score-0.484]

19 First, we prove that if the weighted training error of the t-th base hypothesis is εt = 1/2 − (1/2)γt, then an upper bound on the fraction of examples with margin smaller than ρ is reduced by a factor of 1 − (1/2)(ρ − γt)² at stage t of AdaBoostρ (cf. [sent-104, score-0.464]

20 We show that after roughly 2 ln N / ν² iterations our new one-pass algorithm AdaBoost∗ν is guaranteed to produce a linear combination with margin at least ρ∗ − ν. [sent-110, score-0.301]

21 This strengthens the results of our preliminary conference paper (Rätsch and Warmuth, 2002), which had an additional log2(2/ν) factor in the total number of times the weak learner is called and much higher constants. [sent-111, score-0.269]

22 In each iteration t, a base hypothesis ht ∈ H with a non-negative coefficient αt is added to the linear combination. [sent-127, score-0.33]

23 We denote the combined hypothesis as follows (note that we normalized the weights): f̃α(x) = sign fα(x), where fα(x) = ∑_{t=1}^T (αt / ∑_{r=1}^T αr) ht(x), with ht ∈ H and αt ≥ 0. [sent-128, score-0.354]
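
As an illustration of this normalized combination (a sketch added here, not code from the paper), the base hypotheses are assumed to be callables returning values in [−1, +1]:

```python
import numpy as np

def combined_hypothesis(alphas, base_hypotheses, x):
    """Normalized combination f_alpha(x) and the prediction sign(f_alpha(x)).

    alphas          : non-negative coefficients alpha_1, ..., alpha_T
    base_hypotheses : callables h_t(x) with outputs in [-1, +1]
    """
    alphas = np.asarray(alphas, dtype=float)
    weights = alphas / alphas.sum()                 # normalize the coefficients
    f = sum(w * h(x) for w, h in zip(weights, base_hypotheses))
    return np.sign(f), f                            # predicted label and real-valued score
```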

24 The “black box” that selects the base hypothesis in each iteration is called the weak learner. [sent-129, score-0.304]

25 For AdaBoost, it has been shown that if the weak learner is guaranteed to select base hypotheses of weighted error slightly below 50%, then the combined hypothesis is consistent with the training set in a small number of iterations (Freund and Schapire, 1997). [sent-130, score-0.611]

26 Since at most one new base hypothesis is added in each iteration, the size of the final hypothesis is bounded by the number of iterations. [sent-132, score-0.326]

27 , 1998) it was also shown that a bound on the generalization error decreases with the size of the margin of the final hypothesis f . [sent-135, score-0.344]

28 Thus the margin quantifies by how far this example is on the yn side of the hyperplane f˜. [sent-140, score-0.328]

29 The margin of the combined hypothesis f is the minimum margin of all N examples. [sent-142, score-0.558]
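
A small illustrative sketch (ours) of these two quantities, assuming the normalized scores fα(xn) have already been computed:

```python
import numpy as np

def margins(y, scores):
    """Per-example margins y_n * f_alpha(x_n) and the margin of f (their minimum over the N examples)."""
    y = np.asarray(y, dtype=float)             # labels in {-1, +1}
    scores = np.asarray(scores, dtype=float)   # normalized scores f_alpha(x_n), values in [-1, +1]
    per_example = y * scores
    return per_example, float(per_example.min())
```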

30 The goal of this paper is to find a small non-negative linear combination of base hypotheses from H with margin close to the maximum achievable margin. [sent-143, score-0.496]

31 AdaBoostρ and AdaBoost∗ν: The original AdaBoost was designed to find a consistent hypothesis f̃, which is defined as a signed linear combination f with margin greater than zero. [sent-145, score-0.386]

32 We start with a slight modification of AdaBoost, which finds (if possible) a linear combination of base learners with margin ρ, where ρ is a parameter (cf. [sent-146, score-0.318]

33 The only difference from AdaBoost is the choice of the hypothesis coefficients αt: an additional term −(1/2) ln((1+ρ)/(1−ρ)) appears in the expression for the hypothesis coefficient αt. [sent-167, score-0.319]
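
In code, the coefficient rule reads as follows (an illustrative sketch, assuming ±1-valued base hypotheses; for ρ = 0 it reduces to the usual AdaBoost coefficient):

```python
import numpy as np

def alpha_adaboost_rho(gamma_t, rho):
    """Coefficient of AdaBoost_rho:
    alpha_t = 1/2 * ln((1 + gamma_t) / (1 - gamma_t)) - 1/2 * ln((1 + rho) / (1 - rho)),
    where gamma_t is the edge of the t-th base hypothesis and rho the target margin."""
    return (0.5 * np.log((1 + gamma_t) / (1 - gamma_t))
            - 0.5 * np.log((1 + rho) / (1 - rho)))
```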

34 Assume the weak learner returns the constant hypothesis ht (x) ≡ 1. [sent-179, score-0.408]

35 For a more general base hypothesis ht with continuous range [−1, +1], choosing αt such that Zt, as a function of αt, is minimized guarantees that the edge of ht with respect to the distribution d^{t+1} is ρ. [sent-185, score-0.520]

36 This equals the iteration bound for the best algorithm we know of for the case when ρ∗ is known and we seek a linear combination of margin at least ρ∗ − ν: AdaBoostρ with parameter ρ = ρ∗ − ν. [sent-207, score-0.276]

37 For t = 1, . . . , T: (a) Train classifier on {S, d^t} and obtain hypothesis ht: x → [−1, 1]; (b) Calculate the edge γt of ht: γt = ∑_{n=1}^N d_n^t yn ht(xn); (c) if |γt| = 1, then α1 = sign(γt), h1 = ht, T = 1; break; (d) γt^min = min_{r=1,...,t} γr; ρt = γt^min − ν. [sent-226, score-0.886]

38 (e) Set αt = (1/2) ln((1+γt)/(1−γt)) − (1/2) ln((1+ρt)/(1−ρt)); (f) Update weights: d_n^{t+1} = d_n^t exp(−αt yn ht(xn)) / Zt, where Zt = ∑_{n=1}^N d_n^t exp(−αt yn ht(xn)). [sent-229, score-0.746]
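
To make steps (a)-(f) concrete, here is a compact Python sketch of the loop; the weak learner is abstracted as a callable `train(X, y, d)` returning a hypothesis with outputs in [−1, +1]. This is an illustrative reimplementation of the listed steps, not the authors' code, and the clamp on ρt is an added numerical guard.

```python
import numpy as np

def adaboost_star_nu(X, y, train, nu, T):
    """Sketch of AdaBoost*_nu following steps (a)-(f) above.

    X, y  : N examples and labels in {-1, +1}
    train : weak learner; train(X, y, d) returns a callable h with h(X) in [-1, +1]^N
    nu    : precision parameter
    T     : number of iterations
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    d = np.full(N, 1.0 / N)                          # uniform initial distribution d^1
    hypotheses, alphas, edges = [], [], []
    for t in range(T):
        h = train(X, y, d)                           # (a) train on the weighted sample
        gamma = float(np.sum(d * y * h(X)))          # (b) edge gamma_t of h_t
        if abs(gamma) == 1.0:                        # (c) a single hypothesis already suffices
            return [h], [float(np.sign(gamma))]
        edges.append(gamma)
        rho_t = min(edges) - nu                      # (d) margin target rho_t = gamma_t^min - nu
        rho_t = max(rho_t, -1.0 + 1e-12)             # numerical guard (not part of the listing)
        alpha = (0.5 * np.log((1 + gamma) / (1 - gamma))
                 - 0.5 * np.log((1 + rho_t) / (1 - rho_t)))   # (e)
        d = d * np.exp(-alpha * y * h(X))            # (f) reweight ...
        d /= d.sum()                                 #     ... and renormalize (divide by Z_t)
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas
```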

39 Weak learning and margins: The standard assumption made on the weak learning algorithm for the PAC analysis of Boosting is that the weak learner returns a hypothesis h from a fixed set H that is slightly better than random guessing. [sent-234, score-0.464]

40 In Boosting this is extended to weighted example sets and the error is defined as εh(d) = ∑_{n=1}^N dn I(yn ≠ h(xn)), where h is the hypothesis returned by the weak learner and I is the indicator function with I(true) = 1 and I(false) = 0. [sent-238, score-0.366]

41 When the range of a hypothesis h is the entire interval [−1, +1], then the edge γh(d) = ∑_{n=1}^N dn yn h(xn) is a more convenient quantity for measuring the quality of h. [sent-243, score-0.402]
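
For ±1-valued hypotheses the two quantities are linked by γh(d) = 1 − 2 εh(d) when d sums to one; a small sketch of both (our illustration), with the hypothesis predictions passed as a vector:

```python
import numpy as np

def weighted_error(d, y, h_pred):
    """eps_h(d) = sum_n d_n * I(y_n != h(x_n))."""
    d, y, h_pred = map(np.asarray, (d, y, h_pred))
    return float(np.sum(d * (y != h_pred)))

def edge(d, y, h_pred):
    """gamma_h(d) = sum_n d_n * y_n * h(x_n); equals 1 - 2 * eps_h(d) for ±1-valued h."""
    d, y, h_pred = map(np.asarray, (d, y, h_pred))
    return float(np.sum(d * y * h_pred))
```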

42 Recall from Section 2 that the margin of a given example (xn, yn) is defined as yn fα(xn). [sent-245, score-0.411]

43 Thus, the minimum edge γ∗ that can be achieved over all possible distributions d of the training set is equal to the maximum margin ρ∗ of any linear combination of hypotheses from H . [sent-256, score-0.485]
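
One way to write this duality in symbols (our rendering of the statement above, following the usual min-max form):

```latex
\gamma^* \;=\; \min_{d}\;\max_{h \in H}\;\gamma_h(d)
        \;=\; \max_{\alpha \ge 0,\;\sum_t \alpha_t = 1}\;\min_{n = 1,\dots,N}\; y_n f_\alpha(x_n)
        \;=\; \rho^* .
```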

44 In particular, if the weak learning algorithm is guaranteed to return a hypothesis with edge at least γ for any distribution on the examples, then γ∗ ≥ γ and by the above duality there exists a combined hypothesis with margin at least γ. [sent-261, score-0.707]

45 If γ is equal to its upper bound γ∗ then there exists a combined hypothesis with margin exactly γ = ρ∗ that only uses hypotheses that are actually returned by the weak learner in response to certain distributions on the examples. [sent-262, score-0.664]

46 From this discussion we can derive a sufficient condition on the weak learning algorithm to reach the maximum margin (for the case when H is finite). [sent-263, score-0.282]

47 If the weak learner returns hypotheses whose edges are at least γ∗ , then there exists a linear combination of these hypotheses that attains a margin γ∗ = ρ∗ . [sent-264, score-0.721]

48 Lemma 2 For any ρ ∈ [−1, 1], the final hypothesis fα of AdaBoost{ρt} satisfies the following inequality: (1/N) ∑_{n=1}^N I(yn fα(xn) ≤ ρ) ≤ (∏_{t=1}^T Zt) exp(ρ ∑_{t=1}^T αt) = ∏_{t=1}^T exp{ραt + ln Zt} (3), where Zt = ∑_{n=1}^N d_n^t exp(−αt yn ht(xn)) and αt = (1/2) ln((1+γt)/(1−γt)) − (1/2) ln((1+ρt)/(1−ρt)). [sent-294, score-0.695]

49 Then for all ρ ∈ [−1, 1], exp{ραt + ln Zt} ≤ exp(−((1+ρ)/2) ln((1+ρt)/(1+γt)) − ((1−ρ)/2) ln((1−ρt)/(1−γt))). [sent-300, score-0.275]
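
For a ±1-valued base hypothesis with edge exactly γt, the quantity Zt has the closed form used below and the bound holds with equality; the following small check (our illustration, under that ±1 assumption) evaluates both sides:

```python
import numpy as np

def lemma3_sides(gamma_t, rho_t, rho):
    """Left side exp(rho*alpha_t + ln Z_t) and the bound of Lemma 3,
    for a ±1-valued base hypothesis with edge gamma_t."""
    alpha = (0.5 * np.log((1 + gamma_t) / (1 - gamma_t))
             - 0.5 * np.log((1 + rho_t) / (1 - rho_t)))
    Z = 0.5 * (1 + gamma_t) * np.exp(-alpha) + 0.5 * (1 - gamma_t) * np.exp(alpha)
    lhs = np.exp(rho * alpha + np.log(Z))
    rhs = np.exp(-(1 + rho) / 2 * np.log((1 + rho_t) / (1 + gamma_t))
                 - (1 - rho) / 2 * np.log((1 - rho_t) / (1 - gamma_t)))
    return lhs, rhs

print(lemma3_sides(gamma_t=0.4, rho_t=0.3, rho=0.2))   # the two values coincide in this case
```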

50 Now we determine an upper bound on the number of iterations needed by AdaBoostρ for achieving a margin of ρ on all examples, given that the maximum margin is ρ∗ : Corollary 4 Assume the weak learner always returns a base hypothesis with an edge γt ≥ ρ∗ . [sent-310, score-0.974]

51 If 0 ≤ ρ ≤ ρ∗ − ν, ν > 0, then AdaBoostρ will converge to a solution with margin at least ρ on all examples in at most 2 ln(N)(1 − ρ²)/ν² iterations. [sent-311, score-0.296]

52 Proof By Lemma 2 and (4), (5): (1/N) ∑_{n=1}^N I(yn f(xn) ≤ ρ) < ∏_{t=1}^T (1 − (1/2)(ρ − γt)²/(1 − ρ²)) ≤ (1 − (1/2)ν²/(1 − ρ²))^T. The margin is at least ρ for all examples if the rhs is smaller than 1/N; this is the case for T ≥ ln N / (−ln(1 − (1/2)ν²/(1 − ρ²))). [sent-312, score-0.426]
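
Read numerically, the bound of Corollary 4 behaves as follows (a small illustration added here; taking the ceiling is our convention for turning the bound into an integer):

```python
import math

def iterations_bound(N, rho, nu):
    """Upper bound of Corollary 4: 2 * ln(N) * (1 - rho**2) / nu**2 iterations of AdaBoost_rho
    suffice to reach margin rho on all N examples, assuming every edge satisfies gamma_t >= rho + nu."""
    return math.ceil(2 * math.log(N) * (1 - rho**2) / nu**2)

print(iterations_bound(N=1000, rho=0.0, nu=0.1))   # 1382 for this setting
```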

53 Asymptotic Margin of AdaBoostρ: With the methods shown so far we can determine the asymptotic value of the margin of the hypothesis produced by the original AdaBoost algorithm. [sent-318, score-0.327]

54 In a second part we consider an experiment that shows that depending on some subtle properties of the weak learner, the margin of combined hypotheses generated by AdaBoost can converge to quite different values (while the maximum margin is kept constant). [sent-321, score-0.652]

55 Then the margin ρ̂ of the combined hypothesis is bounded from below by ρ̂ ≥ [ln(1 − ρ²) − ln(1 − (γmin)²)] / [ln((1+γmin)/(1−γmin)) − ln((1+ρ)/(1−ρ))]. [sent-348, score-0.503]

56 A Taylor expansion of the rhs of (7) shows that the margin is lower bounded by (γmin + ρ)/2. [sent-353, score-0.295]
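
Evaluating the bound (7) and its Taylor approximation side by side (our illustration):

```python
import numpy as np

def margin_lower_bound(gamma_min, rho):
    """Lower bound (7) on the asymptotic margin of AdaBoost_rho."""
    num = np.log(1 - rho**2) - np.log(1 - gamma_min**2)
    den = np.log((1 + gamma_min) / (1 - gamma_min)) - np.log((1 + rho) / (1 - rho))
    return num / den

gamma_min, rho = 0.3, 0.0
print(margin_lower_bound(gamma_min, rho), (gamma_min + rho) / 2)   # ~0.152 vs 0.15
```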

57 We analyze two different settings: (i) the weak learner selects the hypothesis with largest edge over all hypotheses (i.e., the best case). [sent-364, score-0.526]

58 (ii) the weak learner selects the hypothesis with minimum edge among all hypotheses with edge larger than ρ∗ (i.e., the worst case). [sent-366, score-0.636]

59 Corollary 5 holds for both cases since the weak learner is allowed to return any hypothesis with edge larger than ρ∗ . [sent-369, score-0.387]

60 We do not allow duplicate hypotheses or hypotheses that agree with the labels on all examples. [sent-391, score-0.278]

61 On the abscissa is the maximum achievable margin ρ∗ and on the ordinate the margin achieved by AdaBoostρ for one data realization. [sent-395, score-0.473]

62 The margin of the worst strategy is very close to the lower bound (7) and the best strategy has near maximum margin. [sent-398, score-0.278]

63 If ρ is chosen slightly below the maximum achievable margin then this gap is reduced to 0. [sent-399, score-0.287]

64 Decreasing the Step Size: Breiman (1999) conjectured that the inability to maximize the margin is due to the fact that the normalized hypothesis coefficients may “circulate endlessly through the convex set”, which is defined by the lower bound on the margin. [sent-407, score-0.370]

65 In fact, motivated from our previous experiments, it seems possible to implement a weak learner that appropriately switches between optimal and worst case performance, leading to non-convergent normalized hypothesis coefficients. [sent-408, score-0.307]

66 Thus if the weak learner always returns hypotheses with edges γt ≥ ρ∗ (t = 1, 2, . [sent-416, score-0.346]

67 Convergence of AdaBoost∗ν: The AdaBoost∗ν algorithm is based on two insights: • According to the discussion after Lemma 3, the most rapid convergence to a combined hypothesis with margin ρ∗ − ν occurs for AdaBoostρ when one chooses ρt as close as possible to ρ∗ − ν. [sent-422, score-0.353]

68 the weak learner achieves a small edge), the edge γt will be close to ρ∗ . [sent-425, score-0.28]

69 This forces an acceleration of the convergence to a large margin and leads to distributions for which the weak learner has to return small edges. [sent-430, score-0.36]

70 Note that if the weak learner always returns hypotheses with edge γt = ρ∗, which is the worst case under the assumption that γt ≥ ρ∗, then ρt = ρ∗ − ν in each iteration. [sent-431, score-0.462]

71 We will now state and prove our main theorem: Theorem 6 Assume the weak learner always returns a base hypothesis with an edge γt ≥ ρ∗ . [sent-436, score-0.497]

72 Then after 2 ln N / ν² iterations, AdaBoost∗ν (Algorithm 2) is guaranteed to produce a combined hypothesis f of margin at least ρ∗ − ν. [sent-437, score-0.418]

73 Also note that if the output of the hypotheses is discrete, the hypothesis space is effectively finite (Rätsch et al. [sent-454, score-0.375]

74 However, Lemma 3 and Theorem 6 do not assume finite hypothesis sets and show that the margin will converge arbitrarily close to ρ∗ , as long as the weak learning algorithm can return a hypothesis in each iteration that has an edge not smaller than ρ∗ . [sent-468, score-0.659]

75 By assuming that the weak learner is always able to pick good enough hypotheses (≥ ρ∗ ), one automatically gets by Lemma 3 that Γ = 0. [sent-472, score-0.294]

76 Furthermore, since in this case the margin is already maximum (equal to 1), boosting algorithms would stop since γ = 1. [sent-514, score-0.295]

77 However, this amplifies the problem that the resulting hypotheses might in some cases have an edge smaller than the maximum margin, which according to the Min-Max-Theorem should not occur if the weak learner performs optimally. [sent-522, score-0.404]

78 5 on different bootstrap realizations if the edge is smaller than the margin of the current linear combination. [sent-524, score-0.315]

79 Furthermore, for AdaBoost∗ν, a small edge of one hypothesis can spoil the margin estimate ρt. [sent-525, score-0.437]

80 Marginal AdaBoost as proposed in Rätsch and Warmuth (2002) proceeds in stages and first tries to find an estimate of the margin using a binary search. [sent-534, score-0.319]

81 In a run of Arc-GV for one thousand iterations we observe a margin of the combined hypothesis of . [sent-549, score-0.403]

82 In this case the margin for AdaBoost∗ν is larger than the margins of all other algorithms when executed for one thousand iterations. [sent-552, score-0.287]

83 It starts with slightly lower margins in the beginning, but then catches up due to the better choice of the margin estimate. [sent-553, score-0.300]

84 The slightly larger margins generated by Marginal AdaBoost can be attributed to the fact that it uses many more calls to the weak learner than AdaBoost∗ν, and after an estimate of the achievable margin is available, it starts optimizing the linear combination using this estimate. [sent-587, score-0.531]

85 The hypothesis produced in the second pass should have a larger margin and use fewer base hypotheses. [sent-589, score-0.409]

86 Heuristics for Tuning the Precision Parameter ν: Our main result says that after 2 ln N / ν² iterations AdaBoost∗ν produces a hypothesis of margin at least ρ∗ − ν. [sent-591, score-0.465]

87 If ν is chosen much larger than νT, then after T iterations AdaBoost∗ν often achieves a margin below ρ∗ − νT. [sent-593, score-0.270]
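
Inverting the 2 ln N / ν² bound for a fixed iteration budget T suggests the choice νT = sqrt(2 ln N / T); this inversion is our reading of the stated bound, not a formula quoted from the text:

```python
import math

def nu_for_budget(N, T):
    """nu_T chosen so that the bound 2 * ln(N) / nu_T**2 equals the iteration budget T."""
    return math.sqrt(2 * math.log(N) / T)

print(round(nu_for_budget(N=50, T=20000), 3))   # ~0.02, matching the value discussed below
```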

88 However, their observations were only due to the improper choice of the accuracy parameter ν for AdaBoost∗ν: for ν = 10^−3 (as chosen in their study), AdaBoost∗ν would need millions of iterations to achieve a guaranteed margin of ρ∗ − ν. [sent-599, score-0.270]

89 (2004b)): The number of iterations is T = 20K, the dimension of the examples is N = 50, and we assume that the base learner returns a hypothesis with maximum edge. [sent-608, score-0.376]

90 If ν is set to the value 0.02 prescribed by our bound, then AdaBoost∗ν achieves a margin which is significantly larger than the margins achieved by the other algorithms. [sent-610, score-0.302]

91 In the case when the base learner returns a random hypothesis with an edge only at least as large as ρ∗, our algorithm compares even more favorably (not shown). [sent-614, score-0.420]

92 From von Neumann’s Min-Max theorem we know that if the weak learner always returns a hypothesis with weighted classification error less than 1/2 − (1/2)γ, then the maximum achievable margin ρ∗ is at least γ. [sent-617, score-0.585]

93 The asymptotic analysis led us to a lower bound on the margin of the final hypotheses generated by AdaBoostρ, which was shown to be rather tight in empirical cases. [sent-618, score-0.361]

94 To overcome these problems we provided an algorithm AdaBoost∗ν with the following provable guarantees: it produces a linear combination with margin at least ρ∗ − ν, and the number of base hypotheses used in this linear combination is at most 2 ln N / ν². [sent-620, score-0.592]

95 The new algorithm decreases its estimate ρ of the margin iteratively, such that the gap between the best and the worst case becomes arbitrarily small. [sent-621, score-0.278]

96 Margins: First recall the definition of margin used in this paper, which is defined for a fixed set of examples {(xn, yn) : 1 ≤ n ≤ N} and a set of hypotheses H = {h1, . . . , hM}. [sent-625, score-0.463]

97 It is well known that for a fixed example (xn, yn) and normal α ∈ R^M, the one-norm margin ∑_{m=1}^M αm hm(xn) / ∑_{m=1}^M |αm| is the minimum ℓ∞-distance of the example to the hyperplane with normal α (Mangasarian, 1999; Rätsch et al. [sent-634, score-0.442]

98 In summary, if the one-norm margin of H is non-negative, then the margin of the closed hypothesis class cl(H) coincides with the one-norm margin. [sent-651, score-0.549]

99 In this case the hypotheses are α vectors and the edge is ∑_{j=1}^J βj Sj(α) = −(1/2) ∑_{r,s} αr αs yr ys ∑_{j=1}^J βj kj(xr, xs) + ∑_i αi. [sent-671, score-0.276]

100 Analysis of boosting algorithms using the smooth margin function: A study of three algorithms. [sent-870, score-0.295]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adaboost', 0.861), ('margin', 0.205), ('hypotheses', 0.139), ('hypothesis', 0.122), ('tsch', 0.114), ('edge', 0.11), ('ht', 0.103), ('yn', 0.103), ('boosting', 0.09), ('base', 0.082), ('margins', 0.082), ('rudin', 0.08), ('learner', 0.078), ('weak', 0.077), ('fficient', 0.077), ('ln', 0.075), ('atsch', 0.071), ('rhs', 0.071), ('warmuth', 0.068), ('dn', 0.067), ('aximization', 0.064), ('oosting', 0.064), ('argin', 0.057), ('schapire', 0.051), ('iterations', 0.05), ('gap', 0.043), ('zt', 0.042), ('achievable', 0.039), ('cl', 0.038), ('breiman', 0.037), ('xn', 0.037), ('freund', 0.032), ('hm', 0.031), ('combination', 0.031), ('duality', 0.03), ('worst', 0.03), ('signed', 0.028), ('returns', 0.028), ('marginal', 0.028), ('combined', 0.026), ('sonnenburg', 0.026), ('exp', 0.025), ('edges', 0.024), ('grove', 0.023), ('xr', 0.023), ('iteration', 0.023), ('ensemble', 0.022), ('weighted', 0.022), ('tt', 0.021), ('hyperplane', 0.02), ('nash', 0.019), ('min', 0.019), ('calls', 0.019), ('coef', 0.018), ('mkl', 0.017), ('cruz', 0.017), ('corollary', 0.017), ('bound', 0.017), ('experimentally', 0.016), ('santa', 0.016), ('examples', 0.016), ('precision', 0.016), ('provable', 0.016), ('corrective', 0.015), ('demiriz', 0.015), ('neurocolt', 0.015), ('player', 0.015), ('sofer', 0.015), ('vanilla', 0.015), ('guaranteed', 0.015), ('achieves', 0.015), ('dt', 0.015), ('von', 0.014), ('lanckriet', 0.014), ('maximize', 0.014), ('tuning', 0.014), ('xs', 0.014), ('strategy', 0.013), ('bach', 0.013), ('kivinen', 0.013), ('maximizes', 0.013), ('cients', 0.013), ('nds', 0.013), ('produces', 0.013), ('abscissa', 0.013), ('hettich', 0.013), ('catches', 0.013), ('assures', 0.013), ('arcing', 0.013), ('minr', 0.013), ('yr', 0.013), ('game', 0.013), ('bounds', 0.012), ('continues', 0.012), ('nal', 0.012), ('convex', 0.012), ('quinlan', 0.011), ('ordinate', 0.011), ('koltchinskii', 0.011), ('daubechies', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

2 0.11742266 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

3 0.090895452 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

4 0.088018753 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

5 0.057201542 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

Author: Savina Andonova Jaeger

Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes

6 0.052110225 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

7 0.042955205 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

8 0.035874426 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

9 0.035553325 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

10 0.033935297 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

11 0.032284528 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

12 0.029402213 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

13 0.029227557 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

14 0.025638096 71 jmlr-2005-Variational Message Passing

15 0.025497377 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

16 0.024379959 11 jmlr-2005-Algorithmic Stability and Meta-Learning

17 0.023391698 32 jmlr-2005-Expectation Consistent Approximate Inference

18 0.023168517 22 jmlr-2005-Concentration Bounds for Unigram Language Models

19 0.022870155 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

20 0.022408927 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.154), (1, -0.049), (2, 0.146), (3, 0.13), (4, -0.235), (5, -0.243), (6, 0.012), (7, -0.041), (8, 0.136), (9, 0.216), (10, -0.017), (11, -0.075), (12, 0.049), (13, -0.087), (14, 0.084), (15, -0.294), (16, -0.086), (17, -0.194), (18, -0.053), (19, 0.062), (20, -0.102), (21, -0.184), (22, 0.048), (23, 0.05), (24, 0.162), (25, 0.013), (26, -0.118), (27, 0.032), (28, -0.041), (29, 0.04), (30, 0.089), (31, -0.083), (32, -0.017), (33, 0.14), (34, -0.118), (35, 0.026), (36, 0.011), (37, -0.089), (38, -0.133), (39, 0.092), (40, -0.037), (41, 0.012), (42, -0.124), (43, 0.002), (44, -0.101), (45, -0.019), (46, -0.033), (47, 0.181), (48, -0.247), (49, -0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97434801 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

2 0.45239794 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

3 0.42724633 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

4 0.29719636 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

5 0.286755 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

Author: Savina Andonova Jaeger

Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes

6 0.16655704 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

7 0.15665884 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

8 0.15424164 11 jmlr-2005-Algorithmic Stability and Meta-Learning

9 0.12986895 71 jmlr-2005-Variational Message Passing

10 0.12814164 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

11 0.12040151 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

12 0.11992002 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

13 0.11360277 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

14 0.10294592 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

15 0.097631395 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

16 0.091914311 48 jmlr-2005-Learning the Kernel Function via Regularization

17 0.090910286 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

18 0.0901509 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

19 0.089484312 41 jmlr-2005-Kernel Methods for Measuring Independence

20 0.087513991 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.014), (17, 0.02), (19, 0.066), (36, 0.017), (37, 0.021), (42, 0.011), (43, 0.034), (47, 0.02), (52, 0.066), (59, 0.025), (70, 0.025), (88, 0.084), (90, 0.029), (94, 0.458)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86541307 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

2 0.83420706 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

Author: Fabio Aiolli, Alessandro Sperduti

Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers

3 0.83288169 5 jmlr-2005-A Generalization Error for Q-Learning

Author: Susan A. Murphy

Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data

4 0.67739755 64 jmlr-2005-Semigroup Kernels on Measures

Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert

Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space

5 0.46994352 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

6 0.38741565 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

7 0.38303959 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

8 0.38202077 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

9 0.3765713 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

10 0.36963102 11 jmlr-2005-Algorithmic Stability and Meta-Learning

11 0.36667532 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

12 0.36458242 67 jmlr-2005-Stability of Randomized Learning Algorithms

13 0.36080253 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

14 0.3532967 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

15 0.34191293 54 jmlr-2005-Managing Diversity in Regression Ensembles

16 0.33989042 48 jmlr-2005-Learning the Kernel Function via Regularization

17 0.33984867 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

18 0.33138841 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

19 0.32792991 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture

20 0.32635096 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve