jmlr jmlr2005 jmlr2005-57 knowledge-graph by maker-knowledge-mining

57 jmlr-2005-Multiclass Boosting for Weak Classifiers


Source: pdf

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. [sent-8, score-0.332]

2 These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. [sent-11, score-0.213]

3 An exponential decrease of an upper bound of the training error rate is guaranteed as long as the error rates of the base classifiers are less than 1/2. [sent-25, score-0.311]

4 Freund and Schapire overcame this problem with the introduction of the pseudo-loss of a classifier $h : X \times Y \to [0, 1]$: $\varepsilon_t = \frac{1}{2}\big( 1 - h_t(x_i, y_i) + \frac{1}{|Y|-1} \sum_{y \neq y_i} h_t(x_i, y) \big)$. [sent-27, score-0.73]
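
For concreteness, a minimal sketch of how such a pseudo-loss could be evaluated for a confidence-rated base classifier under a sampling distribution D; the weighting by D and all names are my own illustration, not notation taken from the paper.

```python
import numpy as np

def pseudo_loss(h, y_true, D):
    """Pseudo-loss of confidence-rated outputs h[i, y] in [0, 1] under weights D.

    h      : (n_samples, n_classes) confidences h_t(x_i, y)
    y_true : (n_samples,) integer labels y_i
    D      : (n_samples,) sampling distribution, sums to one
    """
    n, k = h.shape
    correct = h[np.arange(n), y_true]                   # h_t(x_i, y_i)
    wrong_mean = (h.sum(axis=1) - correct) / (k - 1)    # average over y != y_i
    return float(np.dot(D, 0.5 * (1.0 - correct + wrong_mean)))
```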

5 In AdaBoost.M2, each base classifier has to minimize the pseudo-loss instead of the error rate. [sent-29, score-0.176]

6 As long as the pseudo-loss is less than 1/2, which is easily reachable for weak base classifiers such as decision stumps, an exponential decrease of an upper bound on the training error rate is guaranteed. [sent-30, score-0.337]

7 In this paper, we will derive two new direct algorithms for multiclass problems with decision stumps as base classifiers. [sent-31, score-0.288]

8 We introduce the maxlabel error rate and derive bounds on it. [sent-38, score-0.268]

9 For both algorithms the goal of the base classifier is to minimize the confidence-rated error rate, which makes them applicable to a wide range of already existing base classifiers. [sent-40, score-0.31]

10 A boosting algorithm calls a given weak classification algorithm h repeatedly in a series of rounds t = 1, . [sent-46, score-0.211]

11 The final classifier H is a weighted majority vote of the T weak classifiers ht where αt is the weight assigned to ht . [sent-52, score-0.729]

12 Mason et al. (1998, 1999) embedded AdaBoost in a more general theory which sees boosting algorithms as gradient descent methods for the minimization of a loss function in function space. [sent-57, score-0.22]

13 Based on the gradient descent framework, we derive a gradient descent algorithm for these loss functions in a straightforward way in Section 2. [sent-60, score-0.191]

14 Finally, the algorithm is simplified for the special case of decision stumps as base classifiers. [sent-68, score-0.226]

15 We are considering a function space $F = \mathrm{lin}(H)$ consisting of functions $f : X \to \mathbb{R}$ of the form $f(x; \alpha, \beta) = \sum_{t=1}^{T} \alpha_t h_t(x; \beta_t)$, $h_t : X \to \{\pm 1\}$, with $\alpha = (\alpha_1, \ldots$ [sent-75, score-0.688]

16 The parameters $\beta_t$ uniquely determine $h_t$; therefore $\alpha$ and $\beta$ uniquely determine $f$. [sent-82, score-0.344]

17 However −∇L( f )(x) is not necessarily an element of F , so we replace it by an element ht of F which is as parallel to −∇L( f )(x) as possible. [sent-87, score-0.344]

18 Using this inner product we can set $\beta_t := \arg\max_{\beta} \langle -\nabla L(f_{t-1}), h(\cdot\,;\beta) \rangle$ and $h_t := h(\cdot\,;\beta_t)$. [sent-90, score-0.382]

19 Now we consider slightly more general exponential loss functions $l(f, i) = \exp[v(f, i)]$ with exponent $v(f, i) = v_0 + \sum_{y} v_y(i)\, f(x_i, y)$, where the choice $v_0 = \frac{1}{2}$ and $v_y(i) = -\frac{1}{2}$ for $y = y_i$, $v_y(i) = \frac{1}{2(|Y|-1)}$ for $y \neq y_i$ leads to the loss function (1). [sent-100, score-0.296]
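
Reading off this choice of $v_0$ and $v_y(i)$ (as reconstructed above), the exponent reduces to half of a per-example pseudo-loss expression; a short check of the algebra:

```latex
v(f,i) = \tfrac{1}{2} - \tfrac{1}{2} f(x_i, y_i)
         + \tfrac{1}{2(|Y|-1)} \sum_{y \neq y_i} f(x_i, y)
       = \tfrac{1}{2}\Big( 1 - f(x_i, y_i)
         + \tfrac{1}{|Y|-1} \sum_{y \neq y_i} f(x_i, y) \Big),
\qquad
l(f,i) = e^{\,v(f,i)}.
```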

20 ————————————————————————————– Input: training set S, maximum number of boosting rounds T 1 Initialisation: f0 := 0, t := 1, ∀i : D1 (i) := N . [sent-103, score-0.199]

21 For $t = 1, \ldots, T$ do: • $h_t = \arg\min_h \sum_i D_t(i)\, v(h, i)$ • If $\sum_i D_t(i)\, v(h_t, i) \geq v_0$: $T := t - 1$, goto output. [sent-107, score-0.407]

22 (iii) The sampling distribution can be updated in a similar way as in AdaBoost using the rule $D_{t+1}(i) = \frac{1}{Z_t}\, D_t(i)\, l(\alpha_t h_t, i)$, where we define $Z_t$ as a normalization constant, $Z_t := \sum_i D_t(i)\, l(\alpha_t h_t, i)$, which ensures that the update $D_{t+1}$ is a distribution. [sent-111, score-0.787]
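
A minimal sketch of how this generic loop and update could be coded; `train_base`, `v`, and `step_length` are hypothetical callables standing in for the base learner, the exponent-loss, and the (separately derived) choice of $\alpha_t$.

```python
import numpy as np

def generic_boost(train_base, v, X, y, T, step_length, v0=0.5):
    """Gradient-descent boosting loop with exponential loss (sketch).

    train_base(X, y, D) -> base classifier h minimizing sum_i D[i] * v(h, i)
    v(h, i)             -> exponent-loss of h on training example i
    step_length(h, D)   -> weight alpha_t of the base classifier
    v0                  -> constant term of the exponent-loss (1/2 as reconstructed above)
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                  # D_1(i) = 1/N
    ensemble = []
    for t in range(T):
        h = train_base(X, y, D)
        vh = np.array([v(h, i) for i in range(N)])
        if np.dot(D, vh) >= v0:              # stopping criterion
            break
        a = step_length(h, D)
        # l(alpha_t * h_t, i) differs from exp(a * v(h, i)) only by a factor
        # that is constant in i, which the normalization below cancels.
        D = D * np.exp(a * vh)
        D = D / D.sum()                      # divide by Z_t
        ensemble.append((a, h))
    return ensemble
```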

23 In contrast to the general framework, the algorithm uses a simple update rule for the sampling distribution as it exists for the original boosting algorithms. [sent-112, score-0.184]

24 The proof basically consists of three steps: the calculation of the gradient, the choice of the base classifier $h_t$ together with the stopping criterion, and the update rule for the sampling distribution. [sent-116, score-0.565]

25 The gradient (2) is supported on the training points: at $x = x_i$ its $y$-entry is proportional to $l(f, i)\, v_y(i)$. Now we insert (2) into $\langle -\nabla L(f_{t-1}), h \rangle$ and get $\langle -\nabla L(f_{t-1}), h \rangle = -\frac{1}{N} \sum_i \sum_y l(f_{t-1}, i)\, v_y(i)\, h(x_i, y) = -\frac{1}{N} \sum_i l(f_{t-1}, i)\, \big( v(h, i) - v_0 \big)$ (3). [sent-121, score-0.742]

26 If we define the sampling distribution $D_t$ up to a positive constant $C_{t-1}$ by $D_t(i) := C_{t-1}\, l(f_{t-1}, i)$ (4), we can write (3) as $\langle -\nabla L(f_{t-1}), h \rangle = -\frac{1}{C_{t-1} N} \sum_i D_t(i)\, \big( v(h, i) - v_0 \big) = -\frac{1}{C_{t-1} N} \Big( \sum_i D_t(i)\, v(h, i) - v_0 \Big)$. [sent-122, score-0.365]

27 Since we require $C_{t-1}$ to be positive, we get the choice of $h_t$ of the algorithm: $h_t = \arg\max_h \langle -\nabla L(f_{t-1}), h \rangle = \arg\min_h \sum_i D_t(i)\, v(h, i)$. [sent-123, score-0.788]

28 (ii) One can verify the stopping criterion of Figure 2 from (5): $\langle -\nabla L(f_{t-1}), h_t \rangle \leq 0 \;\Leftrightarrow\; \sum_i D_t(i)\, v(h_t, i) \geq v_0$. [sent-124, score-0.382]

29 $D_{t+1}(i) = C_t\, l(f_t, i) = C_t\, l(f_{t-1} + \alpha_t h_t, i) = C_t\, l(f_{t-1}, i)\, l(\alpha_t h_t, i) = \frac{C_t}{C_{t-1}}\, D_t(i)\, l(\alpha_t h_t, i)$. [sent-126, score-1.179]

30 This means that the new weight of example $i$ is a constant multiplied with $D_t(i)\, l(\alpha_t h_t, i)$. [sent-127, score-0.344]

31 Choice of $\alpha_t$ and Resulting Algorithm GrPloss: The algorithm above leaves the step length $\alpha_t$, which is the weight of the base classifier $h_t$, unspecified. [sent-131, score-0.458]

32 The corresponding training error rate is denoted by plerr: $\mathrm{plerr} := \frac{1}{N} \sum_{i=1}^{N} I\Big( f(x_i, y_i) < \frac{1}{|Y|-1} \sum_{y \neq y_i} f(x_i, y) \Big)$. [sent-134, score-0.27]
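
A short helper mirroring this definition (the array layout is illustrative):

```python
import numpy as np

def plerr(F, y_true):
    """Pseudo-loss error rate of ensemble scores F[i, y] = f(x_i, y)."""
    n, k = F.shape
    correct = F[np.arange(n), y_true]
    wrong_mean = (F.sum(axis=1) - correct) / (k - 1)
    return float(np.mean(correct < wrong_mean))
```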

33 (i) Similar to Schapire and Singer (1999) we first bound plerr by the product of the normalization constants: $\mathrm{plerr} \leq \prod_{t=1}^{T} Z_t$. [sent-141, score-0.348]

34 …, $|Y|\}$; weak classification algorithm with output $h : X \times Y \to [0, 1]$; optionally $T$: maximal number of boosting rounds. Initialization: $D_1(i) = \frac{1}{N}$. [sent-149, score-0.211]

35 For $t = 1, \ldots, T$: • Train the weak classification algorithm $h_t$ with distribution $D_t$, where $h_t$ should maximize $U_t := \sum_i D_t(i)\, u(h_t, i)$. [sent-153, score-0.729]

36 (ii) Derivation of $\alpha_t$: Now we derive $\alpha_t$ by minimizing the upper bound (6). [sent-161, score-0.177]

37 If we assume that each classifier $h_t$ fulfills $U_t \geq \delta$, we finally get $\prod_{t=1}^{T} Z_t \leq e^{-\delta^2 T / 2}$. [sent-170, score-0.368]

38 Theorem 2: If for all base classifiers $h_t : X \times Y \to [0, 1]$ of the algorithm GrPloss given in Figure 3, $U_t := \sum_i D_t(i)\, u(h_t, i) \geq \delta$ holds for $\delta > 0$, then the pseudo-loss error of the training set fulfills $\mathrm{plerr} \leq \prod_{t=1}^{T} \sqrt{1 - U_t^2} \leq e^{-\delta^2 T / 2}$. [sent-174, score-0.666]
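
The second inequality of Theorem 2 can be checked in one line with $1 - x \leq e^{-x}$; this step is my own filling-in, not quoted from the paper:

```latex
U_t \ge \delta \;\Rightarrow\;
\sqrt{1 - U_t^{2}} \le \sqrt{1 - \delta^{2}}
\le \sqrt{e^{-\delta^{2}}} = e^{-\delta^{2}/2}
\;\Rightarrow\;
\prod_{t=1}^{T} \sqrt{1 - U_t^{2}} \le e^{-\delta^{2} T/2}.
```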

39 Since proportions are in the interval [0, 1], and for each of the two subsets the sum of proportions is one, our decision stumps have both the former and the latter property (10). [sent-182, score-0.188]

40 (ii) Upper bound on the pseudo-loss error: Now we plug $\alpha_t$ in (12) and get $\mathrm{plerr} \leq \prod_{t=1}^{T} \left[ r_t \left( \frac{1 - r_t}{r_t (|Y|-1)} \right)^{(|Y|-1)/|Y|} + (1 - r_t) \left( \frac{r_t (|Y|-1)}{1 - r_t} \right)^{1/|Y|} \right]$ (13). [sent-186, score-1.455]

41 (iii) Stopping criterion: As expected, for $r_t = 1/|Y|$ the corresponding factor is 1. [sent-187, score-0.209]

42 The stopping criterion $U_t \leq 0$ can be directly translated into $r_t \leq 1/|Y|$. [sent-188, score-0.247]

43 Looking at the first and second derivative of the bound one can easily verify that it has a unique maximum at rt = 1/|Y |. [sent-189, score-0.245]

44 Therefore, the bound drops as long as rt > 1/|Y |. [sent-190, score-0.245]

45 Note again that since rt = 1/|Y | is a unique maximum we get a formal decrease of the bound even when rt > 1/|Y |. [sent-191, score-0.495]
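
A quick numeric check of the per-round factor of bound (13) illustrates this behaviour; the helper is my own, assuming the reconstruction of (13) above.

```python
def bound13_factor(r, K):
    """Per-round factor of bound (13) for r_t = r and |Y| = K classes."""
    a = (1 - r) / (r * (K - 1))
    return r * a ** ((K - 1) / K) + (1 - r) * (1 / a) ** (1 / K)

# The factor equals 1 at r = 1/K and is strictly below 1 elsewhere, e.g. for K = 3:
for r in (1 / 3, 0.25, 0.4, 0.6, 0.8):
    print(round(r, 3), round(bound13_factor(r, 3), 4))
```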

46 In AdaBoost.M1 the weight of a base classifier $\alpha$ is a function of the error rate, so we tried to modify this function so that it becomes positive if the error rate is less than the error rate of random guessing. [sent-203, score-0.268]

47 Here we want to increase the weights for instances where the base classifier $h : X \times Y \to [0, 1]$ performs worse than the uninformative rule, or what we call the maxlabel rule. [sent-209, score-0.354]

48 The maxlabel rule labels each instance as the most frequent label. [sent-210, score-0.217]

49 As a confidence-rated classifier, the uninformative rule has the form maxlabel rule $: X \times Y \to [0, 1]$, $h(x, y) := \frac{N_y}{N}$, where $N_y$ is the number of instances in the training set with label $y$. [sent-211, score-0.338]

50 A classifier f : X ×Y → [0, 1] makes a maxlabel error in classifying an example x with label k, if f (x, k) < c. [sent-217, score-0.249]

51 The maxlabel error for the training set is called mxerr: $\mathrm{mxerr} := \frac{1}{N} \sum_{i=1}^{N} I\big( f(x_i, y_i) < c \big)$. [sent-218, score-0.503]
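
The analogous helper for mxerr is even shorter (again only a sketch):

```python
import numpy as np

def mxerr(f_correct, c):
    """Maxlabel error rate; f_correct[i] = f(x_i, y_i), c the maxlabel threshold."""
    return float(np.mean(np.asarray(f_correct) < c))
```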

52 The maxlabel error counts the proportion of elements of the training set for which the confidence $f(x, k)$ in the right label is smaller than $c$. [sent-219, score-0.278]

53 The higher $c$ is, the higher the maxlabel error for the same classifier $f$; therefore, to get a weak error measure, we set $c$ very low. [sent-221, score-0.333]

54 When we use decision stumps as base classifiers we have the property h(x, y) ∈ [0, 1]. [sent-223, score-0.226]

55 As for GrPloss the modus operandi consists of finding an upper bound on mxerr and minimizing the bound with respect to α. [sent-229, score-0.274]

56 (i) Bound of mxerr in terms of the normalization constants $Z_t$: Similar to the calculations used to bound the pseudo-loss error, we begin by bounding mxerr in terms of the normalization constants $Z_t$: we have $1 = \sum_i D_{t+1}(i) = \sum_i D_t(i)\, \frac{e^{-\alpha_t (h_t(x_i, y_i) - c)}}{Z_t} = \ldots$ [sent-230, score-0.538]

57 For $t = 1, \ldots, T$: • Train the weak classification algorithm $h_t$ with distribution $D_t$, where $h_t$ should maximize $r_t = \sum_i D_t(i)\, h_t(x_i, y_i)$ • If $r_t \leq c$: goto output with $T := t - 1$ • Set $\alpha_t = \ln \frac{(1-c)\, r_t}{c\, (1 - r_t)}$. [sent-245, score-1.214]

58 • Update $D$: $D_{t+1}(i) = \frac{1}{Z_t}\, D_t(i)\, e^{-\alpha_t (h_t(x_i, y_i) - c)}$. [sent-246, score-0.209]

59 Output $\alpha_1, \ldots, \alpha_T$ and set the final classifier $H(x) = \arg\max_{y \in Y} f(x, y) = \arg\max_{y \in Y} \sum_{t=1}^{T} \alpha_t h_t(x, y)$ (Figure 4: Algorithm BoostMA). So we get $\prod_t Z_t = \frac{1}{N} \sum_i e^{-\left( f(x_i, y_i) - c \sum_t \alpha_t \right)}$ (14). [sent-250, score-0.444]
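
A compact sketch of the BoostMA loop described in Figure 4; `train_base` and the callable-classifier interface are placeholders introduced here for illustration.

```python
import numpy as np

def boostma(train_base, X, y, n_classes, c, T):
    """BoostMA-style loop (sketch). train_base(X, y, D) returns a
    confidence-rated classifier h with h(x, label) in [0, 1]."""
    N = len(y)
    D = np.full(N, 1.0 / N)
    alphas, hs = [], []
    for t in range(T):
        h = train_base(X, y, D)
        conf = np.array([h(X[i], y[i]) for i in range(N)])  # h_t(x_i, y_i)
        r = float(np.dot(D, conf))
        if r <= c:                                          # stopping criterion
            break
        a = np.log((1 - c) * r / (c * (1 - r)))             # alpha_t
        D = D * np.exp(-a * (conf - c))                     # reweight
        D = D / D.sum()                                     # normalize (Z_t)
        alphas.append(a)
        hs.append(h)

    def H(x):                                               # final classifier
        scores = [sum(a * h(x, k) for a, h in zip(alphas, hs))
                  for k in range(n_classes)]
        return int(np.argmax(scores))

    return H
```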

60 Using (15) together with (14) we get $\mathrm{mxerr} \leq \prod_t Z_t$. [sent-251, score-0.268]

61 Now we use the convexity of $e^{-\alpha_t (h_t(x_i, y_i) - c)}$ for $h_t(x_i, y_i)$ between 0 and 1 and the definition $r_t := \sum_i D_t(i)\, h_t(x_i, y_i)$ and get $\mathrm{mxerr} \leq \prod_t \sum_i D_t(i) \big[ h_t(x_i, y_i)\, e^{-\alpha_t (1-c)} + (1 - h_t(x_i, y_i))\, e^{\alpha_t c} \big] = \prod_t \big[ r_t\, e^{-\alpha_t (1-c)} + (1 - r_t)\, e^{\alpha_t c} \big]$. [sent-254, score-2.053]

62 (iii) First bound on mxerr: To get the bound on mxerr we substitute our choice $\alpha_t = \ln \frac{(1-c)\, r_t}{c\, (1 - r_t)}$ in (17) and get $\mathrm{mxerr} \leq \prod_t \left( \frac{(1-c)\, r_t}{c\, (1 - r_t)} \right)^{c} \sum_i D_t(i) \left( \frac{c\, (1 - r_t)}{(1-c)\, r_t} \right)^{h_t(x_i, y_i)}$ (18). Now we bound the term $\left( \frac{c\, (1 - r_t)}{(1-c)\, r_t} \right)^{h_t(x_i, y_i)}$ [sent-256, score-1.875]

63 by use of the inequality $x^a \leq 1 - a + ax$ for $x \geq 0$ and $a \in [0, 1]$, which comes from the convexity of $x^a$ for $a$ between 0 and 1, and get $\left( \frac{c\, (1 - r_t)}{(1-c)\, r_t} \right)^{h_t(x_i, y_i)} \leq 1 - h_t(x_i, y_i) + h_t(x_i, y_i)\, \frac{c\, (1 - r_t)}{(1-c)\, r_t}$. [sent-257, score-1.558]

64 Substitution in (18) and simplifications lead to $\mathrm{mxerr} \leq \prod_t \frac{r_t^{c}\, (1 - r_t)^{1-c}}{c^{c}\, (1 - c)^{1-c}}$ (19). [sent-258, score-0.431]

65 The factors of this bound are symmetric around $r_t = c$ and take their maximum of 1 there. [sent-259, score-0.245]

66 Therefore, if $r_t > c$ holds, the bound on mxerr decreases. [sent-260, score-0.447]

67 (iv) Exponential decrease of mxerr: To prove the second bound we set $r_t = c + \delta$ with $\delta \in (0, 1 - c)$ and rewrite (19) as $\mathrm{mxerr} \leq \prod_t \left( 1 - \frac{\delta}{1-c} \right)^{1-c} \left( 1 + \frac{\delta}{c} \right)^{c}$. [sent-261, score-0.464]

68 Using $1 + x \leq e^x$ for $x \leq 0$ leads to $\mathrm{mxerr} \leq e^{-\delta^2 T}$. [sent-265, score-0.202]

69 Theorem 3: If all base classifiers $h_t$ with $h_t(x, y) \in [0, 1]$ fulfill $r_t := \sum_i D_t(i)\, h_t(x_i, y_i) \geq c + \delta$ for $\delta \in (0, 1 - c)$ (and the condition $c \in (0, 1)$), then the maxlabel error of the training set for the algorithm in Figure 4 fulfills $\mathrm{mxerr} \leq \prod_t \frac{r_t^{c}\, (1 - r_t)^{1-c}}{c^{c}\, (1 - c)^{1-c}} \leq e^{-\delta^2 T}$. [sent-267, score-1.743]

70 Choice of $c$ for BoostMA: since we use confidence-rated base classification algorithms, we choose the training accuracy of the confidence-rated uninformative rule for $c$, which leads to $c = \frac{1}{N} \sum_{i=1}^{N} \frac{N_{y_i}}{N} = \frac{1}{N} \sum_{y \in Y} \sum_{i:\, y_i = y} \frac{N_y}{N} = \sum_{y \in Y} \left( \frac{N_y}{N} \right)^{2}$. [sent-269, score-0.258]
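
In code, this choice of c is just the sum of squared label frequencies (sketch; function name is mine):

```python
import numpy as np

def boostma_c(y_train):
    """c = sum_y (N_y / N)^2: training accuracy of the maxlabel rule."""
    _, counts = np.unique(y_train, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))

# Three balanced classes give c = 1/3, matching the GrPloss threshold 1/|Y|.
print(boostma_c(np.array([0, 0, 1, 1, 2, 2])))   # 0.333...
```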

71 From $\sum_{y \neq k} f(x, y) = \sum_{y \neq k} \sum_t \alpha_t h_t(x, y) = \sum_t \alpha_t (1 - h_t(x, k)) = \sum_t \alpha_t - f(x, k)$ we get $f(x, k) < \frac{1}{|Y|-1} \sum_{y \neq k} f(x, y) \;\Leftrightarrow\; \frac{f(x, k)}{\sum_t \alpha_t} < \frac{1}{|Y|}$ (22). [sent-272, score-0.712]

72 That means that if we choose $c = 1/|Y|$ for BoostMA, the maxlabel error is the same as the pseudo-loss error. [sent-273, score-0.23]

73 Summarizing these two results, we can say that for base classifiers with the normalization property, the choice (21) of $c$ for BoostMA, and data sets with balanced labels, the two algorithms GrPloss and BoostMA and their error measures are the same. [sent-276, score-0.25]

74 In contrast to GrPloss, the algorithm does not change when the base classifier additionally fulfills the normalization property (10), because the algorithm only uses $h_t(x_i, y_i)$. [sent-278, score-0.53]

75 We also set a maximum number of 2000 boosting rounds to stop the algorithm if the stopping criterion is not satisfied. [sent-293, score-0.227]

76 We have three different bounds on the pseudo-loss error of GrPloss: the term $\prod_t Z_t$ (24), which was derived in the first part of the proof of Theorem 2, the tighter bound (9) of Theorem 2, and the bound (13) for the special case of decision stumps as base classifiers. [sent-303, score-0.379]

77 In Section 3, we derived two upper bounds on the maxlabel error for BoostMA: term (24) and the tighter bound (20) of Theorem 3. [sent-304, score-0.309]

78 M2, the bounds on the pseudo-loss error of GrPloss and the maxlabel error of BoostMA are below 1 for all data sets and all boosting rounds. [sent-309, score-0.401]

79 As expected, bound (13) on the pseudo-loss error, derived for the special case of decision stumps as base classifiers, is smaller than bound (9) of Theorem 2, which does not use the normalization property (10) of the decision stumps. [sent-312, score-0.389]

80 For BoostMA, term (24) is a bound on the maxlabel error and for GrPloss term (24) is a bound on the pseudo-loss error. [sent-314, score-0.302]

81 For unbalanced data sets, the maxlabel error is a “harder” error measure than the pseudo-loss error, so for these data sets bound (24) is higher for BoostMA than for GrPloss. [sent-315, score-0.379]

82 For balanced data sets the maxlabel error and the pseudo-loss error are the same. [sent-316, score-0.336]

83 For the subsequent comparisons we take … [remainder is Table 2 residue: database column — car*, digitbreiman, letter, nursery*, optdigits, pendigits, satimage*, segmentation, vehicle, vowel, waveform, yeast*]. [sent-324, score-0.543]

84 [Table 2 residue: GrPloss pseudo-loss error [%] columns plerr, BD24, BD13, BD9.] [sent-347, score-0.179]

85 Table 2: Performance measures and their bounds in percent at the boosting round with minimal training error. [sent-412, score-0.183]

86 trerr, BD23: training error of AdaBoost and its bound (23); plerr, BD24, BD13, BD9: pseudo-loss error of GrPloss and its bounds (24), (13) and (9); mxerr, BD24, BD20: maxlabel error of BoostMA and its bounds (24) and (20). [sent-413, score-0.407]

87 all error rates at the boosting round with minimal training error rate as was done by Eibl and Pfeiffer (2002). [sent-414, score-0.261]

88 Only for the data sets digitbreiman and yeast do the training and the test error favor different algorithms (Table 4). [sent-421, score-0.176]

89 The reason for this might be the fact that bound (13) on the pseudo-loss error of GrPloss is tighter than bound (23) on the training error of AdaBoost.M2. [sent-428, score-0.202]

90 We wish to make further … [remainder is Table 3 residue: database column — car*, digitbreiman, letter, nursery*, optdigits, pendigits, satimage*, segmentation, vehicle, vowel, waveform, yeast*; AdaM2 column]. [sent-440, score-0.543]

91 Table 3: Training and test error at the boosting round with minimal training error; bold and italic numbers correspond to high (>5%) and medium (>1.5%) differences to the smallest of the three error rates. [sent-502, score-0.203]

92 [Table 4 residue: GrPloss vs. AdaM2 — database column (car*, digitbreiman, letter, nursery*, optdigits, pendigits, satimage*, segmentation, vehicle, vowel, waveform, yeast*) and trerr column with +/o/− entries, total 4-2-6.] [sent-503, score-0.621]

93 [Table 4 residue: testerr, plerr, speed columns with +/o/− entries; column totals 4-2-6, 8-4-0, 10-0-2.] Table 4: Comparison of GrPloss with AdaBoost.M2. [sent-504, score-0.179]

94 To be more precise, we look at the number of boosting rounds needed to achieve 90% of the total decrease of the training error rate. [sent-550, score-0.254]
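
As a sketch, this convergence-speed measure can be computed from a training-error curve as follows (function name and example numbers are mine):

```python
import numpy as np

def rounds_to_90pct_decrease(train_errors):
    """Boosting rounds needed to achieve 90% of the total decrease of the
    training error rate; train_errors[t] is the error after round t+1."""
    e = np.asarray(train_errors, dtype=float)
    target = e[0] - 0.9 * (e[0] - e.min())   # error level after 90% of the drop
    return int(np.argmax(e <= target)) + 1    # first round reaching that level

print(rounds_to_90pct_decrease([0.40, 0.30, 0.22, 0.15, 0.12, 0.11, 0.11]))  # -> 5
```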

95 AdaBoost.M2 needs more boosting rounds than GrPloss, so GrPloss seems to lead to a faster decrease in the training error rate (Table 4). [sent-552, score-0.274]

96 Besides the number of boosting rounds, the time for the algorithm is also heavily influenced by the time needed to construct a base classifier. [sent-553, score-0.229]

97 The minimization of the pseudo-loss required for AdaBoost.M2 is not as straightforward as the maximization of $r_t$ required for GrPloss and BoostMA. [sent-556, score-0.209]

98 [Table residue (unbalanced data sets): GrPloss vs. … — database column car*, nursery*, satimage*, yeast*; trerr column all +, total 4-0-0.] [sent-558, score-0.243]

99 Conclusion: We proposed two new algorithms, GrPloss and BoostMA, for multiclass problems with weak base classifiers. [sent-560, score-0.217]

100 The algorithms are designed to minimize the pseudo-loss error and the maxlabel error, respectively. [sent-561, score-0.292]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('grploss', 0.525), ('ht', 0.344), ('boostma', 0.323), ('dt', 0.241), ('zt', 0.237), ('rt', 0.209), ('mxerr', 0.202), ('maxlabel', 0.192), ('ft', 0.147), ('plerr', 0.141), ('boosting', 0.115), ('base', 0.114), ('feiffer', 0.111), ('ibl', 0.111), ('eak', 0.101), ('stumps', 0.089), ('oosting', 0.085), ('lassifiers', 0.085), ('ulticlass', 0.085), ('unbalanced', 0.075), ('ut', 0.073), ('eibl', 0.071), ('vy', 0.068), ('balanced', 0.068), ('multiclass', 0.062), ('ct', 0.062), ('digitbreiman', 0.061), ('satimage', 0.061), ('nursery', 0.059), ('rounds', 0.055), ('descent', 0.051), ('optdigits', 0.051), ('pendigits', 0.051), ('vowel', 0.051), ('yeast', 0.048), ('uninformative', 0.048), ('ers', 0.045), ('er', 0.043), ('yi', 0.042), ('weak', 0.041), ('classi', 0.041), ('karl', 0.04), ('mason', 0.04), ('pfeiffer', 0.04), ('trerr', 0.04), ('stopping', 0.038), ('error', 0.038), ('arg', 0.038), ('proportions', 0.038), ('vehicle', 0.038), ('schapire', 0.036), ('bound', 0.036), ('car', 0.035), ('gradient', 0.035), ('waveform', 0.034), ('freund', 0.031), ('nther', 0.03), ('normalization', 0.03), ('xi', 0.03), ('training', 0.029), ('segmentation', 0.028), ('adaboost', 0.027), ('lls', 0.027), ('ful', 0.027), ('letter', 0.026), ('goto', 0.025), ('tighter', 0.025), ('rule', 0.025), ('robert', 0.024), ('minimize', 0.024), ('get', 0.024), ('decision', 0.023), ('update', 0.023), ('database', 0.022), ('round', 0.021), ('sampling', 0.021), ('peter', 0.021), ('yoav', 0.021), ('dietterrich', 0.02), ('ekvy', 0.02), ('guruswami', 0.02), ('innsbruck', 0.02), ('llew', 0.02), ('rtc', 0.02), ('takashi', 0.02), ('testerr', 0.02), ('uibk', 0.02), ('rate', 0.02), ('ny', 0.02), ('ensemble', 0.02), ('exponential', 0.019), ('loss', 0.019), ('label', 0.019), ('stop', 0.019), ('speed', 0.018), ('dence', 0.018), ('bounds', 0.018), ('databases', 0.018), ('justify', 0.018), ('decrease', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

2 0.11742266 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

3 0.076274075 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

4 0.059519276 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.

5 0.042407941 67 jmlr-2005-Stability of Randomized Learning Algorithms

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.

6 0.038414031 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

7 0.038153578 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

8 0.03471246 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

9 0.034532193 32 jmlr-2005-Expectation Consistent Approximate Inference

10 0.034279261 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

11 0.03178063 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems

12 0.031413361 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

13 0.031186333 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

14 0.030863456 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

15 0.030827845 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

16 0.029107088 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

17 0.027271666 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

18 0.025769733 64 jmlr-2005-Semigroup Kernels on Measures

19 0.024312766 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

20 0.023594363 54 jmlr-2005-Managing Diversity in Regression Ensembles


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, -0.013), (2, 0.151), (3, 0.029), (4, -0.238), (5, -0.194), (6, 0.081), (7, 0.008), (8, 0.081), (9, 0.202), (10, -0.123), (11, -0.06), (12, 0.036), (13, -0.168), (14, 0.114), (15, -0.114), (16, -0.153), (17, -0.256), (18, -0.03), (19, 0.189), (20, -0.215), (21, -0.075), (22, 0.005), (23, 0.122), (24, 0.038), (25, 0.107), (26, -0.075), (27, 0.067), (28, -0.031), (29, -0.005), (30, 0.031), (31, 0.209), (32, 0.209), (33, -0.036), (34, -0.054), (35, 0.103), (36, -0.109), (37, -0.152), (38, -0.049), (39, -0.047), (40, 0.044), (41, 0.073), (42, 0.056), (43, -0.051), (44, -0.07), (45, 0.088), (46, -0.013), (47, -0.158), (48, 0.239), (49, -0.102)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9595567 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

2 0.43607891 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

3 0.28542063 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

4 0.21426433 67 jmlr-2005-Stability of Randomized Learning Algorithms

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.

5 0.213709 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

Author: John Langford

Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds

6 0.16391346 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

7 0.15156125 3 jmlr-2005-A Classification Framework for Anomaly Detection

8 0.14850503 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

9 0.13276909 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

10 0.12119871 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems

11 0.12049061 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

12 0.11551823 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

13 0.11499064 32 jmlr-2005-Expectation Consistent Approximate Inference

14 0.11353745 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

15 0.11126576 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

16 0.099484287 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

17 0.098629437 61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers

18 0.096188523 64 jmlr-2005-Semigroup Kernels on Measures

19 0.093192719 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

20 0.092667609 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.019), (19, 0.021), (36, 0.017), (37, 0.018), (42, 0.012), (43, 0.02), (47, 0.023), (52, 0.059), (59, 0.015), (70, 0.023), (83, 0.475), (88, 0.08), (90, 0.043), (94, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70496815 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

2 0.27239922 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

Author: Fabio Aiolli, Alessandro Sperduti

Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers

3 0.27231222 64 jmlr-2005-Semigroup Kernels on Measures

Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert

Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space

4 0.27072799 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

5 0.27072698 5 jmlr-2005-A Generalization Error for Q-Learning

Author: Susan A. Murphy

Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data

6 0.23960839 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

7 0.238208 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

8 0.23663004 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

9 0.2350696 11 jmlr-2005-Algorithmic Stability and Meta-Learning

10 0.23219898 41 jmlr-2005-Kernel Methods for Measuring Independence

11 0.22818008 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

12 0.22736506 67 jmlr-2005-Stability of Randomized Learning Algorithms

13 0.22631435 48 jmlr-2005-Learning the Kernel Function via Regularization

14 0.22498511 39 jmlr-2005-Information Bottleneck for Gaussian Variables

15 0.22461587 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

16 0.22436768 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

17 0.22424406 3 jmlr-2005-A Classification Framework for Anomaly Detection

18 0.22388452 71 jmlr-2005-Variational Message Passing

19 0.22381108 36 jmlr-2005-Gaussian Processes for Ordinal Regression

20 0.22244951 20 jmlr-2005-Clustering with Bregman Divergences