jmlr jmlr2013 jmlr2013-8 knowledge-graph by maker-knowledge-mining

8 jmlr-2013-A Theory of Multiclass Boosting


Source: pdf

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. [sent-7, score-0.704]

2 In particular, the exact requirements on the weak classifiers in this setting are known: any algorithm that predicts better than random on any distribution over the training set is said to satisfy the weak learning assumption. [sent-13, score-0.622]

3 In particular, we do not know the “correct” way to define the requirements on the weak classifiers, nor has the notion of optimal boosting been explored in the multiclass setting. [sent-18, score-0.704]

4 Requiring less error than random guessing on every distribution, as in the binary case, turns out to be too weak for boosting to be possible when there are more than two labels. [sent-20, score-0.559]

5 In this paper, we create a broad and general framework for studying multiclass boosting that formalizes the interaction between the boosting algorithm and the weak-learner. [sent-29, score-0.619]

6 Unlike much, but not all, of the previous work on multiclass boosting, we focus specifically on the most natural, and perhaps weakest, case in which the weak classifiers are genuine classifiers in the sense of predicting a single multiclass label for each instance. [sent-30, score-0.618]

7 (2009) is too weak in the sense that even when the condition is satisfied, no boosting algorithm can guarantee to drive down the training error. [sent-47, score-0.637]

8 We also analyze the optimal boosting strategy when using the minimal weak learning condition, and this poses additional challenges. [sent-58, score-0.638]

9 The condition therefore states that a weak classifier should not exceed the average cost when weighted according to baseline B. [sent-164, score-0.515]

10 By studying this vast class of weak-learning conditions, we hope to find the one that will serve the main purpose of the boosting game: finding a convex combination of weak classifiers that has zero training error. [sent-169, score-0.56]

11 Formally, a collection H of weak classifiers is boostable if it is eligible for boosting in the sense that there exists a weighting λ on the votes forming a distribution that linearly separates the data: ∀i : argmax_{l∈{1,…,k}} (Hλ)(i, l) = y_i. [sent-171, score-0.626]
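The linear-separation reading of boostability can be checked directly for a small toy setting. The sketch below is illustrative only (the data, function name, and tie-breaking convention are assumptions, not from the paper): it accumulates the weighted votes of each weak classifier and tests whether the correct label wins the argmax on every training example.

```python
import numpy as np

def is_perfectly_classified(votes, weights, labels):
    """votes: (n_classifiers, n_examples) array of predicted labels.
    weights: weighting lambda over the weak classifiers.
    labels: (n_examples,) true labels.
    True iff the weighted-vote argmax equals the true label on every example."""
    n_examples = votes.shape[1]
    k = int(votes.max()) + 1
    score = np.zeros((n_examples, k))
    for h_votes, w in zip(votes, weights):
        # each classifier adds its weight w to the label it predicts per example
        score[np.arange(n_examples), h_votes] += w
    return bool(np.all(score.argmax(axis=1) == labels))

# Two weak classifiers on a 3-example, 3-class toy set.
votes = np.array([[0, 1, 2],
                  [0, 1, 1]])
labels = np.array([0, 1, 2])
print(is_perfectly_classified(votes, np.array([0.6, 0.4]), labels))  # True
print(is_perfectly_classified(votes, np.array([0.4, 0.6]), labels))  # False
```

Note that whether the collection is boostable here depends on the weighting: with λ = (0.6, 0.4) the correct label wins everywhere, while λ = (0.4, 0.6) lets label 1 outvote the true label 2 on the third example.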

12 Ideally, the second factor will not cause the weak-learning condition to impose additional restrictions on the weak classifiers; in that case, the weak-learning condition is merely a reformulation of being boostable that is more appropriate for deriving an algorithm. [sent-177, score-0.543]

13 Old Conditions In this section, we rewrite, in the language of our framework, the weak learning conditions explicitly or implicitly employed in the multiclass boosting algorithms SAMME (Zhu et al. [sent-184, score-0.721]

14 The implicit weak learning condition requires that for any matrix with non-negative entries d(i, l), the weak-hypothesis should achieve 1/2 + γ accuracy: ∑_{i=1}^m ( 1[h(x_i) = y_i] d(i, y_i) + ∑_{l≠y_i} 1[h(x_i) = l] d(i, l) ) ≤ (1/2 − γ/2) ∑_{i=1}^m ∑_{l=1}^k d(i, l). [sent-236, score-0.5]
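Since the left-hand side of the condition above just charges the hypothesis the entry d(i, h(x_i)) for whichever label it predicts, it can be checked in a few lines. This is a sketch under that reconstructed reading (the function name and data are illustrative assumptions):

```python
import numpy as np

def satisfies_implicit_condition(pred, y, d, gamma):
    """Check the (reconstructed) implicit weak learning condition: the total
    cost d(i, h(x_i)) paid by the hypothesis is at most (1/2 - gamma/2)
    times the sum of all entries of the non-negative matrix d."""
    m = len(y)
    cost = d[np.arange(m), pred].sum()  # cost of the label h actually predicts
    return bool(cost <= (0.5 - gamma / 2) * d.sum())

d = np.ones((4, 3))          # uniform cost matrix: m = 4 examples, k = 3 labels
y = np.array([0, 1, 2, 0])
perfect = y.copy()           # hypothesis predicting every label correctly
print(satisfies_implicit_condition(perfect, y, d, gamma=0.1))  # True: 4 <= 0.45 * 12
```

A hypothesis concentrating its predictions exactly where d puts its mass would fail the same check, which is how the condition penalizes costly mistakes.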

15 In particular, define C^eor to be the multiclass extension of C^bin: any cost-matrix in C^eor should put the least cost on the correct label, that is, the rows of the cost-matrices should come from the set {c ∈ R^k : ∀l, c(1) ≤ c(l)}. [sent-307, score-1.001]

16 Then, for every baseline B ∈ B^eor_γ, we introduce the condition (C^eor, B), which we call an edge-over-random weak-learning condition. [sent-308, score-0.911]

17 Theorem 3 (Sufficiency) If a weak classifier space H satisfies a weak-learning condition (C^eor, B), for some B ∈ B^eor_γ, then H is boostable. [sent-314, score-1.166]

18 Applying Theorem 1 yields 0 ≥ max_{C∈C^eor} min_{h∈H} C • (1_h − B) = min_{λ∈∆(H)} max_{C∈C^eor} C • (Hλ − B), where the first inequality follows from the definition (2) of the weak-learning condition. [sent-316, score-0.906]

19 Therefore, the convex combination of the weak classifiers, obtained by choosing each weak classifier with weight given by λ∗ , perfectly classifies the training data, in fact with a margin γ. [sent-320, score-0.652]

20 Theorem 4 (Relaxed necessity) For every boostable weak classifier space H, there exists a γ > 0 and B ∈ B^eor_γ such that H satisfies the weak-learning condition (C^eor, B). [sent-322, score-1.256]

21 Then B ∈ B^eor_γ, and max_{C∈C^eor} min_{h∈H} C • (1_h − B) ≤ min_{λ∈∆(H)} max_{C∈C^eor} C • (Hλ − B) ≤ max_{C∈C^eor} C • (Hλ* − B) = 0, where the equality follows since by definition Hλ* − B = 0. [sent-327, score-1.728]

22 That the max-min expression is at most zero is another way of saying that H satisfies the weak-learning condition (C^eor, B) as in (2). [sent-328, score-0.472]

23 Theorem 4 states that any boostable weak classifier space will satisfy some condition in our family, but it does not help us choose the right condition. [sent-329, score-0.495]

24 Experiments in Section 10 suggest (C^eor, U_γ) is effective with very simple weak-learners compared to popular boosting algorithms. [sent-330, score-0.632]

25 Theorem 5 For any B ∈ B^eor_γ, there exists a boostable space H that fails to satisfy the condition (C^eor, B). [sent-333, score-0.957]

26 Proof We provide, for any γ > 0 and edge-over-random baseline B ∈ B^eor_γ, a data set and weak classifier space that is boostable but fails to satisfy the condition (C^eor, B). [sent-334, score-1.3]

27 We want the following symmetries in our weak classifiers: • Each weak classifier correctly classifies ⌊m(1/2 + γ′ )⌋ examples and misclassifies the rest. [sent-340, score-0.598]

28 So the cost of B on the chosen cost-matrix is at most m(1/k − γ/k), which is less than the cost ⌈m(1/2 − γ′ )⌉ ≥ m(1/2 − γ/k) of any weak classifier whenever the number of labels k is more than two. [sent-362, score-0.472]

29 Hence our boostable space of weak classifiers fails to satisfy (C^eor, B). [sent-363, score-0.784]

30 The kind of prior knowledge required to make this guess correctly is provided by Theorem 3: the appropriate weak learning condition is determined by the distribution of votes on the labels for each example that a target weak classifier combination might be able to get. [sent-366, score-0.716]

31 The Minimal Weak Learning Condition A perhaps extreme way of weakening the condition is by requiring the performance on a cost matrix to be competitive not with a fixed baseline B ∈ B^eor_γ, but with the worst of them: ∀C ∈ C^eor, ∃h ∈ H : C • 1_h ≤ max_{B∈B^eor_γ} C • B. (13) [sent-370, score-1.009]

32 Condition (13) states that during the course of the same boosting game, Weak-Learner may choose to beat any edge-over-random baseline B ∈ B^eor_γ, possibly a different one for every round and every cost-matrix. [sent-371, score-1.152]

33 Proof We will show the following three conditions are equivalent: (A) H is boostable; (B) ∃γ > 0 such that ∀C ∈ C^eor, ∃h ∈ H : C • 1_h ≤ max_{B∈B^eor_γ} C • B; (C) ∃γ > 0 such that ∀C ∈ C^MR, ∃h ∈ H : C • 1_h ≤ C • B^MR. [sent-384, score-0.952]

34 Figure 1: A weak classifier space which satisfies SAMME's weak learning condition but is not boostable. [sent-401, score-0.675]

35 In particular, when the cost matrix C ∈ C^eor assigns example a the row (−1, +1, 0) and example b the row (+1, −1, 0) over the three labels, both classifiers in the above example suffer more loss than the random player U_γ, and fail to satisfy our condition. [sent-414, score-0.547]

36 When used in our framework, where the weak classifiers return only a single multiclass prediction per example, the implicit demands made by AdaBoost. [sent-420, score-0.476]

37 Instead, we construct a classifier space that satisfies the condition (C^eor, U_γ) in our family, but cannot satisfy AdaBoost. [sent-423, score-0.472]

38 Note that this does not imply that the conditions are too strong when used with more powerful weak classifiers that return multilabel multiclass predictions. [sent-425, score-0.51]

39 In such cases, a more appropriate definition is the weighted state f_t ∈ R^k, tracking the weighted counts of votes received so far: f_t(l) = ∑_{t'=1}^{t−1} α_{t'} 1[h_{t'}(x) = l]. [sent-483, score-0.471]
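The weighted state is just a per-label tally of weighted votes for a single example; a minimal sketch (function name and example data are illustrative assumptions):

```python
import numpy as np

def weighted_state(alphas, preds, k):
    """f(l) = sum over past rounds t' of alpha_{t'} * 1[h_{t'}(x) = l]:
    the weighted count of votes each of the k labels has received so far,
    for one example x."""
    f = np.zeros(k)
    for alpha, label in zip(alphas, preds):
        f[label] += alpha  # the round-t' classifier adds its weight to its prediction
    return f

# Three rounds: predictions 2, 0, 2 with weights 1.0, 0.5, 2.0.
f = weighted_state([1.0, 0.5, 2.0], [2, 0, 2], k=3)
print(f)  # label 0 has weight 0.5, label 2 has weight 3.0
```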

40 When the weak learning condition being used is (C , B), Schapire (2001) proposed a Booster strategy, called the OS strategy, which always chooses the weight αt = 1, and uses the potential functions to construct a cost matrix Ct as follows. [sent-519, score-0.555]

41 Solving for Any Fixed Edge-over-random Condition In this section we show how to implement the OS strategy when the weak learning condition is any fixed edge-over-random condition: (C^eor, B) for some B ∈ B^eor_γ. [sent-534, score-0.808]

42 Then, we have φ_t^b(s) = max_{p∈∆{1,…,k}} min_{c∈C_0^eor} ( E_{l∼p}[φ_{t−1}(s + e_l)] + ⟨b − p, c⟩ ). [sent-556, score-0.714]

43 Unless b(1) − p(1) ≤ 0 and b(l) − p(l) ≥ 0 for each l > 1, the quantity ⟨b − p, c⟩ can be made arbitrarily small for appropriate choices of c ∈ C_0^eor. [sent-562, score-0.79]

44 Then notice that the weighted state f_t of the examples, defined in (19), is related to the unweighted states s_t as f_t(l) = η s_t(l). [sent-586, score-0.579]

45 Therefore the exponential loss function in (32) directly measures the loss of the weighted state as L^exp(f_t) = ∑_{l=2}^k e^{f_t(l) − f_t(1)}. [sent-587, score-0.619]
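The multiclass exponential loss of a weighted state is a one-liner. In the sketch below the correct label sits at index 0 (the paper indexes it as label 1); the function name is an illustrative assumption:

```python
import numpy as np

def exp_loss(f):
    """Multiclass exponential loss of a weighted state f, where f[0] is the
    weighted vote of the correct label: sum over wrong labels l of
    exp(f[l] - f[0])."""
    f = np.asarray(f, dtype=float)
    return float(np.exp(f[1:] - f[0]).sum())

print(exp_loss([0.0, 0.0, 0.0]))  # 2.0: with no votes yet, each wrong label contributes 1
print(exp_loss([3.0, 1.0, 0.0]))  # smaller: the correct label has pulled ahead
```

The loss shrinks as the correct label's weighted vote count grows relative to the wrong labels', matching its role as a surrogate for the training error.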

46 In particular, when B = U_γ (so that the condition is (C^eor, U_γ)), the relevant potential φ_t(s) or φ_t(f) is given by φ_t(s) = φ_t(f) = κ(γ, η)^t ∑_{l=2}^k e^{η(s_l − s_1)} = κ(γ, η)^t ∑_{l=2}^k e^{f_l − f_1} (34), where κ(γ, η) = 1 + (1 − γ)(e^η + e^{−η} − 2) − (1 − e^{−η})γ. [sent-595, score-0.526]

47 To resolve this, we first show that the optimal boosting strategy based on any formulation of a necessary and sufficient weak learning condition is the same. [sent-648, score-0.673]

48 Game-theoretic Equivalence of Necessary and Sufficient Weak-learning Conditions In this section we study the effect of the weak learning condition on the game-theoretically optimal boosting strategy. [sent-679, score-0.636]

49 We introduce the notion of game-theoretic equivalence between two weak learning conditions, which determines whether the payoffs (15) of the optimal boosting strategies based on the two conditions are identical. [sent-680, score-0.682]

50 Thus, a weak learning condition is essentially a family of subsets of the weak classifier space. [sent-691, score-0.675]

51 Modified Potentials and OS Strategy The condition in (13) is not stated as a single pair (C^eor, B), but uses all possible edge-over-random baselines B ∈ B^eor_γ. [sent-723, score-0.867]

52 Using these, define new potentials φ_t(s) as follows: φ_t(s) = max_{b∈∆^k_γ} max_{p∈∆{1,…,k}} min_{c∈C_0^eor} ( E_{l∼p}[φ_{t−1}(s + e_l)] + ⟨b − p, c⟩ ). (38) [sent-726, score-0.564]

53 The main difference between (38) and (17) is that while the older potentials were defined using a fixed vector b corresponding to some row in the fixed baseline B, the new definition takes the maximum over all possible rows b ∈ ∆^k_γ that an edge-over-random baseline B ∈ B^eor_γ may have. [sent-732, score-0.562]

54 (39), where recall that s_t(i) denotes the state vector (defined in (18)) of example i. [sent-736, score-0.497]

55 Then the recurrence (40) may be simplified as follows: φ_t(s) = max_{b∈∆^k_γ} E_{l∼b}[φ_{t−1}(s + e_l)] = max_a E_{l∼b_π^a}[φ_{t−1}(s + e_l)]. [sent-786, score-0.674]

56 In other words, at a given state s with t rounds of boosting remaining, the parameter a determines the number of directions the optimal random walk will consider taking; we will therefore refer to a as the degree of the random walk given (s,t). [sent-819, score-0.536]

57 Therefore, in this situation, the potential φ_t using the minimal weak learning condition is the same as the potential φ_t^u using the γ-biased uniform distribution u = ((1−γ)/k + γ, (1−γ)/k, …, (1−γ)/k). [sent-931, score-0.526]
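Under the drifting-games reading that, for a fixed row b, the potential equals the expected final loss of a b-driven random walk on the state, it can be computed by dynamic programming. The sketch below does exactly that for the γ-biased uniform distribution; the function names, the walk-based reading, and the 0-1 loss used are illustrative assumptions:

```python
from functools import lru_cache

def u_biased(k, gamma):
    """gamma-biased uniform distribution over k labels: (1 - gamma)/k mass on
    every label, plus an extra gamma on the correct label (index 0 here)."""
    u = [(1.0 - gamma) / k] * k
    u[0] += gamma
    return u

def potential(t, s, k, gamma, loss):
    """phi_t(s): expected final loss after t more steps of the random walk
    that moves the state s by e_l with probability u_biased(k, gamma)[l]."""
    u = u_biased(k, gamma)

    @lru_cache(maxsize=None)
    def phi(t, s):
        if t == 0:
            return loss(s)
        # expectation of the (t-1)-step potential over the next step's label
        return sum(p * phi(t - 1, s[:l] + (s[l] + 1,) + s[l + 1:])
                   for l, p in enumerate(u))

    return phi(t, tuple(s))

# 0-1 loss: error if some wrong label's vote count ties or beats the correct one.
err = lambda s: 1.0 if max(s[1:]) >= s[0] else 0.0
print(potential(5, (0, 0, 0), k=3, gamma=0.3, loss=err))
```

Increasing γ tilts the walk toward the correct label, so the potential (the predicted error probability) decreases, which is the role the edge plays throughout.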

58 In order to show (49), by Lemma 18, it suffices to show that the optimal degree a maximizing the right hand side of (44) is always k: E_{l∼b_π^a}[φ_{t−1}(s + e_l)] ≤ E_{l∼b_π^k}[φ_{t−1}(s + e_l)]. [sent-941, score-0.575]

59 The above lemma seems to suggest that under certain conditions, a sort of degeneracy occurs, and the optimal Booster payoff (15) is nearly unaffected by whether we use the minimal weak learning condition, or the condition (C^eor, U_γ). [sent-961, score-0.866]

60 Consider the minimal weak learning condition (13), and the fixed edge-over-random condition (C^eor, U_γ) corresponding to the γ-biased uniform baseline U_γ. [sent-964, score-0.934]

61 Lemma 19 states that the potential function using the minimal weak learning condition is the same as when using the fixed condition (C^eor, U_γ): φ_t = φ_t^u, where u is as in (47). [sent-974, score-0.973]

62 In order to show that the constraints are the same, we therefore need to show that for the common cost matrix C_t chosen, the right-hand sides of the two previous expressions are the same: C_t • U_γ = max_{B∈B^eor_γ} C_t • B. [sent-978, score-0.888]

63 Therefore, the constraints placed on Weak-Learner by the cost-matrix C_t are the same whether we use the minimal weak learning condition or the fixed condition (C^eor, U_γ). [sent-982, score-0.848]

64 Proof Comparing solutions (45) and (31) to the potentials corresponding to the minimal weak learning condition and a fixed edge-over-random condition, we may conclude that the loss bound φ_T(0) in the former case is larger than φ_T^b(0), for any edge-over-random distribution b ∈ ∆^k_γ. [sent-995, score-0.551]

65 Therefore, Booster’s demands on the weak classifiers returned by Weak Learner should be minimal, and it should send the weak learning algorithm the “easiest” cost matrices that will ensure boostability. [sent-1017, score-0.721]

66 In turn, Weak Learner may only assume a very weak Booster strategy, and therefore return a weak classifier that performs as well as possible with respect to the cost matrix sent by Booster. [sent-1018, score-0.664]

67 This way the boosting algorithm remains robust to a poorly performing Weak Learner, and yet can make use of a powerful weak learning algorithm whenever possible. [sent-1026, score-0.536]

68 But then the loss function satisfies the conditions in Lemma 19, and by Theorem 20, the game-theoretically optimal strategy remains the same whether we use the minimal condition or (C^eor, U_γ). [sent-1036, score-0.719]

69 Further, when using the condition (C^eor, U_γ), the average potential of the states f_t(i), according to (34), is given by the average loss (37) of the state times κ(γ, η)^{T−t}, where the function κ is defined in (35). [sent-1038, score-0.856]

70 Lemma 22 Consider the boosting game using the minimal weak learning condition (13). [sent-1045, score-0.706]

71 Adaptively Choosing Weights Once Weak Learner returns a weak classifier h_t, Booster chooses the optimum weight α_t so that the resulting states f_t = f_{t−1} + α_t 1_{h_t} are as favorable as possible, that is, so as to minimize the total potential of its states. [sent-1049, score-0.737]

72 By our previous discussions, these are proportional to the total loss given by Z_t = ∑_{i=1}^m ∑_{l=2}^k e^{f_t(i,l) − f_t(i,1)}. [sent-1050, score-0.502]

73 In fact the factor is exactly (1 − c_t) + √(c_t² − δ_t²), where c_t = (A_t^+ + A_t^−)/Z_{t−1} (58), and δ_t is the edge of the returned classifier h_t on the supplied cost-matrix C_t. [sent-1055, score-0.494]

74 The definition of the edge depends on the weak learning condition being used, and in this case we are using the minimal condition (13). [sent-1058, score-0.554]

75 However, since C_t is the optimal cost matrix when using exponential loss with a tiny value of η, we can use arguments in the proof of Theorem 20 to simplify the computation. [sent-1060, score-0.578]

76 Lemma 24 Suppose cost matrix C_t is chosen as in (56), and the returned weak classifier h_t has edge δ_t, that is, C_t • 1_{h_t} ≤ C_t • U_{δ_t}. [sent-1068, score-0.521]
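One natural way to read this edge is as the δ solving C • 1_h = C • U_δ, since both sides are linear in δ. The sketch below assumes U_δ places (1−δ)/k probability on every label plus an extra δ on the correct one; the function name and data are illustrative:

```python
import numpy as np

def edge(C, pred, y):
    """Solve C . 1_h = C . U_delta for delta, where the baseline U_delta puts
    mass (1 - delta)/k on every label plus an extra delta on the correct
    label. A larger delta means h beats a more favorably biased random player."""
    m, k = C.shape
    cost_h = C[np.arange(m), pred].sum()      # cost actually paid by h
    cost_uniform = C.sum() / k                # cost of the unbiased uniform baseline
    cost_correct = C[np.arange(m), y].sum()   # cost when the correct label is always played
    return (cost_h - cost_uniform) / (cost_correct - cost_uniform)

# Cost matrix with least cost on the correct label (label 0 for both examples).
C = np.array([[-1.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])
y = np.array([0, 0])
print(edge(C, np.array([0, 0]), y))  # 1.0: perfect predictions
print(edge(C, np.array([1, 2]), y))  # -1.0: always-wrong predictions
```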

77 MM, shown in Algorithm 1, is the optimal strategy for playing the adaptive boosting game, and is based on the minimal weak learning condition. [sent-1093, score-0.667]

78 In particular, if a weak hypothesis space is used that satisfies the optimal weak learning condition (13), for some γ, then the edge in each round is large, δ_t ≥ γ, and therefore the error after T rounds is exponentially small, (k − 1)e^{−Tγ²/2}. [sent-1098, score-0.95]
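The exponential error bound above is easy to evaluate numerically; a small sketch (the function name is an illustrative assumption):

```python
import math

def mm_error_bound(k, gamma, T):
    """Training-error bound (k - 1) * exp(-T * gamma^2 / 2) when every
    round's edge is at least gamma."""
    return (k - 1) * math.exp(-T * gamma ** 2 / 2)

# With k = 4 labels and a guaranteed edge of 0.2, the bound decays geometrically in T.
for T in (100, 300, 500):
    print(T, mm_error_bound(4, 0.2, T))
```

At T = 0 the bound is just k − 1 (every wrong label counted once), and each additional round multiplies it by e^{−γ²/2}.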

79 Both the choice of the minimal weak learning condition as well as the setup of the adaptive game framework will play crucial roles in ensuring consistency. [sent-1107, score-0.498]

80 In particular, we will require that in each round Weak Learner picks the weak classifier suffering minimum cost with respect to the cost matrix provided by the boosting algorithm, or equivalently achieves the highest edge as defined in (59). [sent-1155, score-0.779]

81 In particular, if the weak learning condition is stronger than necessary, then, even on a boostable data set where the error can be driven to zero, the boosting algorithm may get stuck prematurely because its stronger than necessary demands cannot be met by the weak classifier space. [sent-1220, score-1.034]

82 On the other hand, if the weak learning condition is too weak, then a lazy Weak Learner may satisfy the Booster’s demands by returning weak classifiers belonging only to a non-boostable subset of the available weak classifier space. [sent-1222, score-1.006]

83 In that case, any algorithm using SAMME’s weak learning condition, which is known to be too weak and satisfiable by just the two hypotheses {h1 , h2 }, would only receive h1 or h2 in each round, and therefore be unable to reach the optimum accuracy. [sent-1225, score-0.598]

84 Of course, if the Weak Learner is extremely generous and helpful, then it may return the right collection of weak classifiers even with a null weak learning condition that places no demands on it. [sent-1226, score-0.707]

85 The reason for using weak learners that optimize different cost functions with the different boosting algorithms is as follows. [sent-1260, score-0.602]

86 As predicted by our theory, our algorithm succeeds in boosting the accuracy even when the tree size is too small to meet the stronger weak learning assumptions of the other algorithms. [sent-1310, score-0.536]

87 This framework is very general and captures the weak learning conditions implicitly used by many earlier multiclass boosting algorithms as well as novel conditions, including the minimal condition under which boosting is possible. [sent-1317, score-1.077]

88 We also show how to design boosting algorithms relying on these weak learning conditions that drive down training error rapidly. [sent-1318, score-0.6]

89 1 is based upon a weak learner that may return any weak hypothesis, which is absurd from a practical viewpoint. [sent-1329, score-0.656]

90 However, designing optimal boosting algorithms separately for different kinds of weak learners, which we leave as an open problem, will lead to a much more complex theory. [sent-1330, score-0.559]

91 The final test-errors of the three algorithms after 500 rounds of boosting are plotted against the maximum tree-sizes allowed for the weak classifiers. [sent-1335, score-0.677]

92 Further, we primarily work with weak classifiers that output a single multiclass prediction per example, whereas weak hypotheses that make multilabel multiclass predictions are typically more powerful. [sent-1339, score-0.914]

93 Finally, it will be interesting to see if the notion of minimal weak learning condition can be extended to boosting settings beyond classification, such as ranking. [sent-1341, score-0.655]

94 Algorithms M1 and MH make strong demands which cannot be met by the extremely weak classifiers after a few rounds, whereas MM makes gentler demands, and is hence able to drive down error through all the rounds of boosting. [sent-1343, score-0.472]

95 The dual form of the recurrence (21) and the choice of the cost matrix C_t in (22) together ensure that for each example i, φ_{T−t}^{B(i)}(s_t(i)) = max_{l=1,…,k} { φ_{T−t−1}^{B(i)}(s_t(i) + e_l) − (C_t(i, l) − ⟨C_t(i), B(i)⟩) } ≥ φ_{T−t−1}^{B(i)}(s_t(i) + e_{h_t(x_i)}) − (C_t(i, h_t(x_i)) − ⟨C_t(i), B(i)⟩). [sent-1353, score-0.598]

96 We show that in any iteration t ≤ T , based on Booster’s choice of cost-matrix C, an adversary can choose a weak classifier ht ∈ H all such that the weak learning condition is satisfied, and the average potential does not fall by more than an amount ε/T . [sent-1358, score-0.801]

97 To satisfy (75), by (17), we may choose p_i as any optimal response of the max player in the minimax recurrence when the min player chooses C(i): p_i ∈ argmax_{p∈P_i} E_{l∼p}[φ_{t−1}^{B(i)}(s + e_l)] (76), where P_i = {p ∈ ∆{1, . . . [sent-1367, score-0.641]

98 Notice that for each example i and each sign-bit ξ ∈ {−1, +1}, we have the following relations: C(i, l_i^ξ) = E_{l∼p_i}[C(i, l)] − ξ(1 − p_i^ξ)a_i^ξ and φ_{T−t−1}^{B(i)}(s_t(i) + e_{l_i^ξ}) = E_{l∼p_i}[φ_{T−t−1}^{B(i)}(s_t(i) + e_l)] − ξ(1 − p_i^ξ)b_i^ξ. (79) [sent-1398, score-0.488]

99 For any given multiclass data set and weak classifier space, we will obtain a transformed binary data set and weak classifier space, such that the run of AdaBoost. [sent-1473, score-0.766]

100 MM produces scoring function F_α when run for T rounds with the training set S and weak classifier space H, then AdaBoost produces the corresponding scoring function when run for T rounds with the transformed training set and transformed space. [sent-1489, score-0.677]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eor', 0.395), ('weak', 0.299), ('el', 0.261), ('booster', 0.261), ('boosting', 0.237), ('ft', 0.224), ('ulticlass', 0.173), ('ct', 0.169), ('schapire', 0.167), ('chapire', 0.148), ('oosting', 0.148), ('ukherjee', 0.148), ('multiclass', 0.145), ('rounds', 0.141), ('heory', 0.123), ('riskd', 0.108), ('mh', 0.107), ('freund', 0.103), ('er', 0.1), ('bmr', 0.096), ('classi', 0.091), ('bmh', 0.09), ('boostable', 0.09), ('recurrence', 0.088), ('ers', 0.087), ('potentials', 0.079), ('st', 0.079), ('condition', 0.077), ('mr', 0.075), ('adaboost', 0.075), ('ht', 0.072), ('errd', 0.072), ('tb', 0.071), ('os', 0.068), ('cost', 0.066), ('samme', 0.066), ('yi', 0.062), ('bi', 0.061), ('boostability', 0.06), ('hopt', 0.06), ('sl', 0.06), ('pi', 0.059), ('edge', 0.059), ('learner', 0.058), ('potential', 0.054), ('loss', 0.054), ('mm', 0.053), ('round', 0.052), ('game', 0.051), ('ai', 0.048), ('games', 0.048), ('drifting', 0.046), ('baseline', 0.044), ('minimal', 0.042), ('rbt', 0.042), ('labels', 0.041), ('walk', 0.041), ('robert', 0.04), ('exponential', 0.04), ('conditions', 0.04), ('strategy', 0.037), ('lexp', 0.036), ('ubin', 0.036), ('mukherjee', 0.034), ('yoav', 0.034), ('demands', 0.032), ('ek', 0.032), ('player', 0.032), ('max', 0.032), ('payoffs', 0.031), ('li', 0.03), ('weight', 0.03), ('lemma', 0.03), ('degree', 0.03), ('satis', 0.029), ('singer', 0.029), ('adaptive', 0.029), ('states', 0.029), ('label', 0.029), ('chooses', 0.029), ('reachable', 0.028), ('equivalence', 0.027), ('min', 0.026), ('consistency', 0.026), ('eli', 0.026), ('multilabel', 0.026), ('diagonally', 0.026), ('strategies', 0.025), ('zt', 0.025), ('risk', 0.025), ('returned', 0.025), ('scoring', 0.024), ('lerr', 0.024), ('stumps', 0.024), ('training', 0.024), ('xl', 0.024), ('binary', 0.023), ('optimal', 0.023), ('state', 0.023), ('sam', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

2 0.24911797 114 jmlr-2013-The Rate of Convergence of AdaBoost

Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire

Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate

3 0.23938105 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

Author: Philip M. Long, Rocco A. Servedio

Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers

4 0.088665485 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

5 0.056752097 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

6 0.056477915 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

7 0.054703705 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

8 0.054241735 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

9 0.05305396 73 jmlr-2013-Multicategory Large-Margin Unified Machines

10 0.050428994 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

11 0.046993632 68 jmlr-2013-Machine Learning with Operational Costs

12 0.04413797 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

13 0.043854021 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

14 0.040908236 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

15 0.040734332 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

16 0.040153954 90 jmlr-2013-Quasi-Newton Method: A New Direction

17 0.036177602 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

18 0.036042884 120 jmlr-2013-Variational Algorithms for Marginal MAP

19 0.035530962 76 jmlr-2013-Nonparametric Sparsity and Regularization

20 0.034482725 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.238), (1, 0.142), (2, 0.055), (3, 0.147), (4, -0.096), (5, -0.124), (6, 0.229), (7, -0.168), (8, -0.293), (9, -0.05), (10, 0.202), (11, 0.165), (12, -0.077), (13, -0.176), (14, -0.003), (15, -0.072), (16, 0.055), (17, 0.131), (18, -0.022), (19, 0.009), (20, 0.117), (21, -0.034), (22, -0.023), (23, -0.151), (24, 0.01), (25, 0.055), (26, 0.178), (27, 0.157), (28, -0.02), (29, 0.088), (30, -0.071), (31, -0.044), (32, 0.039), (33, -0.063), (34, -0.124), (35, -0.011), (36, 0.032), (37, -0.08), (38, 0.066), (39, -0.037), (40, -0.041), (41, -0.026), (42, -0.015), (43, -0.113), (44, -0.0), (45, -0.067), (46, 0.021), (47, -0.038), (48, 0.007), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94340032 8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

2 0.80631018 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

Author: Philip M. Long, Rocco A. Servedio

Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using n · poly(1/γ) processors and running in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
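The positive result's main ingredient, gradient descent with a momentum term on a margin-learning problem, can be sketched in a few lines. Everything below (the toy γ-margin data, the logistic surrogate loss, and the step-size and momentum constants) is our own illustrative construction, not the algorithm analyzed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gamma-margin data (our own construction): unit-norm points in R^d,
# labeled by a unit-norm target halfspace w_star, keeping only points whose
# margin |x . w_star| is at least gamma = 0.1.
n, d = 500, 5
w_star = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
keep = np.abs(X @ w_star) >= 0.1
X = X[keep]
y = np.sign(X @ w_star)

def loss_grad(w):
    # smooth logistic surrogate of the margin error, and its gradient
    z = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -z))
    grad = -((y / (1.0 + np.exp(z)))[:, None] * X).mean(axis=0)
    return loss, grad

# Gradient descent with a momentum term; evaluating the gradient at a
# look-ahead point is the distinctive ingredient of Nesterov's method.
w, v, beta, eta = np.zeros(d), np.zeros(d), 0.9, 0.2
loss0, _ = loss_grad(w)
for _ in range(300):
    _, g = loss_grad(w + beta * v)   # look-ahead gradient
    v = beta * v - eta * g
    w = w + v
loss1, _ = loss_grad(w)

# The surrogate loss drops and w aligns with the target halfspace.
assert loss1 < loss0 and w @ w_star > 0
```

The point of the sketch is only the update structure (look-ahead gradient plus momentum); the paper's actual contribution concerns how this update parallelizes.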

3 0.70798206 114 jmlr-2013-The Rate of Convergence of AdaBoost

Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire

Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
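The coordinate-descent view of AdaBoost mentioned in the keywords can be sketched directly: each round picks the weak hypothesis with the largest weighted edge and takes the exact minimizing step along that coordinate of the exponential loss. The tiny label vector and hypothesis pool below are our own toy illustration (chosen so the data is not separable, so the loss has a positive infimum), not the paper's construction:

```python
import numpy as np

# Toy labels in {-1, +1} and a fixed pool of two weak hypotheses,
# one column of predictions per hypothesis (illustrative data only).
y = np.array([1, 1, -1, -1, 1])
H = np.array([[1, 1],
              [1, -1],
              [-1, -1],
              [1, -1],
              [1, 1]])

def exp_loss(alpha):
    return np.mean(np.exp(-y * (H @ alpha)))

alpha = np.zeros(H.shape[1], dtype=float)
losses = [exp_loss(alpha)]
for _ in range(20):
    w = np.exp(-y * (H @ alpha))        # AdaBoost's unnormalized weights
    edges = (w * y) @ H / w.sum()       # weighted edge of each hypothesis
    j = int(np.argmax(np.abs(edges)))   # greedy coordinate choice
    g = edges[j]
    alpha[j] += 0.5 * np.log((1 + g) / (1 - g))   # exact line minimization
    losses.append(exp_loss(alpha))

# Each round is an exact coordinate minimization, so the loss never
# increases; on this toy data it decreases toward its infimum of 0.4,
# which is never attained (the minimizer is at infinity).
assert all(b <= a + 1e-12 for a, b in zip(losses, losses[1:]))
```

The non-attained infimum is exactly the situation the abstract highlights: the convergence-rate question is meaningful even when no finite minimizer of the exponential loss exists.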

4 0.32916811 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

5 0.3107374 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.26333123 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

7 0.25220269 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

8 0.24538447 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

9 0.23835231 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

10 0.21511975 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

11 0.21164908 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

12 0.19724789 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

13 0.19492725 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

14 0.19462286 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections

15 0.19026147 76 jmlr-2013-Nonparametric Sparsity and Regularization

16 0.18470146 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

17 0.18325709 68 jmlr-2013-Machine Learning with Operational Costs

18 0.18265417 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.18168847 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

20 0.18160249 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (5, 0.666), (6, 0.025), (10, 0.042), (20, 0.013), (23, 0.025), (53, 0.012), (68, 0.015), (70, 0.022), (75, 0.025), (85, 0.018), (87, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99839628 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

Author: Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan, Yue Wang

Abstract: We describe an R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphic user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the codes more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, assuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug in additional R-functions. Keywords: convex analysis of mixtures, blind source separation, affinity propagation clustering, compartment modeling, information-based model selection © 2013 Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan and Yue Wang. 1. Overview Blind source separation (BSS) has proven to be a powerful and widely-applicable tool for the analysis and interpretation of composite patterns in engineering and science (Hillman and Moore, 2007; Lee and Seung, 1999). BSS is often described by a linear latent variable model X = AS, where X is the observation data matrix, A is the unknown mixing matrix, and S is the unknown source data matrix. The fundamental objective of BSS is to estimate both the unknown but informative mixing proportions and the source signals based only on the observed mixtures (Child, 2006; Cruces-Alvarez et al., 2004; Hyvarinen et al., 2001; Keshava and Mustard, 2002).
While many existing BSS algorithms can usefully extract interesting patterns from mixture observations, they often prove inaccurate or even incorrect in the face of real-world BSS problems in which the pre-imposed assumptions may be invalid. There is a family of approaches exploiting the source non-negativity, including the non-negative matrix factorization (NMF) (Gillis, 2012; Lee and Seung, 1999). This motivates the development of alternative BSS techniques involving exploitation of source nonnegative nature (Chan et al., 2008; Chen et al., 2011a,b; Wang et al., 2010). The method works by performing convex analysis of mixtures (CAM) that automatically identifies pure-source signals that reside at the vertices of the multifaceted simplex most tightly enclosing the data scatter, enabling geometrically-principled delineation of distinct source patterns from mixtures, with the number of underlying sources being suggested by the minimum description length criterion. Consider a latent variable model x(i) = As(i), where the observation vector x(i) = [x_1(i), ..., x_M(i)]^T can be expressed as a non-negative linear combination of the source vectors s(i) = [s_1(i), ..., s_J(i)]^T, and A = [a_1, ..., a_J] is the mixing matrix with a_j being the jth column vector. This falls neatly within the definition of a convex set (Fig. 1) (Chen et al., 2011a): X = { ∑_{j=1}^{J} s_j(i) a_j : a_j ∈ A, s_j(i) ≥ 0, ∑_{j=1}^{J} s_j(i) = 1, i = 1, ..., N }. Assuming that the sources have at least one sample point whose signal is exclusively enriched in a particular source (Wang et al., 2010), we have shown that the vertex points of the observation simplex (Fig. 1) correspond to the column vectors of the mixing matrix (Chen et al., 2011b). Via a minimum-error-margin volume maximization, CAM identifies the optimum set of the vertices (Chen et al., 2011b; Wang et al., 2010).
Using the samples attached to the vertices, compartment modeling (CM) (Chen et al., 2011a) obtains a parametric solution of A, nonnegative independent component analysis (nICA) (Oja and Plumbley, 2004) estimates A (and s) that maximizes the independency in s, and nonnegative well-grounded component analysis (nWCA) (Wang et al., 2010) finds the column vectors of A directly from the vertex cluster centers. Figure 1: Schematic and illustrative flowchart of R-Java CAM package. In this paper we describe a newly developed R-Java CAM package whose analytic functions are written in R, while a graphic user interface (GUI) is implemented in Java, taking full advantages of both programming languages. The core software suite implements CAM functions and includes normalization, clustering, and data visualization. Multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, which not only provides convenient data or parameter passing and visual progress monitoring but also assures the responsive execution of the entire CAM software. 2. Software Design and Implementation The CAM package mainly consists of R and Java modules. The R module is a collection of main and helper functions, each represented by an R function object and achieving an independent and specific task (Fig. 1). The R module mainly performs various analytic tasks required by CAM: figure plotting, update, or error message generation. The Java module is developed to provide a GUI (Fig. 2). We adopt the model-view-controller (MVC) design strategy, and use different Java classes to separately perform information visualization and human-computer interaction. The Java module also serves as the software driver and integrator that use a multi-thread strategy to facilitate the interactions between the R and Java modules, such as importing raw data, passing algorithmic parameters, calling R scripts, and transporting results and messages.
Figure 2: Interactive Java GUI supported by a multi-thread design strategy. 2.1 Analytic and Presentation Tasks Implemented in R The R module performs the CAM algorithm and facilitates a suite of subsequent analyses including CM, nICA, and nWCA. These tasks are performed by the three main functions: CAM-CM.R, CAM-nICA.R, and CAM-nWCA.R, which can be activated by the three R scripts: Java-runCAM-CM.R, Java-runCAM-ICA.R, and Java-runCAM-nWCA.R. The R module also performs auxiliary tasks including automatic R library installation, figure drawing, and result recording; and offers other standard methods such as nonnegative matrix factorization (Lee and Seung, 1999), Fast ICA (Hyvarinen et al., 2001), factor analysis (Child, 2006), principal component analysis, affinity propagation, k-means clustering, and expectation-maximization algorithm for learning standard finite normal mixture model. 2.2 Graphic User Interface Written in Java Swing The Java GUI module allows users to import data, select algorithms and parameters, and display results. The module encloses two packages: guiView contains classes for handling frames and dialogs for managing user inputs; guiModel contains classes for representing result data sets and for interacting with the R script caller. Packaged as one jar file, the GUI module runs automatically. 2.3 Functional Interaction Between R and Java We adopt the open-source program RCaller (http://code.google.com/p/rcaller) to implement the interaction between R and Java modules (Fig. 2), supported by explicitly designed R scripts such as Java-runCAM-CM.R. Specifically, five featured Java classes are introduced to interact with R for importing data or parameters, running algorithms, passing on or recording results, displaying figures, and handing over error messages.
The examples of these classes include guiModel.MyRCaller.java, guiModel.MyRCaller.readResults(), and guiView.MyRPlotViewer. 3. Case Studies and Experimental Results The CAM package has been successfully applied to various data types. Using a dynamic contrast-enhanced magnetic resonance imaging data set of an advanced breast cancer case (Chen et al., 2011b), “double click” (or command lines under Ubuntu) activated execution of CAM-Java.jar reveals two biologically interpretable vascular compartments with distinct kinetic patterns: fast clearance in the peripheral “rim” and slow clearance in the inner “core”. These outcomes are consistent with previously reported intratumor heterogeneity (Fig. 3). Figure 3: Application of R-Java CAM to deconvolving dynamic medical image sequence. Angiogenesis is essential to tumor development beyond 1–2 mm³. It has been widely observed that active angiogenesis is often observed in advanced breast tumors occurring in the peripheral “rim” with co-occurrence of inner-core hypoxia. This pattern is largely due to the defective endothelial barrier function and outgrowth blood supply. In another application to natural image mixtures, the CAM algorithm successfully recovered the source images in a large number of trials (see Users Manual). 4. Summary and Acknowledgements We have developed an R-Java CAM package for blindly separating mixed nonnegative sources. The open-source cross-platform software is easy-to-use and effective, validated in several real-world applications leading to plausible scientific discoveries. The software is freely downloadable from http://mloss.org/software/view/437/. We intend to maintain and support this package in the future. This work was supported in part by the US National Institutes of Health under Grants CA109872, CA 100970, and NS29525. We thank T.H. Chan, F.Y. Wang, Y. Zhu, and D.J. Miller for technical discussions. References T.H. Chan, W.K. Ma, C.Y. Chi, and Y. Wang. A convex analysis framework for blind separation of non-negative sources.
IEEE Transactions on Signal Processing, 56:5120–5143, 2008. L. Chen, T.H. Chan, P.L. Choyke, and E.M. Hillman et al. Cam-cm: a signal deconvolution tool for in vivo dynamic contrast-enhanced imaging of complex tissues. Bioinformatics, 27:2607–2609, 2011a. L. Chen, P.L. Choyke, T.H. Chan, and C.Y. Chi et al. Tissue-specific compartmental analysis for dynamic contrast-enhanced mr imaging of complex tumors. IEEE Transactions on Medical Imaging, 30:2044–2058, 2011b. D. Child. The essentials of factor analysis. Continuum International, 2006. S.A. Cruces-Alvarez, Andrzej Cichocki, and Shun-ichi Amari. From blind signal extraction to blind instantaneous signal separation: criteria, algorithms, and stability. IEEE Transactions on Neural Networks, 15:859–873, 2004. N. Gillis. Sparse and unique nonnegative matrix factorization through data preprocessing. Journal of Machine Learning Research, 13:3349–3386, 2012. E.M.C. Hillman and A. Moore. All-optical anatomical co-registration for molecular imaging of small animals using dynamic contrast. Nature Photonics, 1:526–530, 2007. A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, New York, 2001. N. Keshava and J.F. Mustard. Spectral unmixing. IEEE Signal Processing Magazine, 19:44–57, 2002. D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999. E. Oja and M. Plumbley. Blind separation of positive sources by globally convergent gradient search. Neural Computation, 16:1811–1825, 2004. F.Y. Wang, C.Y. Chi, T.H. Chan, and Y. Wang. Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:857–888, 2010.
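The latent model described above, where each observation x(i) = A s(i) is a convex combination of the mixing matrix's columns, can be sanity-checked in a few lines. The matrix A and the Dirichlet-sampled proportions below are our own toy choices, unrelated to the CAM package's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonnegative mixing matrix; its columns play the role of the
# "pure source" vertices a_j of the simplex described above.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.2, 0.3, 1.0]])

# Nonnegative proportions s(i), one column per sample, each summing to 1.
S = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=200).T   # shape (3, 200)
X = A @ S                                              # observations x(i) = A s(i)

# Every x(i) therefore lies in the convex hull of A's columns, and in this
# noise-free toy case the proportions are recovered exactly by least squares.
S_hat, *_ = np.linalg.lstsq(A, X, rcond=None)
assert np.allclose(S_hat, S) and np.allclose(S_hat.sum(axis=0), 1.0)
```

CAM's harder, realistic task is the reverse direction with A unknown: finding the vertices (columns of A) from the data scatter alone, which is where the simplex-fitting machinery above comes in.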

same-paper 2 0.9968304 8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

3 0.99620879 24 jmlr-2013-Comment on "Robustness and Regularization of Support Vector Machines" by H. Xu et al. (Journal of Machine Learning Research, vol. 10, pp. 1485-1510, 2009)

Author: Yahya Forghani, Hadi Sadoghi

Abstract: This paper comments on the published work dealing with robustness and regularization of support vector machines (Journal of Machine Learning Research, Vol. 10, pp. 1485-1510, 2009) by H. Xu et al. They proposed a theorem to show that it is possible to relate robustness in the feature space and robustness in the sample space directly. In this paper, we propose a counter example that rejects their theorem. Keywords: kernel, robustness, support vector machine (∗ Also at Islamic Azad University, Mashhad Branch, Mashhad, IRAN.) © 2013 Yahya Forghani and Hadi Sadoghi.
1. Comment Firstly, it must be stated that Xu et al. (2009) made a good study of robustness and regularization of support vector machines. They proposed the following theorem to show that it is possible to relate robustness in the feature space and robustness in the sample space directly:
Theorem (Xu et al., 2009) Suppose that the kernel function has the form k(x, x′) = f(‖x − x′‖), with f : R+ → R a decreasing function. Denote by H the RKHS space of k(·, ·) and φ(·) the corresponding feature mapping. Then for any x ∈ Rn, w ∈ H and c > 0, sup_{‖δ‖ ≤ c} ⟨w, φ(x − δ)⟩ = sup_{‖δφ‖_H ≤ √(2f(0) − 2f(c))} ⟨w, φ(x) − δφ⟩.
The following counter example rejects the mentioned theorem. However, this theorem is a standalone result in the appendix of the paper of Xu et al. (2009), which is not used anywhere else in the paper of Xu et al. (2009). Thus, the main result and all other results of Xu et al. (2009) are not affected in any way.
Counter example. Let φ(·) be the feature mapping of the Gaussian kernel function. We have ‖φ(x)‖_H = 1. Let w = φ(x). Therefore ⟨w, φ(x)⟩ = ‖w‖_H, and sup_{‖δ‖ ≤ c} ⟨w, φ(x − δ)⟩ = ‖w‖_H. (1)
Moreover, sup_{‖δφ‖_H ≤ √(2f(0) − 2f(c))} ⟨w, φ(x) − δφ⟩ = ⟨w, φ(x)⟩ + sup_{‖δφ‖_H ≤ √(2f(0) − 2f(c))} ⟨w, −δφ⟩ = ‖w‖_H + ‖w‖_H √(2f(0) − 2f(c)). (2)
According to Equations (1) and (2), and since f is a decreasing function, for any c > 0 we have sup_{‖δ‖ ≤ c} ⟨w, φ(x − δ)⟩ < sup_{‖δφ‖_H ≤ √(2f(0) − 2f(c))} ⟨w, φ(x) − δφ⟩.
End of counter example. The exact spot where the error occurred in the mentioned theorem is Equation (19) of the paper of Xu et al. (2009). There it has been claimed that the image of the RKHS feature mapping is dense, which unfortunately is not true. Indeed, because ⟨φ(x), φ(x)⟩ = K(0), where K(·) is the kernel function, the image of the feature mapping lies in a ball of radius √K(0). References Huan Xu, Shie Mannor, and Constantine Caramanis. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, 2009.
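The counter example can also be checked numerically. The specific kernel profile f(t) = exp(−t²) and the budget c = 1 below are our own illustrative choices of a Gaussian kernel:

```python
import numpy as np

def f(t):
    # Gaussian kernel profile: k(x, x') = f(||x - x'||), with f(t) = exp(-t^2)
    return np.exp(-t ** 2)

c = 1.0  # arbitrary perturbation budget, any c > 0 works

# Left side of the claimed identity: with w = phi(x),
# <w, phi(x - delta)> = k(x, x - delta) = f(||delta||), so the sup over
# ||delta|| <= c is attained at delta = 0 and equals f(0) = 1 = ||w||_H.
lhs = max(f(t) for t in np.linspace(0.0, c, 1001))

# Right side: the feasible choice delta_phi = -r * w with
# r = sqrt(2 f(0) - 2 f(c)) already gives <w, phi(x)> + r = 1 + r,
# so the right-hand sup is strictly larger than the left-hand one.
r = np.sqrt(2 * f(0.0) - 2 * f(c))
rhs_lower_bound = 1.0 + r

assert lhs < rhs_lower_bound  # the two sides cannot be equal for any c > 0
```

This is just Equations (1) and (2) above evaluated for one concrete f and c; the strict gap is what contradicts the claimed equality.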

4 0.99434686 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

Author: Bruno Scherrer

Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyses developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms

5 0.99232799 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

Author: Naftali Harris, Mathias Drton

Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
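The rank-correlation idea behind "Rank PC" can be illustrated on a toy Gaussian copula pair. The data, the monotone map exp(), and all constants below are our own sketch; the sin transform is the classical Spearman-to-Pearson identity for the bivariate normal, ρ = 2 sin(πρ_s/6):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonparanormal pair: a bivariate normal sample with latent correlation
# rho, with one margin pushed through the monotone map exp(), which distorts
# Pearson correlation but leaves the ranks untouched.
rho = 0.6
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=20000)
x, y = z[:, 0], np.exp(z[:, 1])

def ranks(v):
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

# Rank-based estimate of the latent Gaussian correlation via the identity
# rho = 2 sin(pi * rho_s / 6), which holds exactly for the bivariate normal.
rho_hat = 2.0 * np.sin(np.pi * spearman / 6.0)

# The plain Pearson correlation, by contrast, is biased by the nonlinearity.
pearson = np.corrcoef(x, y)[0, 1]
assert abs(rho_hat - rho) < abs(pearson - rho)
```

Estimates of this rank-based kind are what let the PC algorithm's partial-correlation tests carry over from Gaussian data to the wider nonparanormal class without paying for the unknown marginal transformations.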

6 0.97644395 15 jmlr-2013-Bayesian Canonical Correlation Analysis

7 0.9614228 114 jmlr-2013-The Rate of Convergence of AdaBoost

8 0.90446442 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

9 0.89690459 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

10 0.89324188 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

11 0.89062572 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

12 0.87803984 73 jmlr-2013-Multicategory Large-Margin Unified Machines

13 0.876028 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

14 0.87323153 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

15 0.86993611 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

16 0.86021739 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

17 0.85495794 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

18 0.8513515 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

19 0.85102659 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

20 0.84845728 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion