jmlr jmlr2012 jmlr2012-8 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
Reference: text
Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated.
The existence of any such method was unknown until the breakthrough result of Schapire (1990): under a weak learning assumption, it is possible to combine many carefully chosen weak learners into a majority of majorities with arbitrarily low training error.
Given the wide practical success of boosting with the logistic loss, it is perhaps surprising that no convergence rate better than O(exp(1/ε^2)) was known, even under the weak learning assumption (Bickel et al.
In particular, it is shown that the (disjoint) scenarios of weak learnability (Section 6.
This primal formulation of boosting obscures a key internal mechanism: boosting iteratively constructs distributions where the previously selected weak learner fails.
This view is recovered in the dual problem; specifically, Section 3 reveals that the dual feasible set is the collection of distributions where all weak learners have no correlation to the target, and the dual objective is a max entropy rule.
The dual optimum is always attainable; since a standard mechanism in convergence analysis is to control the distance to the optimum, why not overcome the unattainability of the primal optimum by working in the dual?
It turns out that the classical weak learning rate was a mechanism to control distances in the dual all along; by developing a suitable generalization (Section 4), it is possible to convert the improvement due to a single step of coordinate descent into a relevant distance in the dual (Section 6).
The classical scenarios of attainability and weak learnability are identifiable directly from the weak learning class and training sample; moreover, they can be entirely characterized by properties of the primal and dual problems.
This set is central: for instance, the dual optimum (regardless of the loss function) places positive weight on exactly the hard core.
Weak learnability corresponds to the hard core being empty, and attainability corresponds to it being the whole training set.
For those instances where the hard core is a nonempty proper subset of the training set, the behavior on and off the hard core mimics attainability and weak learnability, and Section 6.
Of perhaps practical interest, Section D provides methods to select the step size, meaning the weight with which new weak learners are included in the full predictor.
(2011) used the same boosting instance (due to Schapire, 2010) to produce lower bounds, and also decomposed the boosting problem into finite and infinite margin pieces (cf.
These methods generally impose some form of regularization (Shalev-Shwartz and Singer, 2008), which grants attainability of the risk minimizer and allows standard techniques to grant general convergence rates.
Let a_i denote the ith row of A, corresponding to the example (x_i, y_i), and let {h_j : j = 1, ..., n} index the set of weak learners corresponding to columns of A.
As another example, if the weak learners are binary and H has VC dimension d, then Sauer's lemma grants that A has at most (m + 1)^d columns.
This matrix view of boosting is thus similar to the interpretation of boosting as performing descent in functional space (Mason et al.
Thus the infimum is never attainable when weak learnability holds.
In the case of the exponential loss and binary weak learners, the line search (when attainable) has a convenient closed form; but for other losses, and even with the exponential loss but with
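In that convenient case, the stationarity condition of the exponential-loss line search with a binary weak learner solves in closed form; a minimal sketch (function name and data hypothetical):

```python
import math

def exp_loss_step(weights, correct):
    """Exact line search for the exponential loss with a binary (+1/-1)
    weak learner.  `weights` holds the current (unnormalized) example
    weights, and `correct[i]` is True when the weak learner agrees with
    the label y_i.  Setting the derivative of
        alpha -> sum_i w_i * exp(-alpha if correct[i] else +alpha)
    to zero gives alpha = 0.5 * ln(W_plus / W_minus)."""
    w_plus = sum(w for w, c in zip(weights, correct) if c)
    w_minus = sum(w for w, c in zip(weights, correct) if not c)
    return 0.5 * math.log(w_plus / w_minus)

# sanity check: the line-search derivative vanishes at the returned step
weights = [0.5, 1.25, 0.25, 2.0]
correct = [True, True, False, True]
alpha = exp_loss_step(weights, correct)
deriv = sum(w * (-1.0 if c else 1.0) * math.exp(-alpha if c else alpha)
            for w, c in zip(weights, correct))
```

When both weight sums are positive this recovers the familiar AdaBoost step α = ½ ln((1 − err)/err); when W_minus = 0 the line search diverges, matching the unattainability discussion above.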
the intuitive operation of boosting as generating distributions where the current predictor is highly erroneous, and requesting weak learners accurate on these tricky distributions.
In addition to illuminating the structure of boosting, the dual problem also makes a concrete contribution to the optimization behavior, and specifically to the convergence rates: the dual optimum is always attainable.
The dual problem will make use of Fenchel conjugates (Hiriart-Urruty and Lemaréchal, 2001; Borwein and Lewis, 2000); for any function h, the conjugate is h*(φ) = sup_{x ∈ dom(h)} (⟨x, φ⟩ − h(x)).
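As a quick check on this definition, the conjugate of g(x) = exp(x) is the entropy-like function g*(φ) = φ ln φ − φ (the supremum is attained at x = ln φ), which is the source of the maximum entropy reading of the dual; a brute-force numerical sketch (helper names hypothetical):

```python
import math

def conjugate_exp(phi):
    """Fenchel conjugate of g(x) = exp(x): sup_x (x*phi - exp(x)) is
    attained at x = ln(phi), giving phi*ln(phi) - phi (for phi > 0)."""
    return phi * math.log(phi) - phi

def conjugate_numeric(phi, lo=-20.0, hi=5.0, steps=200000):
    """Brute-force the supremum of x*phi - exp(x) over a fine grid."""
    step = (hi - lo) / steps
    return max(x * phi - math.exp(x)
               for x in (lo + k * step for k in range(steps + 1)))
```

For instance, conjugate_exp(1.0) is −1, agreeing with the grid maximum of x − exp(x) at x = 0.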
There is one more object to present, the dual feasible set Φ_A.
Since ψ ∈ Ker(A^⊤), this is a weighting of examples which decorrelates all weak learners from the target: in particular, for any primal weighting λ ∈ R^n over weak learners, ψ^⊤ A λ = 0.
The right hand side is the dual problem, and moreover the dual optimum, denoted ψ_A^f, is unique and attainable.
Remark 6. It may be tempting to use Theorem 4 to produce a stopping condition; that is, if for a supplied ε > 0, a primal iterate λ′ and dual feasible ψ′ ∈ Φ_A can be found satisfying f(Aλ′) + f*(ψ′) ≤ ε, Boost may terminate with the guarantee f(Aλ′) − f̄_A ≤ ε.
This section will generalize the weak learning rate into a quantity which can be made positive for any boosting instance.
Returning to task, the weak learning assumption posits the existence of a positive constant, the weak learning rate γ, which lower bounds the correlation of the best weak learner with the target for any distribution.
Stated in terms of the matrix A,

0 < γ = inf_{φ ∈ R^m_+, ‖φ‖_1 = 1} max_{j ∈ [n]} Σ_{i=1}^m φ_i y_i h_j(x_i) = inf_{φ ∈ R^m_+ \ {0_m}} ‖A^⊤φ‖_∞ / ‖φ‖_1 = inf_{φ ∈ R^m_+ \ {0_m}} ‖A^⊤φ‖_∞ / ‖φ − 0_m‖_1.  (3)

Proposition 7. A boosting instance is weak learnable iff Φ_A = {0_m}.
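For a toy illustration of Equation 3 (helper hypothetical; M below is the matrix of edges y_i h_j(x_i), which may differ from the paper's A by a sign that the norms absorb; by homogeneity of the ratio, the infimum may be restricted to the probability simplex, here brute-forced for m = 2 examples):

```python
def weak_learning_rate(M, grid=10001):
    """gamma = inf over probability vectors phi of max_j |(M^T phi)_j|,
    brute-forced over the 1-simplex; written only for m = 2 examples."""
    assert len(M) == 2
    best = float("inf")
    for k in range(grid):
        p = k / (grid - 1)
        phi = (p, 1.0 - p)
        best = min(best, max(abs(phi[0] * M[0][j] + phi[1] * M[1][j])
                             for j in range(len(M[0]))))
    return best

# a weak learner correct on both examples: gamma = 1 > 0
gamma_separable = weak_learning_rate([[1.0], [1.0]])
# the same learner right on one example, wrong on the other:
# phi = (1/2, 1/2) decorrelates it, so gamma = 0 and Phi_A != {0_m}
gamma_degenerate = weak_learning_rate([[1.0], [-1.0]])
```

The second instance is exactly the failure mode Proposition 7 describes: a nonzero dual feasible point exists, so the classical weak learning rate collapses to zero.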
Following this connection, the first way in which the weak learning rate is modified is to replace {0_m} with the dual feasible set Φ_A = Ker(A^⊤) ∩ R^m_+.
First note that in the scenario of weak learnability (i.
Meanwhile, the left hand side has reduced the influence of A to a single number, and the normed expression is the distance to a restriction of the dual feasible set, which will converge to zero if the infimum is to be approached, so long as this restriction contains the dual optimum.
As a final connection, since A^⊤ P^1_{S ∩ Ker(A^⊤)}(φ) = 0_n, note that

γ(A, S) = inf_{φ ∈ S \ Ker(A^⊤)} ‖A^⊤φ‖_∞ / ‖φ − P^1_{S ∩ Ker(A^⊤)}(φ)‖_1 = inf_{φ ∈ S \ Ker(A^⊤)} ‖A^⊤(φ − P^1_{S ∩ Ker(A^⊤)}(φ))‖_∞ / ‖φ − P^1_{S ∩ Ker(A^⊤)}(φ)‖_1.
Optimization Structure. The scenario of weak learnability translates into a simple condition on the dual feasible set: the dual feasible set is the origin (in symbols, Φ_A = Ker(A^⊤) ∩ R^m_+ = {0_m}).
This section will identify the structure of the boosting optimization problem both in terms of the primal and dual problems, first studying the scenarios of weak learnability and attainability, and then showing that general instances can be decomposed into these two.
The dual feasible set Φ_A = Ker(A^⊤) ∩ R^m_+ is the set of nonnegative weightings of examples under which every weak learner (every column of A) has zero correlation; what is the support of these weightings?
Definition 10. H(A) denotes the hard core of A: the collection of examples which receive positive weight under some dual feasible point, a distribution upon which no weak learner is correlated with the target.
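A brute-force sketch of this definition on a toy instance (helper hypothetical and hard-coded to m = 3 examples; M is the matrix of edges y_i h_j(x_i)): the two hypotheses below always disagree on examples 0 and 1 but are always correct on example 2, so the hard core is {0, 1} while example 2 sits in the weak-learnable part.

```python
from fractions import Fraction

def hard_core(M, denom=40):
    """H(M): examples receiving positive weight under some nonnegative
    psi with M^T psi = 0.  Brute force over a rational grid on the
    simplex (sum psi_i = 1), so the kernel constraint is checked
    exactly; only written for m = 3 examples."""
    n = len(M[0])
    core = set()
    for a in range(denom + 1):
        for b in range(denom + 1 - a):
            psi = [Fraction(a, denom), Fraction(b, denom),
                   Fraction(denom - a - b, denom)]
            if all(sum(psi[i] * M[i][j] for i in range(3)) == 0
                   for j in range(n)):
                core |= {i for i in range(3) if psi[i] > 0}
    return core

core = hard_core([[1, -1], [-1, 1], [1, 1]])
```

Here the only feasible distributions put equal weight on examples 0 and 1 and none on example 2, so `core` comes out as {0, 1}, a nonempty proper subset of the training set.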
One case has already been considered; as established in Theorem 7, weak learnability is equivalent to Φ_A = {0_m}, which in turn is equivalent to |H(A)| = 0.
Indeed, contrasted with the primal and dual problems and feasible sets, H(A) will provide a conceptually simple, discrete object with which to comprehend the behavior of boosting.
Figure 4: Geometric view of the primal and dual problem, under weak learnability.
First note that Equation 9 indicates (via Theorem 7) that this is indeed the weak learnability setting, equivalently |H(A)| = 0.
Recall the earlier discussion of boosting as searching for a halfspace containing the points {−a_i}_{i=1}^m = {−e_i^⊤ A}_{i=1}^m; Equation 6 encodes precisely this statement, and moreover that there exists such a halfspace with these points interior to it.
Equation 9 and Equation 6 can be interpreted geometrically, as depicted in Figure 4: the dual feasibility statement is that no convex combination of {−a_i}_{i=1}^m will contain the origin.
Next, Equation 7 is the (error part of the) usual strong PAC guarantee (Schapire, 1990): weak learnability entails that the training error will go to zero.
Then, since f > 0 and lim_{x→−∞} g(x) = 0,

inf_λ f(Aλ) ≤ lim_{i→∞} f(c_i Aλ̄) = 0 ≤ inf_λ f(Aλ).

The point 0_m is always dual feasible, and inf_λ f(Aλ) = 0 = −f*(0_m).
Since the dual optimum is unique (Theorem 4), ψ_A^f = 0_m.
Note that 0-coercivity means the domain of the infimum in Equation 1 can be restricted to a compact set, and attainability in turn follows just from properties of minimization of continuous functions on compact sets.
For the converse, note
Figure 5: Geometric view of the primal and dual problem, under attainability.
Finally, note Equation 12: it is not only the case that there are dual feasible points fully interior to R^m_+, but furthermore the dual optimum is also interior.
General Setting. So far, the scenarios of weak learnability and attainability corresponded to the extremal hard core cases of |H(A)| ∈ {0, m}.
The main result of this section will have the same two main ingredients as Proposition 16: • The full boosting instance may be uniquely decomposed into two pieces, A_0 and A_+, each of which individually behaves like the weak learnability and attainability scenarios.
To continue with geometric interpretations, notice that Equation 15 states that there exists a halfspace strictly containing those points in [m] \ H(A), with all points of H(A) on its boundary; furthermore, trying to adjust this halfspace to contain elements of H(A) will place others outside it.
With regards to the geometry of the dual feasible set as provided by Equation 18, the origin is within the relative interior of the points corresponding to H(A); however, the convex hull of the other m − |H(A)| points cannot contain the origin.
Then

inf_λ f(B_+λ) = lim_{i→∞} ( f(B_+λ_i) + f(c_i B_0 λ̄) ) ≥ inf_λ f(Aλ) = inf_λ ( f(B_+λ) + f(B_0λ) ) ≥ inf_λ f(B_+λ),

Figure 6: Geometric view of the primal and dual problem in the general case.
Thus

−f*(ψ_A^f) = sup_{ψ ∈ Φ_A} −f*(ψ) = sup { −f*(ψ_z) − f*(ψ_p) : ψ_z ∈ R^z_+, ψ_p ∈ R^p_+, B_0^⊤ψ_z + B_+^⊤ψ_p = 0_n } ≥ sup_{ψ_z ∈ Φ_{B_0}} −f*(ψ_z) + sup_{ψ_p ∈ Φ_{B_+}} −f*(ψ_p) = 0 − f*(ψ_{B_+}^f) = inf_{λ ∈ R^n} f(B_+λ) = inf_{λ ∈ R^n} f(Aλ) = −f*(ψ_A^f).

Correspondingly, by Theorem 14, the dual optimum ψ_{A_q}^f of this restricted problem will have only positive entries.
Therefore this restriction ψ̄_A^f of ψ_A^f to I_q will have at least one zero entry, meaning it cannot be equal to ψ_{A_q}^f; but Theorem 4 provided that the dual optimum is unique, thus −f*(ψ_{A_q}^f) > −f*(ψ̄_A^f).
The dual optimization in Theorem 4 can then be interpreted as selecting the max entropy choice (per f*) amongst those convex combinations of H(A) equal to the origin.
However, outside the weak learnability case, the other terms in the bounds here can also incur a large penalty with the exponential loss, and there is some evidence that this is unavoidable (see the lower bounds in Mukherjee et al.
As discussed previously, standard techniques for analyzing descent methods provide such bounds in terms of gradients; however, to overcome the difficulty of unattainability in the primal space, the key will be to convert this into distances in the dual via γ(A, S), as in Equation 5.
The task now is to manage the dual distance D^1_{S ∩ Ker(A^⊤)}(∇f(Aλ_t)), specifically to produce a relation to f(Aλ_t) − f̄_A, the total suboptimality in the preceding iteration; from there, standard tools in convex optimization will yield convergence rates.
Matching the problem structure revealed in Section 5, first the extremal cases of weak learnability and attainability will be handled, and only then the general case.
This has an immediate consequence for the task of relating f(Aλ_t) − f̄_A to the dual distance: f is a strictly convex function, which means it is strongly convex over any compact set.
This also provides the choice of polyhedron S in γ(A, S): unlike the case of weak learnability, where the unbounded set R^m_+ was used, a compact set containing the initial level set will be chosen.
To later show γ(A, S) > 0 via Theorem 9, S must be polyhedral, and to apply Proposition 20, it must contain the dual iterates {∇f(Aλ_t)}_{t=1}^∞; the easiest choice then is to take the bounding box C of the initial level set, and use its dual map ∇f(C).
To exhibit dual feasible points within ∇f(C), note that C will contain a primal minimizer, and optimality conditions grant that ∇f(C) contains the dual optimum.
Proof of Lemma 25. Consider the optimization problem

inf_{x ∈ S} inf_{φ ∈ R^m, ‖φ‖_2 = 1} ⟨∇²f(x)φ, φ⟩ = inf_{x ∈ S} inf_{φ ∈ R^m, ‖φ‖_2 = 1} Σ_{i=1}^m g″(x_i) φ_i²;

since S is compact and g″ and (·)² are continuous, the infimum is attainable.
Consider the dual element P²_{D ∩ K}(∇f(x)) (which exists since D ∩ K ≠ ∅); due to the projection, it is dual feasible, and thus it must follow from Theorem 4 that f̄_A = sup{−f*(ψ) : ψ ∈ Φ_A} ≥ −f*(P²_{D ∩ K}(∇f(x))).
What is really granted by attainability is that every iterate lies well within the interior of dom(f*), and therefore these Bregman divergences, which depend on ∇f*, cannot become too wild.
On the other hand, with weak learnability, all dual weights go to zero (cf.
3 was that general instances may be decomposed uniquely into two smaller pieces, one satisfying attainability and the other satisfying weak learnability, and that these smaller problems behave somewhat independently.
This independence is leveraged here to produce convergence rates relying upon the existing rate analysis for the attainable and weak learnable cases.
And with the exponential loss, taking S := {λ ∈ R^n : f(Aλ) ≤ f(Aλ_0)} to denote the initial level set, since 0_m is always dual feasible,

w ≤ sup_{λ ∈ S} ‖∇f(Aλ)‖_1 = sup_{λ ∈ S} f(Aλ) = f(Aλ_0) = m.
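That last chain of equalities rests on a simple identity worth checking numerically: for the exponential loss, ∇f(x) has entries exp(x_i) > 0, so ‖∇f(x)‖_1 = f(x), and at λ_0 = 0_n this equals the number of examples m (the values below are arbitrary):

```python
import math

x = [0.3, -1.2, 2.0]                          # an arbitrary margin vector
f_x = sum(math.exp(xi) for xi in x)           # exponential loss f(x)
grad_l1 = sum(abs(math.exp(xi)) for xi in x)  # ||grad f(x)||_1

m = len(x)
f_at_zero = sum(math.exp(0.0) for _ in range(m))  # f(0_m) = m
```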
Returning to the total suboptimality expression in Equation 22, these dual distance bounds yield

f(Aλ_t) − f̄_A ≤ β D^1_{Φ_{A_0}}(∇f(A_0λ_t)) + (w/(2c)) D^1_{∇f(C_+) ∩ Ker(A_+^⊤)}(∇f(A_+λ_t)) ≤ (β + w/(2c)) D^1_{(R^{m_0}_+ × ∇f(C_+)) ∩ Ker(A^⊤)}(∇f(Aλ_t)),

the second step using Lemma 47.
In order to produce a rate O(ln(1/ε)) under attainability, strong convexity related the suboptimality to a squared dual distance ‖·‖_1² (cf.
On the other hand, the rate O(ln(1/ε)) under weak learnability came from a fortuitous cancellation with the denominator f(Aλ_t) (cf.
Although strong convexity in the primal grants the existence of a lower bounding quadratic, it grants upper bounds in the dual.
For any x < y,

g′(y) = g′(x) + ∫_x^y g″(t) dt ≥ g′(x) + (y − x) inf_{t ∈ [x, y]} g″(t) > g′(x),

thus g′ strictly increases, granting strict convexity (Hiriart-Urruty and Lemaréchal, 2001, Theorem B.
Instead consider requiring the weak learning oracle to return some hypothesis at least a fraction c_0 ∈ (0, 1] as good as the best weak learner in the class; written in the present framework, the direction v must satisfy v^⊤∇(f ∘ A)(λ_t) ≤ −c_0 ‖A^⊤∇f(Aλ_t)‖_∞.
Taking the form of the classical weak learning rate from Equation 3 as a model, the template generalized weak learning rate is

γ′(A, S, C, D) := inf_{φ ∈ S \ C} ‖A^⊤φ‖_∞ / inf_{ψ ∈ S ∩ D} ‖φ − ψ‖_1,

for some sets S, C, and D (for instance, the classical weak learning rate uses S = R^m_+ and C = D = {0_m}).
This form agrees with the original γ when weak learnability holds, and will lead to a very convenient analog to Equation 5.
Then

inf_{φ ∈ R^m_+ \ Ker(S^⊤)} ‖S^⊤φ‖_∞ / ‖φ − ψ_S^f‖_1 ≤ inf_{α ∈ (0,1)} ‖S^⊤(φ_α + ψ_α)‖_∞ / ‖φ_α + ψ_α − ψ_S^f‖_1 = 0.
For the lower bound, by Lemma 42 and norm equivalence,

γ(A, S) = inf_{φ ∈ S \ Ker(A^⊤)} ‖A^⊤φ‖_∞ / inf_{ψ ∈ S ∩ Ker(A^⊤)} ‖φ − ψ‖_1 ≥ (1/√(mn)) inf_{φ ∈ S \ Ker(A^⊤)} ‖A^⊤φ‖_2 / inf_{ψ ∈ S ∩ Ker(A^⊤)} ‖φ − ψ‖_2 > 0.

Moreover, since f̄_A ≤ f(0_m) and d ≥ −f*(0_m) = 0, the optimum is finite, and thus the same theorem grants that it is attainable in the dual.
(Note that the negation was simply to be able to interpret feasible dual variables as nonnegative measures.
Specifically, if there were some other optimal ψ′ ≠ ψ, the point (ψ + ψ′)/2 is dual feasible and has strictly larger objective value, a contradiction.
and the proof of Theorem 4, where a negation was inserted into the dual to allow dual points to be interpreted as nonnegative measures).
But the dual optimum is dual feasible, and Aλ̄ ∈ S_d, so ∇f(C) ∩ Φ_A ⊇ {∇f(Aλ̄)} ∩ Φ_A = {ψ_A^f} ∩ Φ_A ≠ ∅.
By the nature of exact line search, the coordinates of λ are updated in alternation (with arbitrary initial choice); thus let u_t denote the value of the coordinate updated in iteration t, and v_t be the one which is held fixed.
The objective function, written in terms of (u_t, v_t), is

ln(1 + exp(v_t − u_t)) + ln(1 + exp(u_t − v_t)) + ln(1 + exp(−u_t − v_t)) = ln( 2 + exp(v_t − u_t) + exp(u_t − v_t) + 2 exp(−u_t − v_t) + exp(−2u_t) + exp(−2v_t) ).

In particular, producing this equality and multiplying both sides by the (nonzero) denominator yields

−exp(v_t − u_t) + exp(u_t − v_t) − 2 exp(−u_t − v_t) − 2 exp(−2u_t) = 0.

Multiplying by exp(u_t + v_t) and rearranging, it follows that, after line search, u_t and v_t must satisfy

exp(2u_t) = exp(2v_t) + 2 exp(v_t − u_t) + 2.  (35)

First it will be shown, for t ≥ 1 and by induction, that u_t ≥ v_t.
Now the inductive hypothesis grants u_t ≥ v_t; the case u_t = v_t can be directly handled by Equation 35, thus suppose u_t > v_t.
To finish, recall by Taylor expansion that ln(1 + q) ≥ q − q²/2; consequently, for t ≥ 1,

f(Sλ_t) − f̄_S ≥ ln(1 + 1/(4t)) ≥ 1/(4t) − 1/(32t²) ≥ 1/(8t).
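The Θ(1/t) behavior of this instance can also be observed numerically; a simulation sketch of alternating exact coordinate line search on the three-term logistic objective above (initial point and alternation order chosen arbitrarily here; the infimum is 2 ln 2, approached as u = v → ∞):

```python
import math

def sigma(z):
    # numerically stable logistic sigmoid
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def f(u, v):
    # the three-example logistic objective from the text
    return (math.log(1.0 + math.exp(v - u))
            + math.log(1.0 + math.exp(u - v))
            + math.log(1.0 + math.exp(-u - v)))

def du(u, v):
    # partial derivative in u; f is convex in each coordinate
    return sigma(u - v) - sigma(v - u) - sigma(-u - v)

def line_search(other):
    # exact 1-D line search: bisect on the monotone derivative; since
    # f(u, v) = f(v, u), the same routine serves both coordinates
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if du(mid, other) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

fbar = 2.0 * math.log(2.0)   # the infimum of f
u = v = 0.0
subopt = []
for t in range(1, 61):
    if t % 2 == 1:
        u = line_search(v)   # update coordinates in alternation
    else:
        v = line_search(u)
    subopt.append(f(u, v) - fbar)
```

The recorded suboptimalities stay strictly positive, decrease monotonically, and after 60 iterations are of order 1/t, consistent with the lower bound derived above.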
On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms.
simIndex simValue paperId paperTitle
same-paper 1 1.000001 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
2 0.1287519 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.12120181 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ R^{n×m} (m ≫ n) and a noisy observation vector y ∈ R^n satisfying y = Xβ* + ε, where ε is the noise vector following a Gaussian distribution N(0, σ²I), how to recover the signal (or parameter vector) β* when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β*. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β*, measured in terms of the ℓ_p norm (p ≥ 1), is bounded as ‖β̂ − β*‖_p ≤ (C(s − N)^{1/p} √(log m) + Δ)σ, where C is a constant, s is the number of nonzero entries in β*, the risk of the oracle estimator Δ is independent of m and is much smaller than the first term, and N is the number of entries of β* larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from C s^{1/p} √(log m) σ to C (s − N)^{1/p} √(log m) σ, where the value N depends on the number of large entries in β*. When N = s, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector y onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case. Keywords: multi-stage, Dantzig selector, LASSO, sparse signal recovery
4 0.10595021 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
Author: Blaine Nelson, Benjamin I. P. Rubinstein, Ling Huang, Anthony D. Joseph, Steven J. Lee, Satish Rao, J. D. Tygar
Abstract: Classifiers are often used to detect miscreant activities. We study how an adversary can systematically query a classifier to elicit information that allows the attacker to evade detection while incurring a near-minimal cost of modifying their intended malfeasance. We generalize the theory of Lowd and Meek (2005) to the family of convex-inducing classifiers that partition their feature space into two sets, one of which is convex. We present query algorithms for this family that construct undetected instances of approximately minimal cost using only polynomially-many queries in the dimension of the space and in the level of approximation. Our results demonstrate that near-optimal evasion can be accomplished for this family without reverse engineering the classifier's decision boundary. We also consider general ℓ_p costs and show that near-optimal evasion on the family of convex-inducing classifiers is generally efficient for both positive and negative convexity for all levels of approximation if p = 1. Keywords: query algorithms, evasion, reverse engineering, adversarial learning
5 0.093084119 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. BoostMetric is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
6 0.0872209 82 jmlr-2012-On the Necessity of Irrelevant Variables
7 0.083892792 111 jmlr-2012-Structured Sparsity and Generalization
8 0.083177917 80 jmlr-2012-On Ranking and Generalization Bounds
9 0.066304088 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
10 0.061991282 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
11 0.058586534 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.0565206 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
13 0.054649267 73 jmlr-2012-Multi-task Regression using Minimal Penalties
14 0.052380525 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
15 0.051420923 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
16 0.051381387 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
17 0.048809469 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
18 0.048723876 91 jmlr-2012-Plug-in Approach to Active Learning
19 0.046004936 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
20 0.043760687 81 jmlr-2012-On the Convergence Rate of ℓp-Norm Multiple Kernel Learning
topicId topicWeight
[(0, -0.229), (1, 0.014), (2, -0.112), (3, 0.069), (4, -0.232), (5, 0.057), (6, 0.081), (7, -0.053), (8, 0.045), (9, 0.146), (10, -0.095), (11, 0.002), (12, 0.105), (13, 0.173), (14, -0.022), (15, -0.113), (16, 0.087), (17, -0.018), (18, -0.022), (19, 0.077), (20, 0.171), (21, 0.06), (22, -0.084), (23, 0.086), (24, 0.089), (25, 0.008), (26, 0.141), (27, -0.069), (28, -0.031), (29, 0.01), (30, 0.023), (31, 0.048), (32, -0.057), (33, -0.024), (34, 0.025), (35, 0.036), (36, -0.025), (37, -0.042), (38, 0.018), (39, -0.08), (40, 0.093), (41, -0.033), (42, 0.053), (43, -0.01), (44, 0.047), (45, -0.082), (46, 0.009), (47, -0.167), (48, -0.042), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.92539495 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
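The primal side of this story — boosting as coordinate descent on a convex empirical loss over linear combinations of weak learners — can be sketched for the exponential loss with a finite, ±1-valued hypothesis matrix. A toy illustration with AdaBoost-style closed-form steps, not the manuscript's general algorithm; the dual distribution `q` over examples is the object the abstract describes:

```python
import numpy as np

def boost_exp_loss(H, y, rounds=50):
    """Exact coordinate descent on the exponential loss.

    H : (n, d) matrix of weak-learner predictions in {-1, +1}
    y : (n,) labels in {-1, +1}
    Each round builds the dual distribution q over examples, picks the
    weak learner most correlated with the labels under q, and takes the
    closed-form minimizing step (AdaBoost's alpha for binary learners)."""
    n, d = H.shape
    lam = np.zeros(d)
    for _ in range(rounds):
        margins = y * (H @ lam)
        q = np.exp(-margins)
        q /= q.sum()                      # dual distribution over examples
        corr = (q * y) @ H                # "edge" of each weak learner
        j = np.argmax(np.abs(corr))
        r = np.clip(corr[j], -1 + 1e-12, 1 - 1e-12)
        lam[j] += 0.5 * np.log((1 + r) / (1 - r))
    return lam, float(np.mean(np.exp(-y * (H @ lam))))
```

Under weak learnability — some column keeps a uniformly positive edge under every distribution — the returned empirical risk decays geometrically in the number of rounds, which is the O(ln(1/ε)) regime the abstract describes.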
2 0.51500559 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
Author: Blaine Nelson, Benjamin I. P. Rubinstein, Ling Huang, Anthony D. Joseph, Steven J. Lee, Satish Rao, J. D. Tygar
Abstract: Classifiers are often used to detect miscreant activities. We study how an adversary can systematically query a classifier to elicit information that allows the attacker to evade detection while incurring a near-minimal cost of modifying their intended malfeasance. We generalize the theory of Lowd and Meek (2005) to the family of convex-inducing classifiers that partition their feature space into two sets, one of which is convex. We present query algorithms for this family that construct undetected instances of approximately minimal cost using only polynomially-many queries in the dimension of the space and in the level of approximation. Our results demonstrate that near-optimal evasion can be accomplished for this family without reverse engineering the classifier’s decision boundary. We also consider general ℓp costs and show that near-optimal evasion on the family of convex-inducing classifiers is generally efficient for both positive and negative convexity for all levels of approximation if p = 1. Keywords: query algorithms, evasion, reverse engineering, adversarial learning
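A workhorse primitive in this line of work is a membership-query line search: between a known undetected point and the detected target, binary search locates the decision boundary to precision ε with O(log(1/ε)) queries whenever the detected region meets the segment in a single interval (which convexity guarantees). A hedged sketch of that primitive only, not the paper's full algorithm; `classify` is a hypothetical black-box detector returning True for detected points:

```python
import numpy as np

def boundary_search(classify, x_neg, x_pos, tol=1e-6):
    """Binary search along the segment from x_neg (undetected) to x_pos
    (detected) for an undetected point near the boundary. Assumes the
    detected region intersects the segment in one interval touching x_pos."""
    x_neg, x_pos = np.asarray(x_neg, float), np.asarray(x_pos, float)
    lo, hi = 0.0, 1.0                        # interpolation parameter toward x_pos
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if classify((1 - mid) * x_neg + mid * x_pos):
            hi = mid                         # mid is detected: back off
        else:
            lo = mid                         # mid evades: move toward the target
    return (1 - lo) * x_neg + lo * x_pos     # undetected, near-minimal distance moved
```

With a convex detected set such as a unit ball, the search returns an undetected point at distance roughly 1 from the center along the chosen direction.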
3 0.51115513 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
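The duality phenomenon this abstract relies on can be written out; the following is the standard convex-analysis form (generic symbols chosen here for illustration, not copied from the paper):

```latex
% f is beta-strongly convex w.r.t. ||.||  <=>  f* is (1/beta)-strongly smooth
% w.r.t. the dual norm, where
f^\star(u) = \sup_{w}\,\{\langle u, w\rangle - f(w)\},
\qquad
\|u\|_\star = \sup_{\|w\| \le 1} \langle u, w\rangle,
% and strong smoothness of the conjugate means
f^\star(u+v) \;\le\; f^\star(u) + \langle \nabla f^\star(u),\, v\rangle
  + \tfrac{1}{2\beta}\,\|v\|_\star^2 .
```

Choosing the matrix norm in the regularizer then amounts to choosing which dual norm controls the smoothness of the conjugate, which is what drives the regret and generalization bounds.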
4 0.49253386 80 jmlr-2012-On Ranking and Generalization Bounds
Author: Wojciech Rejchel
Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process
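The empirical convex risk these estimators minimize is a U-statistic over ordered pairs of examples. A minimal sketch with the pairwise hinge surrogate for a linear scorer (the choice of surrogate and all names are illustrative, not fixed by the paper):

```python
import numpy as np

def pairwise_hinge_risk(w, X, y):
    """Empirical convex ranking risk of the linear scorer s(x) = <w, x>:
    average hinge loss over ordered pairs (i, j) with y_i > y_j,
    penalizing pairs the scorer fails to separate by margin 1."""
    s = X @ w
    total, count = 0.0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                total += max(0.0, 1.0 - (s[i] - s[j]))
                count += 1
    return total / max(count, 1)
```

A scorer that orders the data correctly with large margins drives this risk to zero, while the zero scorer pays the full margin penalty on every pair.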
5 0.47455338 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
Author: Tim van Erven, Mark D. Reid, Robert C. Williamson
Abstract: Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses. Keywords: mixability, multiclass, prediction with expert advice, proper loss, learning rates
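The binary-case characterization can be sanity-checked numerically. Assuming a reading of the abstract in which the mixability constant of a smooth proper binary loss is the worst-case ratio of the log-loss Bayes-risk curvature to the loss's own Bayes-risk curvature, square loss should recover the known constant 2 and log loss should give 1:

```python
import numpy as np

# Bayes risks L(p) for P(y = 1) = p:
#   log loss:    L(p) = -p log p - (1-p) log(1-p),  so -L''(p) = 1 / (p (1-p))
#   square loss: L(p) = p (1-p),                    so -L''(p) = 2
p = np.linspace(0.01, 0.99, 981)
curv_log = 1.0 / (p * (1 - p))
curv_sq = np.full_like(p, 2.0)

# Candidate mixability constants as worst-case curvature ratios vs. log loss.
eta_sq = np.min(curv_log / curv_sq)    # minimized at p = 1/2, giving 2
eta_log = np.min(curv_log / curv_log)  # log loss is 1-mixable
```

That both known constants come out right is consistent with the curvature characterization; the multiclass case in the paper replaces this scalar ratio with a maximum eigenvalue of a relative Hessian.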
6 0.4651075 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
7 0.4619942 111 jmlr-2012-Structured Sparsity and Generalization
8 0.43885434 82 jmlr-2012-On the Necessity of Irrelevant Variables
9 0.4123913 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
10 0.375875 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
11 0.36684716 73 jmlr-2012-Multi-task Regression using Minimal Penalties
12 0.35723874 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
13 0.32585603 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
14 0.32254487 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
15 0.29073259 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
16 0.27983394 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
17 0.25456691 81 jmlr-2012-On the Convergence Rate of ℓp-Norm Multiple Kernel Learning
18 0.2513285 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
19 0.23836434 54 jmlr-2012-Large-scale Linear Support Vector Regression
topicId topicWeight
[(7, 0.067), (21, 0.066), (26, 0.046), (27, 0.014), (29, 0.034), (35, 0.021), (49, 0.025), (56, 0.015), (59, 0.022), (64, 0.028), (69, 0.012), (75, 0.072), (77, 0.024), (79, 0.022), (85, 0.252), (92, 0.09), (96, 0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.74548447 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
2 0.57691717 97 jmlr-2012-Regularization Techniques for Learning with Matrices
3 0.56322193 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
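The pattern this abstract describes — each worker averages gradients over its local examples, the worker averages are combined (an all-reduce in a real deployment), and one step of the serial algorithm is taken with the combined gradient — can be simulated in a few lines. Plain gradient descent stands in for the generic serial update, and the workers here are just list shards, not processes:

```python
import numpy as np

def distributed_minibatch_step(w, worker_batches, grad, lr):
    """One round of the distributed mini-batch pattern. With equal-size
    worker batches, the average of worker averages equals the gradient of
    the full mini-batch, so the step matches the serial algorithm's."""
    worker_grads = [np.mean([grad(w, x, y) for x, y in batch], axis=0)
                    for batch in worker_batches]
    g = np.mean(worker_grads, axis=0)
    return w - lr * g

# Demo: least squares on synthetic data, 4 workers with equal-size shards.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(40, 2))
Y = X @ w_true
data = list(zip(X, Y))
batches = [data[i::4] for i in range(4)]       # 4 shards of 10 examples each

grad = lambda w, x, y: (w @ x - y) * x         # squared-loss gradient
w = np.zeros(2)
for _ in range(300):
    w = distributed_minibatch_step(w, batches, grad, lr=0.1)
```

The regret analysis in the paper concerns the harder streaming setting with communication latency; this sketch only shows the gradient-aggregation mechanics.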
4 0.56186986 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a low-dimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
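For the ℓ1 case mentioned parenthetically, the manifold is the zero pattern of the iterate, and dual averaging identifies it by exact soft-thresholding of the running average gradient. A sketch of an RDA-style update with simplified constants (linear losses, a precomputed gradient stream), not the paper's local-phase method:

```python
import numpy as np

def rda_l1(grad_stream, dim, lam=0.1, gamma=1.0):
    """l1-regularized dual averaging: maintain the running average
    gradient g_bar and soft-threshold it. Coordinates with |g_bar| <= lam
    are set exactly to zero, which is how the sparsity manifold can be
    identified before the iterates converge."""
    g_bar = np.zeros(dim)
    w = np.zeros(dim)
    for t, g in enumerate(grad_stream, start=1):
        g_bar += (g - g_bar) / t                   # running average gradient
        shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
        w = -(np.sqrt(t) / gamma) * shrunk         # closed-form RDA update
    return w
```

Note the thresholding produces exact zeros (not merely small values), unlike plain stochastic subgradient steps, which is the property the identification result exploits.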
5 0.55530965 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K, be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves O(√T) regret bound and O(T^{3/4}) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T^{3/4}) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T^{2/3}) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
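The convex-concave reformulation this abstract describes can be sketched as gradient descent on the decision variable and ascent on a Lagrange multiplier, so the constraint need only hold on average over the horizon rather than per round. A simplified single-constraint illustration with generic step sizes, not the paper's exact algorithm or rates:

```python
import numpy as np

def online_longterm_constraints(loss_grads, g, g_grad, dim, eta=0.05):
    """Primal-dual online gradient sketch for a long-term constraint
    g(w) <= 0. Instead of projecting onto {w : g(w) <= 0} each round,
    descend on the Lagrangian in w and ascend in the multiplier mu."""
    w = np.zeros(dim)
    mu = 0.0
    iterates = []
    for grad_f in loss_grads:
        w = w - eta * (grad_f(w) + mu * g_grad(w))   # primal descent
        mu = max(0.0, mu + eta * g(w))               # dual ascent, mu >= 0
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)                 # average iterate
```

With fixed strongly convex losses and a ball constraint, the averaged iterate lands near the constrained optimum even though no individual round is projected.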
6 0.54600132 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
7 0.54412669 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.53963447 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
9 0.53878063 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
10 0.53672886 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
11 0.53589404 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.53439409 73 jmlr-2012-Multi-task Regression using Minimal Penalties
13 0.53433126 103 jmlr-2012-Sampling Methods for the Nyström Method
14 0.53289533 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
15 0.53169316 111 jmlr-2012-Structured Sparsity and Generalization
16 0.53137374 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.52940714 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
18 0.52900696 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
19 0.52886903 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
20 0.52663445 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems