jmlr jmlr2012 jmlr2012-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function, so as to rely on efficient convex optimizers with nice guarantees such as the absence of local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively, one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization, and rely instead on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems, which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought of as a limited-memory extension of convex regularized bundle methods for dealing with convex and non-convex risks. In case the risk is convex, the algorithm is proved to converge to a stationary solution with accuracy ε at a rate of O(1/λε), where λ is the regularization parameter of the objective function, under the assumption of a Lipschitz empirical risk. In case the risk is not convex, getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems and on five standard and difficult machine learning problems, cast as convex and non-convex optimization problems, which show that our algorithm compares well in practice with state-of-the-art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. [sent-12, score-0.69]
2 First, in Section 2 we provide background on the cutting plane technique and on bundle methods, and we describe two main existing extensions, the convex regularized bundle method (CRBM) and the non-convex bundle method (NBM). [sent-100, score-1.347]
3 Background on Cutting Plane and Bundle Methods. We now provide some background on the cutting plane principle and on the optimization methods that have been built on this idea for convex and non-convex objective functions. [sent-108, score-0.772]
4 For a given function f(w), a cutting plane c_{w′}(w) is a first-order Taylor approximation computed at a particular point w′: f(w) ≈ c_{w′}(w) = f(w′) + ⟨a_{w′}, w − w′⟩, where a_{w′} ∈ ∂f(w′) is a subgradient of f at w′. [sent-111, score-0.68]
5 Importantly, a cutting plane of a convex function f is an underestimator of f . [sent-120, score-0.832]
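To make the definition in sentence 4 concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) that builds a cutting plane of a convex function at a point w′ and checks numerically the underestimator property stated just above; the toy quadratic function and all names are mine.

```python
# Build a cutting plane c_{w'}(w) = f(w') + <a_{w'}, w - w'> at w' and check that
# it never exceeds f (underestimator property for cutting planes of convex functions).
import numpy as np

f = lambda w: float(w @ w)       # toy convex function f(w) = ||w||^2
grad = lambda w: 2.0 * w         # its gradient, a valid subgradient a_{w'}

w_prime = np.array([1.0, -2.0])
a = grad(w_prime)
cutting_plane = lambda w: f(w_prime) + float(a @ (w - w_prime))

rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.normal(size=2) * 5.0
    assert cutting_plane(w) <= f(w) + 1e-9   # holds because f is convex
```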
6 Figure 1: Basic approximation of a function f by an (underestimator) cutting plane at a point w′ (left), and a more accurate approximation obtained by taking the maximum over many cutting planes of f (right). [sent-121, score-1.172]
7 In the case of a convex objective, any cutting plane of the objective f is an underestimator of f . [sent-124, score-0.896]
8 The idea of the cutting plane method is that one can build an accurate approximation function (named g_t hereafter) of f, which is also an underestimator of f, as the maximum over many cutting plane approximations built at different points {w_1, …, [sent-125, score-1.663]
9 …, w_t} as follows: f(w) ≈ g_t(w) = max_{j=1,…,t} ⟨a_{w_j}, w⟩ + b_{w_j}. [sent-128, score-0.685]
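As a small companion sketch (mine, with hypothetical array names), evaluating this piecewise-linear approximation at a point only requires the stored slopes and offsets of the bundle:

```python
# Evaluate g_t(w) = max_j <a_j, w> + b_j from a bundle stored as
# A (rows a_j, shape (t, d)) and B (offsets b_j, shape (t,)).
import numpy as np

def g_piecewise_linear(w, A, B):
    return float(np.max(A @ w + B))
```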
10 The cutting plane method aims at iteratively building an increasingly accurate piecewise linear underestimator of the objective function by successively adding new cutting planes to the approximation g of f . [sent-133, score-1.37]
11 At every iteration, one adds a new cutting plane underestimator built at the current solution, yielding a new piecewise-linear underestimator of f as in Equation 3. [sent-135, score-0.996]
12 … gap_t = [min_{j=1,…,t} f(w_j)] − g_t(w_{t+1}); 7: if gap < ε then return w_t; 8: end for. Algorithm 2 Convex Regularized Bundle Method (CRBM): 1: Input: w_1, R, ε; 2: Output: w*; 3: for t = 1 to ∞ do; 4: compute a_{w_t} and b_{w_t} of R at w_t; 5: w_t* = argmin_{w∈{w_1,…, [sent-146, score-1.095]
13 …,w_t}} f(w); 6: w̃_t ← argmin_w g_t(w), where g_t(w) is defined as in Equation 6; 7: gap_t = f(w_t*) − g_t(w̃_t); 8: w_{t+1} = w̃_t; 9: if gap_t < ε then return w_t*; 10: end for. [sent-149, score-2.076]
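To make Algorithm 2 concrete, here is a minimal numpy/scipy sketch of the CRBM loop; it is my own illustration under simplifying assumptions, not the authors' implementation. The inner problem min_w (λ/2)‖w‖² + max_j ⟨a_j, w⟩ + b_j is solved through its dual over the simplex, with scipy's generic SLSQP solver standing in for a dedicated QP solver; the names crbm, risk and subgrad are hypothetical.

```python
# Sketch of CRBM (Algorithm 2) for f(w) = lam/2 ||w||^2 + R(w), with R convex.
import numpy as np
from scipy.optimize import minimize

def crbm(risk, subgrad, w1, lam, eps=1e-4, max_iter=100):
    """risk(w) -> R(w); subgrad(w) -> some a in the subdifferential of R at w."""
    f = lambda w: 0.5 * lam * float(w @ w) + risk(w)
    w = np.asarray(w1, dtype=float)
    A, B = [], []                          # cutting planes of R: <a_j, w> + b_j
    best_w, best_f = w.copy(), f(w)
    for t in range(1, max_iter + 1):
        a = np.asarray(subgrad(w), dtype=float)   # new plane built at the current w
        A.append(a)
        B.append(risk(w) - float(a @ w))
        if f(w) < best_f:                  # track the best observed solution w_t^*
            best_w, best_f = w.copy(), f(w)
        At, Bt = np.array(A), np.array(B)
        # dual of min_w g_t(w): maximize -1/(2 lam) ||alpha @ At||^2 + alpha @ Bt on the simplex
        dual_neg = lambda al: (0.5 / lam) * float(np.sum((al @ At) ** 2)) - float(al @ Bt)
        res = minimize(dual_neg, np.full(t, 1.0 / t), method="SLSQP",
                       bounds=[(0.0, 1.0)] * t,
                       constraints={"type": "eq", "fun": lambda al: al.sum() - 1.0})
        alpha = res.x
        w = -(alpha @ At) / lam            # primal solution recovered from the dual
        gap = best_f - (-dual_neg(alpha))  # gap_t = f(w_t^*) - g_t(w~_t)
        if gap < eps:
            return best_w
    return best_w
```

For instance, with risk(w) = ‖w‖₁ and subgrad = np.sign, the gap closes in a handful of iterations around the optimum w = 0.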
14 The approximation function becomes: f(w) ≈ g_t(w) = (w − w_t)^⊤ H_t (w − w_t) + max_{j=1,…,t} ⟨a_{w_j}, w⟩ + b_{w_j}. [sent-150, score-1.164]
15 Note that this quadratic approximation of f(w) is more accurate than a cutting plane approximation of f(w). [sent-162, score-0.716]
16 (cf. Equation 5.) CRBM is very similar to the cutting plane technique described before, where at every iteration a new cutting plane approximation is built (at the current solution) and added to the current approximation function. [sent-165, score-1.384]
17 (6) and the approximation problem is w̃_t = argmin_w g_t(w) = argmin_w { (λ/2)‖w‖² + max_{j=1,…,t} ⟨a_{w_j}, w⟩ + b_{w_j} } (7) [sent-168, score-0.73]
18 where ⟨a_{w_j}, w⟩ + b_{w_j} is the cutting plane approximation of R built at w_j, the solution at iteration j. [sent-170, score-0.868]
19 Importantly, if R(w) is convex then any cutting plane ⟨a_{w_j}, w⟩ + b_{w_j} is an underestimator of R(w), and so is their maximum, max_{j=1,…,t} ⟨a_{w_j}, w⟩ + b_{w_j}. [sent-171, score-0.935]
20 Let α_t be the solution of the above dual problem at iteration t; the solution of the primal problem is given by: w̃_t = −(α_t A_t)/λ and g_t(w̃_t) = −(1/(2λ))‖α_t A_t‖² + α_t B_t. [sent-209, score-0.738]
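As a reading aid (my own reconstruction, not text from the paper), these two formulas come from exchanging min and max in the regularized approximation problem; I assume that αA_t denotes the α-weighted sum of the cutting-plane slopes, αB_t the weighted sum of the offsets, and Δ_t the probability simplex over the t planes:

```latex
\min_w g_t(w)
  = \min_w \max_{\alpha \in \Delta_t} \frac{\lambda}{2}\|w\|^2
      + \sum_{j=1}^{t}\alpha_j\bigl(\langle a_{w_j}, w\rangle + b_{w_j}\bigr)
  = \max_{\alpha \in \Delta_t} \min_w \frac{\lambda}{2}\|w\|^2
      + \langle \alpha A_t, w\rangle + \alpha B_t .
```

Setting the gradient λw + αA_t of the inner problem to zero gives w = −(αA_t)/λ and the dual value −(1/(2λ))‖αA_t‖² + αB_t; evaluating these at the dual optimum α_t yields exactly the two expressions of sentence 20.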
21 Basically, NBM works similarly to standard bundle methods by iteratively building an approximation function via the cutting plane technique. [sent-222, score-0.875]
22 The approximation problem is an instance of quadratic programming similar to the one in Equation 4, except that the raw cutting planes are adjusted to make sure that the approximation is a local underestimator of the objective function. [sent-228, score-0.844]
23 The set of cutting planes is expanded with the new cutting plane built at w_{t+1}. [sent-250, score-1.112]
24 Importantly, note that one gets more cutting planes in the bundle as the algorithm iterates, and such an ever-increasing number of cutting planes may represent a potential problem with respect to memory and computational cost. [sent-252, score-1.17]
25 To overcome such a problem, one usually uses an aggregated cutting plane in order to accumulate the information of all the cutting planes from previous iterations (Kiwiel, 1985). [sent-254, score-1.17]
26 For instance, one may keep a fixed number of cutting planes in the bundle Bt by removing the oldest cutting plane. [sent-256, score-1.073]
27 The aggregated cutting plane then allows preserving part of the information brought by the removed cutting planes. [sent-257, score-1.057]
28 In the non-convex case, the standard approximation function, which is defined as the maximum over a set of cutting plane approximations, is no longer an underestimator of the objective function. [sent-261, score-0.896]
29 In addition, although one may reasonably assume that a cutting plane built at a point w′ is an accurate approximation of f in a small region around w′, such an approximation may become very poor for w far from w′. [sent-262, score-0.728]
30 Figure 3: Cutting planes and linearization errors; panels (a) conflict, (b) bad approximation, (c) adjusting the cutting plane. [sent-272, score-0.786]
31 Condition (9) ensures that if the linearization error, f(w′′) − c_{w′}(w′′), is negative then the cutting plane has to be lowered by at least twice the amount required to bring the linearization error to zero. [sent-274, score-0.696]
32 In other words, in the case of negative linearization error at w′′ , the cutting plane is adjusted so that the new linearization error is positive, with at least the same magnitude as the “old” negative linearization error. [sent-275, score-0.735]
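The following small sketch (mine, not the paper's algorithm) implements this adjustment literally: when the linearization error at the reference point is negative, the offset is lowered by exactly twice that amount, which is the minimum allowed by condition (9) and makes the new error positive with the same magnitude.

```python
import numpy as np

def adjust_offset(a, b, w_ref, f_ref):
    """Cutting plane c(w) = <a, w> + b; w_ref plays the role of w'' and f_ref = f(w'')."""
    err = f_ref - (float(a @ w_ref) + b)   # linearization error at w''
    if err < 0.0:
        b = b + 2.0 * err                  # lower the offset by 2|err| (err is negative)
    return b

# toy check: c(2) = 3 while f(2) = 2.5, so err = -0.5 and the offset drops from 1.0 to 0.0
assert adjust_offset(np.array([1.0]), 1.0, np.array([2.0]), 2.5) == 0.0
```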
33 At every iteration a new cutting plane is added to the bundle, so that the size of the bundle at iteration t is t. [sent-283, score-1.148]
34 The extension of CRBM to non-convex functions is not straightforward since, as we already observed when presenting NBM, the cutting plane approximation does not yield an underestimator of the objective function. [sent-293, score-0.896]
35 On one hand, we use standard techniques such as the introduction of a locality measure and the adjustment of cutting planes in order to build a local underestimator of the function at a given point. [sent-295, score-0.744]
36 Actually, the dual program of the approximation problem minimization in Equation 8 has a memory cost of O(tD + t²) for storing all the cutting planes and the matrix of dot products between the cutting planes' normal vectors. [sent-302, score-0.948]
37 It is based on the use of a cutting plane aggregation method which makes it possible to drastically limit the number of CPs in the working set, at the price of a less accurate underestimator approximation. [sent-312, score-0.821]
38 7: w_t* = argmin_{w∈{w_1,…,w_t}} f(w); 8: J_t ← UpdateWorkingSet(J_{t−1}, t, M); 9: [w̃_t, c̃_t] ← Minimize g_t(w) in Equation 11; 10: gap_t = f(w_t*) − g_t(w̃_t); 11: if gap_t < ε then return w_t*; 12: end for. [sent-325, score-1.133]
39 3.1 Limited Memory for the Convex Case. Our goal here is to limit the number of cutting planes used in the approximation function, which can be done by removing some of the previous cutting planes when the number of cutting planes reaches a given limit. [sent-326, score-1.49]
40 Our proposal is to apply a similar technique to the set of cutting plane approximations of the risk function R, yielding an aggregated cutting plane. [sent-329, score-0.968]
41 Interestingly, we can show that if such an aggregated cutting plane is included in the approximation function, then one can remove any (or even all) previous cutting plane(s) while preserving the theoretical convergence rate of O(1/λε) iterations of CRBM. [sent-330, score-1.142]
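A minimal sketch (mine, not the authors' code) of how such an aggregated plane can be formed: the dual weights α returned by the approximation problem define one summary plane for the planes being dropped.

```python
import numpy as np

def aggregate_plane(A, B, alpha):
    """A: (t, d) rows a_j, B: (t,) offsets b_j, alpha: (t,) dual weights on the simplex."""
    return alpha @ A, float(alpha @ B)     # aggregated slope and offset
```

Since each ⟨a_j, w⟩ + b_j underestimates a convex R and α lies on the simplex, the aggregated plane is again an underestimator of R, which is why dropping planes that have been summarized this way does not break the bound.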
42 where J_t ⊆ {1,…,t} stands for the working set of active cutting plane indexes kept at iteration t, and c̃_{t−1}(w) = ⟨ã_{t−1}, w⟩ + b̃_{t−1} is the aggregated cutting plane which accumulates information from the previous cutting planes c_1, …, c_{t−1}. [sent-333, score-1.713]
43 At iteration t, a new cutting plane is added to the current set of cutting planes J_{t−1}, but if J_{t−1} is full (i.e., its size has reached the limit M), some previous cutting planes must be removed. [sent-342, score-1.13]
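One plausible UpdateWorkingSet policy, written as a sketch under my own assumptions (the text only says that a fixed number of planes can be kept, e.g., by dropping the oldest, with the dropped information surviving in the aggregated plane):

```python
def update_working_set(J_prev, t, M):
    """Keep at most M active cutting plane indexes: add the new index t, drop the oldest."""
    J = list(J_prev) + [t]
    while len(J) > M:
        J.pop(0)
    return J
```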
44 Figure 4: Quadratic underestimator of g_t(w) (solid line) and the corresponding aggregated cutting plane c̃_t(w) (dashed line). [sent-348, score-1.1]
45 The use of an aggregated cutting plane is key to limiting the storage requirements and the computational effort per iteration. [sent-353, score-0.68]
46 At iteration t of Algorithm 3, the cutting plane aggregation c̃_t(w) is derived from the minimization of g_t(w). [sent-364, score-0.931]
47 We use the cutting plane technique to build an underestimator of g_t(w) at its minimum w̃_t = argmin_w g_t(w). [sent-365, score-1.56]
48 Figure 4 illustrates the quadratic function (red dashed line) derived from the aggregated cutting plane at iteration t = 2. [sent-368, score-0.746]
49 The cutting plane c̃_t(w) can be defined based on the dual solution of the approximation problem, which may be characterized in primal and dual forms. [sent-369, score-0.838]
50 Proof. First, by construction we have w̃_t = −ã_t/λ, which implies that the derivative of (λ/2)‖w‖² + c̃_t(w) is null at w̃_t. [sent-388, score-0.89]
51 Actually: g_t(w̃_t) = −(1/(2λ))‖α_t A_t‖² + α_t B_t = −(1/(2λ))‖ã_t‖² + b̃_t = (λ/2)‖ã_t/λ‖² − ⟨ã_t, ã_t/λ⟩ + b̃_t = (λ/2)‖w̃_t‖² + ⟨ã_t, w̃_t⟩ + b̃_t. [sent-390, score-1.742]
52 Hence g_t(w̃_t) = min_w { (λ/2)‖w‖² + c̃_t(w) } (12). In other words, the quadratic function (λ/2)‖w‖² + c̃_t(w) and the approximation function g_t(w) reach the same minimum value g_t(w̃_t) at the same point w̃_t. [sent-391, score-0.929]
53 Let h_t(w) = max( max_{j∈J_t} ⟨a_j, w⟩ + b_j , ⟨ã_{t−1}, w⟩ + b̃_{t−1} ) be the piecewise linear approximation of R(w) at iteration t; we have 0 ∈ ∂g_t(w̃_t) ≡ λw̃_t + ∂h_t(w̃_t), since w̃_t is the optimum solution of the minimization of g_t(w). [sent-393, score-0.729]
54 Furthermore, since g_t(w̃_t) = (λ/2)‖w̃_t‖² + h_t(w̃_t), Equation 12 gives: ⟨ã_t, w̃_t⟩ + b̃_t = h_t(w̃_t). [sent-396, score-1.202]
55 The cutting plane c̃_t(w) is then an underestimator of h_t(w) built at w̃_t (recall that h_t(w) is convex), and thus (λ/2)‖w‖² + c̃_t(w) is a quadratic underestimator of g_t(w) = (λ/2)‖w‖² + h_t(w). [sent-397, score-1.842]
56 Note that since (λ/2)‖w‖² + c̃_t(w) is an underestimator of g_t(w) and g_t(w) is an underestimator of f(w) at w_t*, the quadratic function (λ/2)‖w‖² + c̃_t(w) is also an underestimator of f(w) at w_t*. [sent-398, score-1.037]
57 We have to distinguish between a raw linear cutting plane of the risk, c_{w_j} (with c_{w_j}(w) = ⟨a_{w_j}, w⟩ + b_{w_j}), that is built at a particular iteration j of the algorithm, and the possibly modified versions of this cutting plane that might be used in later iterations. [sent-407, score-1.581]
58 At iteration t we denote by c_j^t (with c_j^t(w) = ⟨a_j, w⟩ + b_j^t) the cutting plane derived from c_{w_j}, the raw CP originally built at iteration j. [sent-409, score-0.938]
59 Similarly to non-convex bundle methods, we define a locality measure which is associated with any active cutting plane. [sent-416, score-0.69]
60 It is related to the locality measure between the cutting plane (actually the point where the cutting plane was built) and the current best observed solution. [sent-417, score-1.307]
61 We denote by s_j^t the locality measure between cutting plane c_j^t and the best observed solution up to iteration t, w_t*. [sent-418, score-0.873]
62 The full bundle information is: B_t = {c_j^t, s_j^t}_{j∈J_t} ∪ {c̃_{t−1}, s̃_{t−1}}, where c̃_{t−1} is the aggregated cutting plane and s̃_{t−1} is its locality measure to the best observed solution w_t*. [sent-419, score-1.23]
63 The aggregated CP c̃_{t−1} can be viewed as a convex combination of the CPs of previous iterations. [sent-421, score-0.678]
64 Note that at iteration t = 1 there is only one cutting plane c_1 and the aggregated cutting plane is also c_1: [c̃_1, s̃_1] = [c_1, s_1]. [sent-437, score-1.336]
65 (they coincide), with their corresponding locality measures to the best solution w_1 (s_1 and s̃_1). Iteration t. At every iteration the algorithm determines a new bundle B_t, the best observed solution up to iteration t, w_t*, and the new current (and temporary) solution w_t. [sent-440, score-0.945]
66 At iteration t > 1, a few steps are successively performed: • build a new cutting plane at w̃_{t−1}, the minimizer of the approximation function of the previous iteration (g_{t−1}(w)). [sent-441, score-0.749]
67 Rather than keeping all the cutting planes built at w_1, …, w_{t−1}, we use a special aggregated cutting plane c̃_{t−1} for gathering the information of the previous cutting planes up to iteration t − 1. [sent-451, score-1.064]
68 Note that a side effect of this minimization is the definition of a new aggregated cutting plane and its locality measure to the best observed solution. [sent-456, score-0.771]
69 Locality Measure and Conditions on CPs. Given a set of cutting plane approximations of R, one could build a local underestimator of f in the vicinity of w by lowering the CPs that yield a non-positive linearization error of f at w. [sent-465, score-0.868]
70 We propose to define the locality measure between a cutting plane previously built at iteration j and the current best solution wt∗ based on the trajectory from w j to wt∗ . [sent-469, score-0.811]
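A tiny sketch of such a trajectory-based measure (my own reading: the measure attached to a plane is accumulated along the path of successive best solutions; the λ/2 scaling is an assumption, chosen for consistency with the aggregated-plane update quoted later in this section):

```python
import numpy as np

def accumulate_locality(s_prev, w_best_new, w_best_old, lam):
    """Add to s_prev the contribution of one step of the best-solution trajectory."""
    step = np.asarray(w_best_new, dtype=float) - np.asarray(w_best_old, dtype=float)
    return s_prev + 0.5 * lam * float(step @ step)
```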
71 (15) Unfortunately, if a cutting plane is lowered too much, the minimum of the approximation function is no longer guaranteed to improve at every iteration. [sent-489, score-0.74]
72 It concerns only the newly added cutting plane and reads: (λ/2)‖w_t‖² + ⟨a_t^t, w_t⟩ + b_t^t ≥ f(w_t*). [sent-493, score-1.622]
73 In other words, we need to ensure that the approximation at w_t using the newly added cutting plane is greater than or equal to the best observed function value. [sent-494, score-1.087]
74 The condition can be seen as a lower bound on the modified offset: b_t^t ≥ f(w_t*) − (λ/2)‖w_t‖² − ⟨a_t^t, w_t⟩. [sent-496, score-1.014]
75 It takes as input: • the bundle at the previous iteration; • the best observed solution at the previous iteration, w_{t−1}*; • the best observed solution at the current iteration, w_t*; • the current solution w_t and its corresponding raw cutting plane, c_{w_t}. [sent-500, score-1.701]
76 The algorithm is designed so that at the end of iteration t, all (|J_t| + 1) cutting planes in the bundle (i.e., [sent-501, score-0.744]
77 the |J_t| “normal” cutting planes and the aggregated cutting plane) satisfy the condition in Equation 15, while the newly added cutting plane c_t^t also satisfies the condition in Equation 16. [sent-503, score-1.624]
78 In the case of a descent step, condition (16) is trivially satisfied for the newly added cutting plane since c_t^t ≡ c_{w_t}. [sent-512, score-0.757]
79 A similar modification may be applied to the aggregated cutting plane: b̃_{t−1}^t = min[ b̃_{t−1}, R(w_t*) − ⟨ã_{t−1}, w_t*⟩ − s̃_{t−1}^t ], where s̃_{t−1}^t = s̃_{t−1}^{t−1} + (λ/2)‖w_t* − w_{t−1}*‖². [sent-515, score-0.691]
80 Since the cutting planes in the bundle already satisfy the conditions with respect to w_{t−1}*, the best solution at the previous iteration, a conflict (if any) may only arise between the new cutting plane c_{w_t} and the best observed solution w_t*. [sent-523, score-0.732]
81 Algorithm 6 modifies c_t^t in such a way that it guarantees that the new cutting plane c_t^t, with parameters a_t^t and b_t^t, satisfies conditions (15) and (16). [sent-526, score-0.94]
82 Indeed, conditions (15) and (16) may be rewritten as: b_t^t ≤ R(w_t*) − ⟨a_{w_t}, w_t*⟩ − s_t^t = U and b_t^t ≥ f(w_t*) − (λ/2)‖w_t‖² − ⟨a_{w_t}, w_t⟩ = L (17), which define an upper bound U and a lower bound L for b_t^t. [sent-528, score-1.418]
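The sketch below (mine, not the authors' exact Algorithm 6) shows one way to act on these bounds: clip the raw offset into [L, U] when the interval is non-empty, and otherwise fall back to the artificial plane a_t^t = −λ w_t* whose construction is detailed a few sentences below; the names and the clipping rule itself are assumptions consistent with Equation 17.

```python
import numpy as np

def fit_new_plane(a_raw, b_raw, w_cur, w_best, f_best, R_best, s_new, lam):
    """a_raw, b_raw: raw plane built at w_cur; w_best = w_t^*, f_best = f(w_t^*), R_best = R(w_t^*)."""
    U = R_best - float(a_raw @ w_best) - s_new                            # upper bound from (15)
    L = f_best - 0.5 * lam * float(w_cur @ w_cur) - float(a_raw @ w_cur)  # lower bound from (16)
    if L <= U:
        return a_raw, min(max(b_raw, L), U), s_new      # clip the offset into [L, U]
    # conflict: plane whose regularized quadratic is minimized at w_best (see below)
    a = -lam * np.asarray(w_best, dtype=float)
    b = f_best - 0.5 * lam * float(w_cur @ w_cur) - float(a @ w_cur)
    s = 0.5 * lam * float((w_cur - w_best) @ (w_cur - w_best))
    return a, b, s
```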
83 The quadratic approximation corresponding to the cutting plane c_{w_t} is plotted in orange; it is not a local underestimator of f(w) at w_t*. [sent-533, score-0.906]
84 This quadratic function is defined so that it reaches its minimum at w_t* and the linearization error of the cutting plane ⟨a_t^t, w⟩ + b_t^t at w_t* is (λ/2)‖w_t − w_t*‖² (see the orange quadratic curve in Figure 6, bottom left). [sent-542, score-1.306]
85 The new cutting plane is defined as: c_t^t(w) = ⟨a_t^t, w⟩ + b_t^t, with a_t^t = −λ w_t*, b_t^t = f(w_t*) − (λ/2)‖w_t‖² − ⟨a_t^t, w_t⟩, and s_t^t = (λ/2)‖w_t − w_t*‖². [sent-543, score-1.909]
86 It also satisfies condition (15), as we show now: ⟨a_t^t, w_t*⟩ + b_t^t = ⟨a_t^t, w_t*⟩ + f(w_t*) − (λ/2)‖w_t‖² − ⟨a_t^t, w_t⟩ = R(w_t*) + ⟨a_t^t, w_t* − w_t⟩ + (λ/2)(‖w_t*‖² − ‖w_t‖²) = R(w_t*) + ⟨a_t^t + (λ/2)(w_t* + w_t), w_t* − w_t⟩, where we used the definition of the objective function f(w_t*) = (λ/2)‖w_t*‖² + R(w_t*). [sent-545, score-2.814]
87 Substituting a_t^t = −λ w_t* then gives ⟨a_t^t, w_t*⟩ + b_t^t = R(w_t*) − (λ/2)‖w_t* − w_t‖² = R(w_t*) − s_t^t, and the condition in Equation 15 is satisfied. [sent-547, score-0.916]
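As a quick sanity check of this reconstruction (mine, not a quotation from the paper), condition (16) also holds for this plane, with equality:

```latex
\frac{\lambda}{2}\|w_t\|^2 + \langle a_t^t, w_t\rangle + b_t^t
  = \frac{\lambda}{2}\|w_t\|^2 + \langle a_t^t, w_t\rangle
    + f(w_t^*) - \frac{\lambda}{2}\|w_t\|^2 - \langle a_t^t, w_t\rangle
  = f(w_t^*) .
```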
88 Figure 7: Quadratic underestimator of g_t(w) derived from the aggregated cutting plane c̃_t(w). [sent-548, score-1.1]
89 Figure 7 illustrates the quadratic function (in orange) derived from the aggregated cutting plane at iteration t = 2. [sent-552, score-0.746]
90 Hence the definition of the aggregated cutting plane follows: w̃_t = −ã_t/λ, with ã_t = α_t A_t and b̃_t = α_t B_t. [sent-571, score-1.26]
91 The aggregated CP c̃_t accumulates information from many cutting planes built at different points, so that one cannot immediately define a locality measure s̃_t between c̃_t and the current best observed solution w_t*. [sent-573, score-0.935]
92 However, c̃_t being a convex combination of cutting planes, we chose to define s̃_t as the corresponding convex combination of the locality measures associated with those cutting planes: s̃_t = Σ_{j∈J_t} α_j s_j^t + α̃ s̃_{t−1}^t. [sent-574, score-1.127]
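In code, this convex-combination rule is a one-liner (again a sketch with hypothetical names):

```python
import numpy as np

def aggregate_locality(alpha, s, alpha_prev_agg, s_prev_agg):
    """alpha, s: dual weights and locality measures of the active planes;
    alpha_prev_agg, s_prev_agg: weight and measure of the previous aggregated plane."""
    return float(np.asarray(alpha) @ np.asarray(s)) + alpha_prev_agg * s_prev_agg
```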
93 If the search direction is not a descent direction, then the line search returns the best solution along the search direction (which should be close to the current solution), and this point will be used to build a new cutting plane in the next iteration. [sent-597, score-0.696]
94 Proof. We focus on deriving a lower bound on the minimum value of g_t(w) based solely on the aggregated cutting plane and on the newly added cutting plane at iteration t. [sent-640, score-1.686]
95 Note that this is possible since the aggregated cutting plane accumulates information about the approximation problem at previous iterations. [sent-642, score-0.725]
96 Hence the linear factor may be rewritten as: l = ⟨ã_{t−1} − a_t, ã_{t−1}/λ⟩ + b_t − b̃_{t−1} = ⟨a_t, w_t⟩ + b_t − ⟨ã_{t−1}, w_t⟩ − b̃_{t−1} = ((λ/2)‖w_t‖² + ⟨a_t, w_t⟩ + b_t) − ((λ/2)‖w_t‖² + ⟨ã_{t−1}, w_t⟩ + b̃_{t−1}) = f(w_t) − g_{t−1}(w_t). [sent-649, score-2.028]
97 Proof. We have g_t(w_t*) = f(w_t*) since the approximation errors are zero at the points where the cutting planes were built. [sent-671, score-0.801]
98 Recall that there is a NullStep2 at iteration t if and only if the raw cutting plane built at the current solution w_t is not compatible with the best observed solution w_t*. [sent-700, score-1.188]
99 Theorem 4. If gap_t = 0 at iteration t of Algorithm 4, then w_t* = w̃_t and w_t* is a stationary point of the objective function f, i.e., 0 ∈ ∂f(w_t*). [sent-714, score-0.726]
100 We have built on ideas from Convex Regularized Bundle Methods and from Non-Convex Bundle Methods, exploiting the regularization term of the objective and using a particular design of the aggregated cutting plane to build limited-memory variants. [sent-1710, score-0.846]
wordName wordTfidf (topN-words)
[('wt', 0.434), ('cutting', 0.377), ('nbm', 0.307), ('nrbmls', 0.291), ('plane', 0.231), ('bundle', 0.222), ('nrbm', 0.219), ('underestimator', 0.179), ('gapt', 0.155), ('gt', 0.148), ('bt', 0.146), ('eval', 0.116), ('lbfgs', 0.116), ('onvex', 0.112), ('crbm', 0.109), ('planes', 0.097), ('ct', 0.093), ('cw', 0.092), ('undle', 0.092), ('locality', 0.091), ('isks', 0.088), ('rti', 0.088), ('jt', 0.085), ('sgd', 0.082), ('res', 0.08), ('egularized', 0.075), ('aggregated', 0.072), ('cp', 0.07), ('objective', 0.064), ('cwt', 0.056), ('bw', 0.055), ('ethods', 0.052), ('cccp', 0.051), ('st', 0.048), ('iteration', 0.048), ('stj', 0.048), ('aw', 0.048), ('gap', 0.047), ('convex', 0.045), ('approximation', 0.045), ('ctj', 0.044), ('ict', 0.044), ('cps', 0.037), ('crf', 0.037), ('linearization', 0.036), ('svmstruct', 0.034), ('solution', 0.034), ('ocr', 0.034), ('memory', 0.034), ('aggregation', 0.034), ('awt', 0.032), ('kiwiel', 0.032), ('risks', 0.032), ('icts', 0.031), ('built', 0.03), ('epochs', 0.028), ('neurocrf', 0.028), ('universvm', 0.028), ('regularized', 0.028), ('chained', 0.027), ('tsvm', 0.027), ('subgradient', 0.027), ('sg', 0.025), ('stationary', 0.025), ('transductive', 0.025), ('optimization', 0.025), ('deep', 0.025), ('yes', 0.025), ('convergence', 0.024), ('btj', 0.024), ('offset', 0.024), ('reaches', 0.023), ('minimum', 0.023), ('primal', 0.022), ('null', 0.022), ('ht', 0.02), ('cputime', 0.02), ('makela', 0.02), ('stoping', 0.02), ('argminw', 0.02), ('limited', 0.02), ('reach', 0.02), ('margin', 0.019), ('ad', 0.019), ('lagrange', 0.019), ('hidden', 0.019), ('optimizers', 0.019), ('adjusted', 0.019), ('linesearch', 0.018), ('dual', 0.018), ('quadratic', 0.018), ('search', 0.018), ('regularization', 0.018), ('optimizer', 0.017), ('variant', 0.017), ('dedicated', 0.017), ('iterations', 0.016), ('cdhmm', 0.016), ('cessent', 0.016), ('lowered', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
2 0.14359429 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
3 0.13732418 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
4 0.13244295 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
5 0.12879138 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
6 0.1254724 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
7 0.11688011 97 jmlr-2012-Regularization Techniques for Learning with Matrices
8 0.070209369 84 jmlr-2012-Online Submodular Minimization
9 0.05916959 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
10 0.057498354 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
11 0.057021528 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
12 0.03800549 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
13 0.033803597 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
14 0.032828815 54 jmlr-2012-Large-scale Linear Support Vector Regression
15 0.032578494 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
16 0.027629625 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
17 0.027528008 3 jmlr-2012-A Geometric Approach to Sample Compression
18 0.026518432 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
19 0.025700379 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
20 0.025085483 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
topicId topicWeight
[(0, -0.172), (1, -0.245), (2, -0.0), (3, 0.085), (4, 0.074), (5, 0.088), (6, 0.217), (7, -0.083), (8, 0.13), (9, -0.03), (10, -0.027), (11, 0.377), (12, -0.026), (13, -0.016), (14, 0.033), (15, 0.098), (16, -0.015), (17, 0.13), (18, -0.026), (19, 0.079), (20, 0.039), (21, 0.036), (22, -0.131), (23, -0.168), (24, -0.205), (25, 0.039), (26, 0.093), (27, 0.006), (28, -0.07), (29, -0.009), (30, -0.034), (31, -0.049), (32, 0.077), (33, 0.04), (34, -0.069), (35, 0.037), (36, -0.003), (37, 0.027), (38, 0.053), (39, -0.032), (40, -0.0), (41, 0.027), (42, -0.014), (43, -0.038), (44, 0.029), (45, 0.079), (46, 0.017), (47, -0.032), (48, 0.033), (49, -0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.96909231 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
2 0.67764056 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER
3 0.67426962 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
4 0.59640968 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.49895665 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
6 0.36718267 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
7 0.30181891 97 jmlr-2012-Regularization Techniques for Learning with Matrices
8 0.19224703 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
9 0.19038324 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
10 0.18422844 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
11 0.16933091 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
12 0.14726162 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
13 0.14451405 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
14 0.14436096 54 jmlr-2012-Large-scale Linear Support Vector Regression
15 0.13495108 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
16 0.13114572 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.12842265 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
18 0.12620135 3 jmlr-2012-A Geometric Approach to Sample Compression
19 0.12480195 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
20 0.12332091 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
topicId topicWeight
[(7, 0.018), (21, 0.038), (26, 0.057), (27, 0.429), (29, 0.031), (35, 0.023), (49, 0.011), (64, 0.018), (75, 0.093), (77, 0.017), (92, 0.056), (96, 0.086)]
simIndex simValue paperId paperTitle
1 0.97834682 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
2 0.89861292 3 jmlr-2012-A Geometric Approach to Sample Compression
Author: Benjamin I.P. Rubinstein, J. Hyam Rubinstein
Abstract: The Sample Compression Conjecture of Littlestone & Warmuth has remained unsolved for a quarter century. While maximum classes (concept classes meeting Sauer’s Lemma with equality) can be compressed, the compression of general concept classes reduces to compressing maximal classes (classes that cannot be expanded without increasing VC dimension). Two promising ways forward are: embedding maximal classes into maximum classes with at most a polynomial increase to VC dimension, and compression via operating on geometric representations. This paper presents positive results on the latter approach and a first negative result on the former, through a systematic investigation of finite maximum classes. Simple arrangements of hyperplanes in hyperbolic space are shown to represent maximum classes, generalizing the corresponding Euclidean result. We show that sweeping a generic hyperplane across such arrangements forms an unlabeled compression scheme of size VC dimension and corresponds to a special case of peeling the one-inclusion graph, resolving a recent conjecture of Kuzmin & Warmuth. A bijection between finite maximum classes and certain arrangements of piecewise-linear (PL) hyperplanes in either a ball or Euclidean space is established. Finally we show that d-maximum classes corresponding to PL-hyperplane arrangements in Rd have cubical complexes homeomorphic to a d-ball, or equivalently complexes that are manifolds with boundary. A main result is that PL arrangements can be swept by a moving hyperplane to unlabeled d-compress any finite maximum class, forming a peeling scheme as conjectured by Kuzmin & Warmuth. A corollary is that some d-maximal classes cannot be embedded into any maximum class of VC-dimension d + k, for any constant k. The construction of the PL sweeping involves Pachner moves on the one-inclusion graph, corresponding to moves of a hyperplane across the intersection of d other hyperplanes. This extends the well known Pachner moves for triangulations to c
same-paper 3 0.74418426 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
4 0.38765106 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
Author: Stephen Gould
Abstract: We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical models algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows. Keywords: machine learning, graphical models, computer vision, open-source software
5 0.37219244 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
6 0.36611757 106 jmlr-2012-Sign Language Recognition using Sub-Units
7 0.36243606 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
8 0.35783345 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
9 0.35537875 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
10 0.35497147 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
11 0.35173175 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
12 0.35066327 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
13 0.34838802 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
14 0.34806684 44 jmlr-2012-Feature Selection via Dependence Maximization
15 0.34124285 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
16 0.33965844 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.33852965 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
18 0.33570224 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
19 0.33554444 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
20 0.33540469 13 jmlr-2012-Active Learning via Perfect Selective Classification