jmlr jmlr2010 jmlr2010-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph
Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN
Reference: text
sentIndex sentText sentNum sentScore
1 To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. [sent-16, score-0.41]
2 We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. [sent-17, score-0.456]
3 Figure 1: Geometric illustration of the Wolfe conditions (4) and (5), marking the acceptable interval of step sizes determined by c1 ∇J(wt)⊤pt and c2 ∇J(wt)⊤pt. [sent-28, score-0.405]
4 Given an objective function J : Rd → R and a current iterate wt ∈ Rd, BFGS forms a local quadratic model of J: Qt(p) := J(wt) + (1/2) p⊤Bt^{-1} p + ∇J(wt)⊤p, (1) where Bt ≻ 0 is a positive-definite estimate of the inverse Hessian of J, and ∇J denotes the gradient. [sent-29, score-0.348]
5 Minimizing Qt(p) gives the quasi-Newton direction pt := −Bt ∇J(wt), (2) which is used for the parameter update wt+1 = wt + ηt pt. (3) [sent-30, score-0.618]
6 The step size ηt > 0 is normally determined by a line search obeying the Wolfe (1969) conditions: J(wt+1) ≤ J(wt) + c1 ηt ∇J(wt)⊤pt (sufficient decrease) (4) and ∇J(wt+1)⊤pt ≥ c2 ∇J(wt)⊤pt (curvature) (5), with 0 < c1 < c2 < 1. [sent-31, score-0.399]
7 Given a descent direction pt , the Wolfe conditions ensure that (∀t) st⊤yt > 0 and hence B0 ≻ 0 =⇒ (∀t) Bt ≻ 0. [sent-35, score-0.291]
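For readers who prefer code, here is a minimal sketch of the classical (smooth) BFGS iteration summarized above: the quasi-Newton direction (2), a step size satisfying the Wolfe conditions (4) and (5), and the standard inverse-Hessian update built from st = wt+1 − wt and yt = ∇J(wt+1) − ∇J(wt). The helper names and the simple bracketing line search are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def wolfe_step(J, gradJ, w, p, c1=1e-4, c2=0.9, eta=1.0, max_iter=50):
    """Find a step size satisfying the Wolfe conditions (4) and (5) by a
    simple expand/bisect scheme; an illustrative strategy, not the paper's."""
    lo, hi = 0.0, np.inf
    f0, g0p = J(w), gradJ(w) @ p
    for _ in range(max_iter):
        if J(w + eta * p) > f0 + c1 * eta * g0p:      # sufficient decrease (4) violated
            hi = eta
        elif gradJ(w + eta * p) @ p < c2 * g0p:        # curvature condition (5) violated
            lo = eta
        else:
            return eta
        eta = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return eta

def bfgs_step(J, gradJ, w, B):
    """One BFGS iteration: direction (2), update (3), and the inverse-Hessian update."""
    g = gradJ(w)
    p = -B @ g                          # quasi-Newton direction (2)
    eta = wolfe_step(J, gradJ, w, p)
    w_new = w + eta * p                 # parameter update (3)
    s, y = w_new - w, gradJ(w_new) - g
    rho = 1.0 / (s @ y)                 # well defined: the Wolfe conditions give s'y > 0
    I = np.eye(len(w))
    B_new = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    return w_new, B_new
```

This smooth baseline is what the subgradient generalizations described below modify.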
8 There have been some attempts to apply (L)BFGS directly to nonsmooth optimization problems, in the hope that they would perform well on nonsmooth functions that are convex and differentiable almost everywhere. [sent-38, score-0.44]
9 Therefore, subgradient-based approaches such as subgradient descent (Nedić and Bertsekas, 2000) or bundle methods (Joachims, 2006; Franc and Sonnenburg, 2008; Teo et al. [sent-43, score-0.303]
10 This results in subBFGS and its memory-limited variant subLBFGS, two new subgradient quasi-Newton methods that are applicable to nonsmooth convex optimization problems. [sent-57, score-0.394]
11 In particular, we apply our algorithms to a variety of machine learning problems, exploiting knowledge about the subdifferential of the binary hinge loss and its generalizations to the multiclass and multilabel settings. [sent-58, score-0.456]
12 We describe a new efficient algorithm to identify the nonsmooth points of a one-dimensional pointwise maximum of linear functions in Section 5, then use it to develop an exact line search that extends our optimization algorithms to the multiclass and multilabel settings (Section 6). [sent-61, score-0.646]
13 Motivation: The application of standard (L)BFGS to nonsmooth optimization is problematic since the quasi-Newton direction generated at a nonsmooth point is not necessarily a descent direction. [sent-65, score-0.558]
14 Even though nonsmooth convex functions are differentiable everywhere except on a set of Lebesgue measure zero, it is unwise to just use a smooth optimizer on a nonsmooth convex problem under the assumption that "it should work almost surely." [sent-75, score-0.502]
15 A Toy Example: The following simple example demonstrates the problems faced by BFGS when working with a nonsmooth objective function, and how our subgradient BFGS (subBFGS) method (to be introduced in Section 3) with exact line search overcomes these problems. [sent-79, score-0.582]
16 Figure 3: Left: the nonsmooth convex function (8); optimization trajectory of BFGS with inexact line search (center) and subBFGS (right) on this function. [sent-119, score-0.389]
17 A generally sensible strategy is to use an exact line search that finds the optimum along a given descent direction (cf. [sent-124, score-0.309]
18 If an arbitrary subgradient is supplied instead, the BFGS update (6) can produce a search direction which is not a descent direction, causing the next line search to fail. [sent-129, score-0.507]
19 As Figure 3 (right) shows, once the first iteration of subBFGS lands it on the hinge at x = 0, its direction-finding routine (Algorithm 2) finds a descent direction for the next step. [sent-132, score-0.329]
20 To show this, we apply LBFGS to L2 -regularized risk minimization problems (30) with binary hinge loss (31), a typical nonsmooth optimization problem encountered in machine learning. [sent-137, score-0.417]
21 Figure 4: Performance of subLBFGS (solid) and standard LBFGS with exact (dashed) and inexact (dotted) line search methods (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with the binary (left and center) and multiclass hinge losses (right). [sent-149, score-0.467]
22 This is because using an exact line search on a nonsmooth objective function increases the chance of landing on nonsmooth points, a situation that standard BFGS (resp. [sent-153, score-0.628]
23 Here we used an efficient inexact line search that uses a caching scheme specifically designed for L2 -regularized hinge loss (cf. [sent-157, score-0.311]
24 One way to get around this is to force LBFGS to take a unit step along its search direction to escape this nonsmooth point. [sent-163, score-0.346]
25 This would reduce the quasi-Newton direction pt = −Bt gt , gt ∈ ∂J(wt ) to simply a scaled subgradient direction. [sent-169, score-0.656]
26 Figure 5: Performance of subLBFGS, GD, and subGD (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel (right) hinge losses. [sent-177, score-0.447]
27 Here GD uses a random subgradient for its parameter update, while subGD is an improved subgradient descent method whose parameter is updated in the direction produced by our direction-finding routine (Algorithm 2) with Bt = I. [sent-212, score-0.504]
28 This is beneficial for nonsmooth problems, where the Hessian does not fully represent the overall curvature of the objective function. [sent-226, score-0.306]
29 Our modifications can be grouped into three areas, which we elaborate on in turn: generalizing the local quadratic model, finding a descent direction, and finding a step size that obeys a subgradient reformulation of the Wolfe conditions. [sent-229, score-0.29]
30 Generalizing the Local Quadratic Model: Recall that BFGS assumes that the objective function J is differentiable everywhere so that at the current iterate wt it can construct a local quadratic model (1) of J(wt). [sent-241, score-0.402]
31 To resolve the ambiguity, we could simply replace the gradient ∇J(wt ) in (1) with an arbitrary subgradient gt ∈ ∂J(wt ). [sent-243, score-0.299]
32 However, as will be discussed later, the resulting quasi-Newton direction pt := −Bt gt is not necessarily a descent direction. [sent-244, score-0.435]
33 The normalized steepest descent direction solves min_{p∈Rd} J′(wt, p) subject to |||p||| ≤ 1, (12) where J′(wt, p) := lim_{η↓0} [J(wt + ηp) − J(wt)]/η is the directional derivative of J at wt in direction p, and ||| · ||| is a norm defined on Rd. [sent-254, score-0.348]
34 In other words, the normalized steepest descent direction is the direction of bounded norm along which the maximum rate of decrease in the objective function value is achieved. [sent-255, score-0.365]
35 Therefore, our search direction is essentially an unnormalized steepest descent direction with respect to the weighted norm (14). [sent-264, score-0.359]
36 1), multiclass and multilabel hinge losses (Section 6), and L1 -regularized logistic loss (Section 8. [sent-273, score-0.418]
37 Finding a Descent Direction: A direction pt is a descent direction if and only if g⊤pt < 0 ∀g ∈ ∂J(wt) (Hiriart-Urruty and Lemaréchal, 1993, Theorem VIII. [sent-277, score-0.369]
38 Equivalently, pt is a descent direction if and only if sup_{g∈∂J(wt)} g⊤pt < 0. (16) For a smooth convex function, the quasi-Newton direction (2) is always a descent direction because ∇J(wt)⊤pt = −∇J(wt)⊤Bt ∇J(wt) < 0 holds due to the positivity of Bt. [sent-281, score-0.431]
39 For nonsmooth functions, however, the quasi-Newton direction pt := −Bt gt for a given gt ∈ ∂J(wt ) may not fulfill the descent condition (16), making it impossible to find a step size η > 0 that obeys the Wolfe conditions (4, 5), thus causing a failure of the line search. [sent-282, score-0.873]
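A simplified sketch of an iterative direction-finding loop in the spirit of Algorithm 2 is given below. It repeatedly asks a sup-oracle for the subgradient that most violates the descent condition (16), and blends it into a running aggregate ḡ by minimizing the quadratic (1/2) ḡ⊤Bt ḡ over the connecting line segment. The blending rule, the tolerance, and the iteration cap are illustrative choices of ours, not the paper's exact procedure; sup_oracle is an assumed helper.

```python
import numpy as np

def find_descent_direction(B, g0, sup_oracle, tol=1e-8, max_iter=100):
    """Return a direction p = -B @ g_bar intended to satisfy
    sup_{g in dJ(w)} g^T p < 0, where sup_oracle(p) returns
    argsup_{g in dJ(w)} g^T p.  Illustrative sketch only."""
    g_bar = g0.copy()
    for _ in range(max_iter):
        p = -B @ g_bar
        g_star = sup_oracle(p)              # worst-case subgradient for the current p
        if g_star @ p < -tol:               # descent condition (16) holds
            return p
        # Blend g_star into the aggregate: minimize 0.5 * g'^T B g' over the
        # segment g' = (1 - mu) * g_bar + mu * g_star (illustrative rule).
        d = g_star - g_bar
        dBd = d @ (B @ d)
        mu = 1.0 if dBd <= 0 else np.clip(-(d @ (B @ g_bar)) / dBd, 0.0, 1.0)
        g_bar = g_bar + mu * d
    return -B @ g_bar                       # best effort after max_iter iterations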
40 Subgradient Line Search: Given the current iterate wt and a search direction pt, the task of a line search is to find a step size η > 0 which reduces the objective function value along the line wt + ηpt, that is, to minimize Φ(η) := J(wt + ηpt). [sent-312, score-1.117]
41 The curvature condition (24), on the other hand, is always satisfied by η∗, as long as pt is a descent direction (16): sup_{g′∈∂J(wt+η∗pt)} g′⊤pt = sup_{g∈∂Φ(η∗)} g ≥ 0 > sup_{g∈∂J(wt)} g⊤pt, because 0 ∈ ∂Φ(η∗). [sent-327, score-0.8]
42 Figure 8: Geometric illustration of the subgradient Wolfe conditions (23) and (24), marking the acceptable interval of step sizes determined by c1 sup_{g∈∂J(wt)} g⊤pt and c2 sup_{g∈∂J(wt)} g⊤pt. [sent-328, score-0.902]
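The conditions illustrated in Figure 8 can be checked numerically. The sketch below encodes our reading of the subgradient Wolfe conditions (23) and (24): the gradient terms of (4) and (5) are replaced by sup_{g∈∂J(wt)} g⊤pt on the right-hand sides and by sup_{g′∈∂J(wt+ηpt)} g′⊤pt on the left of the curvature test. The exact inequality forms and the helper sup_oracle_at are assumptions based on the surrounding text.

```python
def subgradient_wolfe_ok(J, sup_oracle_at, w, p, eta, c1=1e-4, c2=0.9):
    """Check the subgradient Wolfe conditions (23)/(24) as read from the text.
    sup_oracle_at(x, p) should return sup_{g in dJ(x)} g^T p (a scalar)."""
    s0 = sup_oracle_at(w, p)                  # sup over dJ(w_t)
    s1 = sup_oracle_at(w + eta * p, p)        # sup over dJ(w_t + eta * p_t)
    sufficient_decrease = J(w + eta * p) <= J(w) + c1 * eta * s0   # (23)
    curvature = s1 >= c2 * s0                                      # (24)
    return sufficient_decrease and curvature
```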
43 Extending this condition to nonsmooth functions, we require (wt+1 − wt)⊤(gt+1 − gt) > 0, (25) where gt+1 ∈ ∂J(wt+1) and gt ∈ ∂J(wt). [sent-332, score-0.759]
44 If J is strongly convex and wt+1 ≠ wt, then (25) holds for any choice of gt+1 and gt. [sent-333, score-0.414]
45 To see this, we first use the fact that ηt pt = wt+1 − wt and ηt > 0 to rewrite (25) as pt⊤ gt+1 > pt⊤ gt , where gt+1 ∈ ∂J(wt+1 ) and gt ∈ ∂J(wt ). [sent-336, score-0.693]
46 The monotonic property of ∂Φ(η) given in Theorem 1 (below) ensures that pt⊤gt+1 is no less than pt⊤gt for any choice of gt+1 and gt, that is, inf_{g∈∂J(wt+1)} pt⊤g ≥ sup_{g∈∂J(wt)} pt⊤g. (27) [sent-338, score-0.357]
47 This means that the only case where inequality (26) is violated is when both terms of (27) are equal, and gt+1 = arg inf_{g∈∂J(wt+1)} g⊤pt and gt = arg sup_{g∈∂J(wt)} g⊤pt, that is, in this case pt⊤gt+1 = pt⊤gt. [sent-339, score-0.667]
48 Since BFGS “senses” the Hessian via (6) only through the parameter and gradient displacements st and yt, we can translate the bounds on the spectrum of Ht into conditions that only involve st and yt: (∀t) st⊤yt / (st⊤st) ≥ 1/H and yt⊤yt / (st⊤yt) ≤ 1/h, with 0 < h ≤ H < ∞. [sent-351, score-0.445]
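Conditions of this kind can be monitored directly from the pairs (st, yt). The small test below treats the two quotients as curvature estimates and rejects a pair that falls outside the stated bounds; the thresholds and the idea of skipping the update in that case are our illustrative reading, not a prescription from the paper.

```python
import numpy as np

def curvature_pair_ok(s, y, h=1e-8, H=1e8):
    """Accept a displacement pair (s, y) only if it respects the spectrum-style
    bounds discussed above; the thresholds are illustrative assumptions."""
    sy = s @ y
    if sy <= 0:                     # would destroy positive definiteness of Bt
        return False
    return (sy / (s @ s) >= 1.0 / H) and ((y @ y) / sy <= 1.0 / h)
```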
49 7), who applied the classical BFGS algorithm with a specially designed line search to nonsmooth functions. [sent-371, score-0.33]
50 L2 -regularized risk minimization with the binary hinge loss is a convex but nonsmooth optimization problem; in this section we show how subBFGS (Algorithm 1) can be applied to this problem. [sent-376, score-0.455]
51 Towards this end, we use (32) to obtain sup_{g∈∂J(wt)} g⊤p = sup_{βi, i∈Mt} (w̄t − (1/n) ∑_{i∈Mt} βi zi xi)⊤ p = w̄t⊤p − (1/n) ∑_{i∈Mt} inf_{βi∈[0,1]} (βi zi xi⊤p). (33) [sent-393, score-0.508]
52 Since for a given p the first term of the right-hand side of (33) is a constant, the supremum is attained when we set βi ∀i ∈ Mt via the following strategy: βi = 0 if zi xi⊤pt ≥ 0, and βi = 1 if zi xi⊤pt < 0. [sent-394, score-0.39]
53 Right: The subgradient of Φ(η) increases monotonically with η, and jumps discontinuously at subdifferentiable points. [sent-399, score-0.298]
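In code, the β-setting rule of (33) above yields a simple oracle for the binary hinge objective: margin violators always contribute their full subgradient, while points exactly on the margin (the set Mt) get βi chosen by the sign of zi xi⊤p. The sketch below is a minimal version for J(w) = (λ/2)‖w‖² + (1/n)∑ max(0, 1 − zi w⊤xi); the λw regularizer gradient and the tolerance used to detect margin points are assumptions.

```python
import numpy as np

def hinge_sup_subgradient(w, p, X, z, lam, tol=1e-12):
    """Return argsup_{g in dJ(w)} g^T p for the L2-regularized binary hinge loss.
    X: (n, d) data matrix, z: (n,) labels in {-1, +1}."""
    n = X.shape[0]
    margins = z * (X @ w)
    beta = np.zeros(n)
    beta[margins < 1.0 - tol] = 1.0               # margin violators: unique contribution
    on_margin = np.abs(margins - 1.0) <= tol      # the set M_t of points on the margin
    # beta-rule from the text: beta_i = 1 iff z_i x_i^T p < 0, else 0
    beta[on_margin & (z * (X @ p) < 0.0)] = 1.0
    return lam * w - (X * (beta * z)[:, None]).sum(axis=0) / n
```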
54 Figure 11: Nonsmooth convex function Φ of step size η along the search direction; the target segment is the one containing the optimal step size. [sent-411, score-0.328]
55 Exact Line Search: Given a direction p, exact line search finds the optimal step size η∗ := argmin_{η≥0} Φ(η) that satisfies 0 ∈ ∂Φ(η∗), or equivalently inf ∂Φ(η∗) ≤ 0 ≤ sup ∂Φ(η∗). [sent-415, score-0.3]
56 Segmenting the Pointwise Maximum of 1-D Linear Functions: The line search of Algorithm 3 requires a vector η listing the subdifferentiable points along the line w + ηp, and sorts it in non-descending order (Line 5). [sent-424, score-0.334]
57 For an objective function like (30) whose nonsmooth component is just a sum of hinge losses (31), this vector is very easy to compute (cf. [sent-425, score-0.409]
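For the binary hinge loss these subdifferentiable points are simply the step sizes at which individual margins cross 1, ηi = (1 − zi w⊤xi)/(zi xi⊤p). Between consecutive breakpoints Φ(η) is a smooth quadratic, so an exact line search can scan the sorted breakpoints for the segment on which the one-sided derivative changes sign and solve for the minimizer there. The sketch below is our reading of what Algorithm 3 computes for J(w) = (λ/2)‖w‖² + (1/n)∑ max(0, 1 − zi w⊤xi); it is not the paper's cached implementation.

```python
import numpy as np

def exact_line_search_hinge(w, p, X, z, lam):
    """Exact minimizer of Phi(eta) = J(w + eta * p) for the L2-regularized
    binary hinge loss, by scanning hinge breakpoints (illustrative sketch)."""
    a = z * (X @ w)                  # current margins
    b = z * (X @ p)                  # margin change rates along p
    n = len(z)
    wp, pp = w @ p, p @ p

    def right_deriv(eta, active_sum):
        return lam * (wp + eta * pp) - active_sum / n

    with np.errstate(divide="ignore", invalid="ignore"):
        etas = np.where(b != 0.0, (1.0 - a) / b, np.inf)   # breakpoints
    order = np.argsort(etas)

    active = (a < 1.0)               # hinge terms active just after eta = 0
    S = b[active].sum()
    prev = 0.0
    if right_deriv(0.0, S) >= 0.0:
        return 0.0
    for i in order:
        eta_i = etas[i]
        if eta_i <= 0.0 or np.isinf(eta_i):
            continue
        # stationary point of the quadratic on the current segment
        eta_star = (S / n - lam * wp) / (lam * pp)
        if prev <= eta_star <= eta_i:
            return eta_star
        # term i toggles in or out of the active set at its breakpoint
        S += -b[i] if active[i] else b[i]
        active[i] = not active[i]
        if right_deriv(eta_i, S) >= 0.0:   # 0 lies in the subdifferential here
            return eta_i
        prev = eta_i
    return (S / n - lam * wp) / (lam * pp)  # minimizer beyond the last breakpoint
```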
58 In Section 6 we will then use this technique (Algorithm 4) to perform efficient exact line search in the multiclass and multilabel settings. [sent-439, score-0.41]
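In the multiclass and multilabel settings, the one-dimensional objective along the search direction involves, for every training point, the pointwise maximum of several lines, so the subdifferentiable points are the breakpoints of that upper envelope. A standard way to segment the pointwise maximum of lines y = b + ηa is sketched below (sort by slope and sweep, discarding dominated lines); we believe this captures the spirit of Algorithm 4, but the paper's own algorithm and its complexity guarantees may differ in detail.

```python
def upper_envelope(lines):
    """Segment the pointwise maximum over eta >= 0 of lines given as
    (slope, intercept) pairs.  Returns (eta_start, slope, intercept) triples,
    one per linear segment of the envelope, in order of increasing eta_start."""
    lines = sorted(lines)
    dedup = []
    for a, b in lines:                    # among equal slopes keep the largest intercept
        if dedup and dedup[-1][0] == a:
            dedup[-1] = (a, b)
        else:
            dedup.append((a, b))
    hull, starts = [], []                 # envelope lines and where each one takes over
    for a, b in dedup:
        while hull:
            a0, b0 = hull[-1]
            x = (b0 - b) / (a - a0)       # the new (steeper) line overtakes hull[-1] here
            if x <= starts[-1]:
                hull.pop(); starts.pop()  # hull[-1] never attains the maximum on eta >= 0
            else:
                hull.append((a, b)); starts.append(x)
                break
        if not hull:
            hull.append((a, b)); starts.append(0.0)
    return [(s, a, b) for s, (a, b) in zip(starts, hull)]
```

The union of the returned eta_start values over all training examples gives the sorted vector of subdifferentiable points that the line search consumes.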
59 Multiclass Hinge Loss: A variety of multiclass hinge losses have been proposed in the literature that generalize the binary hinge loss, and enforce a margin of separation between the true label zi and every other label. [sent-475, score-0.423]
60 Adapting (30) to the multiclass hinge loss (42) we obtain J(w) := (λ/2) ‖w‖² + (1/n) ∑i max_z [∆(z, zi) + f(w, xi, z) − f(w, xi, zi)]. [sent-480, score-0.314]
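As a concrete reading of this objective, the sketch below evaluates it with a linear model f(w, x, z) = wz⊤x (one weight vector per class, an assumption about the feature map) and a uniform margin ∆(z, zi) = 1 for z ≠ zi.

```python
import numpy as np

def multiclass_hinge_objective(W, X, z, lam):
    """J(W) = (lam/2)||W||^2 + (1/n) sum_i max_z [Delta(z,z_i) + f(W,x_i,z) - f(W,x_i,z_i)]
    with f(W, x, z) = W[z] @ x and a uniform margin Delta = 1 (0 for z = z_i)."""
    n = X.shape[0]
    scores = X @ W.T                           # (n, num_classes) class scores
    delta = np.ones_like(scores)
    delta[np.arange(n), z] = 0.0               # no margin against the true label
    viol = delta + scores - scores[np.arange(n), z][:, None]
    return 0.5 * lam * np.sum(W * W) + viol.max(axis=1).mean()
```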
61 Efficient Multiclass Direction-Finding Oracle: For L2 -regularized risk minimization with multiclass hinge loss, we can use a similar scheme as described in Section 4. [sent-486, score-0.293]
62 Multilabel Hinge Loss: Recently, there has been interest in extending the concept of the hinge loss to multilabel problems. [sent-553, score-0.315]
63 For a uniform margin ∆(z′, z) = τ ∀z′ ≠ z, our multilabel hinge loss (55) reduces to the decoupled version (54), which in turn reduces to the multiclass hinge loss (42) if Zi := {zi} for all i. [sent-557, score-0.579]
64 The subdifferential of the multilabel analogue of the L2 -regularized multiclass objective (43) can then be written just as in (44), with coefficients βi,z := ∑_{z′:(z′,z)∈Zi∗} γ(i)_{z′,z} − ∑_{z′:(z,z′)∈Zi∗} γ(i)_{z,z′}, where (∀i) ∑_{(z,z′)∈Zi∗} γ(i)_{z,z′} = 1 and γ(i)_{z,z′} ≥ 0. [sent-559, score-0.368]
65 3 for the multiclass hinge loss can be extended in a straightforward manner to our multilabel setting. [sent-563, score-0.418]
66 Nonsmooth Convex Optimization: There are four main approaches to nonsmooth convex optimization: quasi-Newton methods, bundle methods, stochastic dual methods, and smooth approximation. [sent-574, score-0.366]
67 Nonsmooth Quasi-Newton Methods: These methods try to find a descent quasi-Newton direction at every iteration, and invoke a line search to minimize the one-dimensional convex function along that direction. [sent-578, score-0.323]
68 In contrast, given a current iterate wt , our direction-finding routine (Algorithm 2) samples subgradients from the set ∂J(wt ) via the oracle. [sent-583, score-0.392]
69 From the optimization viewpoint this objective is very similar to L2 -regularized hinge loss; the direction finding and line search methods that we discussed in Sections 3. [sent-586, score-0.415]
70 Experiments: We evaluated the performance of our subLBFGS algorithm, and compared it to other state-of-the-art nonsmooth optimization methods, on L2 -regularized binary, multiclass, and multilabel hinge loss minimization problems. [sent-634, score-0.547]
71 The subgradient for the construction of the subLBFGS search direction (cf. [sent-652, score-0.3]
72 CPU seconds (bottom row) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel (right) hinge losses. [sent-675, score-0.558]
73 In the case of multiclass and multilabel classification, where the structure of ∂J(w) is more complicated, we can see from Figure 13 (top center and right) that a better search direction can lead to faster convergence in terms of iteration numbers. [sent-680, score-0.423]
74 Figure 14: Performance of subLBFGS with varying buffer size (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel hinge losses (right). [sent-693, score-0.466]
75 L2 -Regularized Binary Hinge Loss: For our first set of experiments, we applied subLBFGS with exact line search (Algorithm 3) to the task of L2 -regularized binary hinge loss minimization. [sent-706, score-0.314]
76 CPU seconds on L2 -regularized binary hinge loss minimization tasks. [sent-750, score-0.303]
77 CPU seconds taken to reduce the objective function to within 2% of the optimal value on L2 -regularized binary hinge loss minimization tasks. [sent-764, score-0.376]
78 An oracle that supplies arg sup_{g∈∂J(w)} g⊤p for this objective is easily constructed by noting that (58) is nonsmooth whenever at least one component of the parameter vector w is zero. [sent-775, score-0.318]
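A minimal sketch of such an oracle is given below for an objective of the form J(w) = loss(w) + λ‖w‖1, which is our reading of (58): wherever wj ≠ 0 the L1 term contributes a unique gradient λ sign(wj), and wherever wj = 0 it contributes the interval [−λ, λ], so the supremum of g⊤p is attained by pushing each such coordinate in the sign of pj. The argument grad_loss is an assumed helper holding the gradient of the smooth loss part.

```python
import numpy as np

def l1_arg_sup_subgradient(grad_loss, w, p, lam):
    """Return argsup_{g in dJ(w)} g^T p for J(w) = loss(w) + lam * ||w||_1."""
    g = grad_loss.copy()
    nonzero = w != 0.0
    g[nonzero] += lam * np.sign(w[nonzero])                 # unique subgradient coordinates
    zero = ~nonzero
    g[zero] += lam * np.where(p[zero] >= 0.0, 1.0, -1.0)    # maximize g_j * p_j at w_j = 0
    return g
```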
79 Using the variant of the multiclass and multilabel hinge loss which enforces a uniform margin of separation (∆(z, z′) = 1 ∀z ≠ z′), we experimentally evaluated both algorithms on a number of publicly available data sets (Table 3). [sent-818, score-0.418]
80 CPU seconds on L2 -regularized multiclass hinge loss minimization tasks. [sent-867, score-0.406]
81 CPU seconds in L2 -regularized multilabel hinge loss minimization tasks. [sent-896, score-0.457]
82 The primary reason for this is that the exact line search used by ls-BMRM and subLBFGS requires substantially more computational effort in the multilabel than in the multiclass setting. [sent-905, score-0.41]
83 We proposed subBFGS (and subLBFGS, its limited-memory variant) for handling nonsmooth convex optimization problems, and proved its global convergence in objective function value. [sent-911, score-0.333]
84 We applied our algorithm to a variety of machine learning problems employing the L2 -regularized binary hinge loss and its multiclass and multilabel generalizations, as well as L1 -regularized risk minimization with logistic loss. [sent-912, score-0.473]
85 Similarly, the line search, which is the expensive part of the computation on multiclass and multilabel problems, is easy to parallelize: The slaves run Algorithm 4 on subsets of the data; the results are fed back to the master which can then run Algorithm 5 to compute the step size. [sent-917, score-0.319]
86 We proposed such line searches for the multiclass hinge loss and its extension to the multilabel setting, based on a conceptually simple yet optimal algorithm to segment the pointwise maximum of lines. [sent-934, score-0.515]
87 This direction is a direction of descent as long as the last iterate p(i) fulfills the descent condition (16). [sent-1010, score-0.345]
88 Using ḡt ∈ ∂J(wt) and the definition (7) of subgradient, we have (∀w) J(w) ≥ J(wt) + (w − wt)⊤ḡt = J(wt+1) + (w − wt+1)⊤ḡt + J(wt) − J(wt+1) + (wt+1 − wt)⊤ḡt. [sent-1126, score-0.684]
89 Using wt+1 − wt = −ηt Bt ḡt and (108) gives (∀w) J(w) ≥ J(wt+1) + (w − wt+1)⊤ḡt + (c1 ηt/2) ḡt⊤Bt ḡt − ηt ḡt⊤Bt ḡt = J(wt+1) + (w − wt+1)⊤ḡt − εt′, where εt′ := ηt (1 − c1/2) ḡt⊤Bt ḡt. [sent-1127, score-0.846]
90 Under the third condition of Theorem 14, it follows from the first inequality in (84) in Corollary 4 that sup_{g∈∂J(wt)} g⊤pt ≤ −(1/2) ḡt⊤Bt ḡt, (109) where pt = −Bt ḡt, ḡt ∈ ∂J(wt) is the search direction returned by Algorithm 2. [sent-1143, score-0.781]
91 Finally, to show (107) we use wt+1 − wt = −ηt Bt ḡt, the definition of the matrix norm ‖B‖ := max_{x≠0} ‖Bx‖/‖x‖, and the upper bound on the spectrum of Bt to write ‖wt+1 − wt‖ = ηt ‖Bt ḡt‖ ≤ ηt ‖Bt‖ ‖ḡt‖ ≤ ηt H ‖ḡt‖. [sent-1149, score-0.855]
92 Recall that ḡt = gεt′, at = c1 ηt h/2, and β := c1 h/(2H). [sent-1150, score-0.432]
93 Counterexample for Steepest Descent: The first counterexample (112) is given by Wolfe (1975) to show the non-convergent behaviour of the steepest descent method with an exact line search (denoted GD): f(x, y) := 5√(9x² + 16y²) if x ≥ |y|, and 9x + 16|y| otherwise. [sent-1157, score-0.324]
94 In fact, the second step of subBFGS lands exactly on the hinge x ≤ 0, y = 0, where a subgradient pointing to the optimum is available. [sent-1165, score-0.29]
95 This example (113) shows that steepest subgradient descent with an exact line search (denoted subGD) may not converge to the optimum of a nonsmooth function. [sent-1170, score-0.645]
96 subGD updates its parameters along the steepest descent subgradient direction, which is obtained by solving the min-sup problem (13) with respect to the Euclidean norm. [sent-1173, score-0.291]
97 The zigzagging optimization trajectory of subGD does not allow it to land on any informative position such as the hinge y = 0, where the steepest subgradient descent direction points to the desired region (y < 0); Hiriart-Urruty and Lemaréchal (1993, Section VIII. [sent-1176, score-0.554]
98 Counterexample for BFGS: The final counterexample (114) is given by Lewis and Overton (2008b) to show that the standard BFGS algorithm with an exact line search can break down when encountering a nonsmooth point: f(x, y) := max{2|x| + y, 3y}. [sent-1181, score-0.389]
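For completeness, the two counterexample functions quoted in this appendix, (112) and (114), can be transcribed directly; the square root in (112) follows Wolfe's original example as reconstructed above, and is the only part of these definitions not visible verbatim in the text.

```python
import numpy as np

def wolfe_counterexample(x, y):
    """Equation (112), Wolfe's example; nonsmooth on the half-line y = 0, x <= 0."""
    if x >= abs(y):
        return 5.0 * np.sqrt(9.0 * x**2 + 16.0 * y**2)
    return 9.0 * x + 16.0 * abs(y)

def lewis_overton_counterexample(x, y):
    """Equation (114) from Lewis and Overton (2008b)."""
    return max(2.0 * abs(x) + y, 3.0 * y)
```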
99 This is not surprising because at a nonsmooth point w the quasi-Newton direction p := −Bg for a given subgradient g ∈ ∂J(w) is not necessarily a direction of descent. [sent-1184, score-0.512]
100 Behavior of BFGS with an exact line search on nonsmooth examples. [sent-1321, score-0.354]
wordName wordTfidf (topN-words)
[('sublbfgs', 0.46), ('wt', 0.27), ('subbfgs', 0.241), ('bfgs', 0.222), ('nonsmooth', 0.201), ('bmrm', 0.194), ('onsmooth', 0.159), ('subgradient', 0.155), ('multilabel', 0.154), ('unter', 0.153), ('gt', 0.144), ('subdifferentiable', 0.143), ('pt', 0.135), ('hinge', 0.135), ('cpu', 0.135), ('bt', 0.134), ('ewton', 0.131), ('uasi', 0.131), ('chraudolph', 0.131), ('supg', 0.12), ('ishwanathan', 0.118), ('onvex', 0.114), ('seconds', 0.111), ('ptimization', 0.109), ('pproach', 0.105), ('wolfe', 0.103), ('multiclass', 0.103), ('ocas', 0.093), ('yt', 0.079), ('direction', 0.078), ('descent', 0.078), ('objective', 0.073), ('lbfgs', 0.073), ('ccat', 0.071), ('bundle', 0.07), ('sup', 0.069), ('search', 0.067), ('line', 0.062), ('inex', 0.061), ('steepest', 0.058), ('subgd', 0.055), ('lemar', 0.054), ('mt', 0.052), ('subgradients', 0.051), ('st', 0.051), ('gd', 0.051), ('chal', 0.05), ('zi', 0.05), ('birge', 0.044), ('covertype', 0.042), ('hessian', 0.04), ('piecewise', 0.039), ('convex', 0.038), ('routine', 0.038), ('subdifferential', 0.038), ('pointwise', 0.035), ('counterexample', 0.035), ('hinges', 0.035), ('lines', 0.033), ('iterate', 0.033), ('dual', 0.033), ('curvature', 0.032), ('obeys', 0.031), ('rd', 0.031), ('minimization', 0.031), ('tolerance', 0.03), ('nocedal', 0.029), ('kmax', 0.028), ('sub', 0.028), ('stack', 0.028), ('nding', 0.027), ('primal', 0.027), ('spectrum', 0.027), ('quadratic', 0.026), ('loss', 0.026), ('gao', 0.025), ('sorted', 0.025), ('risk', 0.024), ('exact', 0.024), ('franc', 0.024), ('oracle', 0.024), ('smooth', 0.024), ('overton', 0.022), ('secant', 0.022), ('teo', 0.022), ('wright', 0.022), ('convergence', 0.021), ('iterations', 0.021), ('slopes', 0.021), ('tsochantaridis', 0.021), ('fi', 0.021), ('qk', 0.021), ('inexact', 0.021), ('solid', 0.02), ('supremum', 0.02), ('arg', 0.02), ('inverse', 0.019), ('ow', 0.019), ('buffer', 0.019), ('minp', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph
Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN
2 0.33278844 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le
Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization
3 0.22129446 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
4 0.082962736 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
5 0.064932592 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith
Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields
6 0.061791591 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
7 0.059251495 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
8 0.058461756 44 jmlr-2010-Graph Kernels
9 0.053807184 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
10 0.053299814 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.050688554 72 jmlr-2010-Matrix Completion from Noisy Entries
12 0.050373856 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
13 0.044569433 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.044321802 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
15 0.04347533 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.043382637 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
17 0.038175162 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.037881609 25 jmlr-2010-Composite Binary Losses
19 0.037844293 84 jmlr-2010-On Spectral Learning
20 0.037566043 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
topicId topicWeight
[(0, -0.199), (1, -0.19), (2, 0.24), (3, -0.038), (4, 0.474), (5, 0.04), (6, -0.036), (7, -0.031), (8, -0.07), (9, -0.005), (10, -0.176), (11, -0.048), (12, -0.095), (13, 0.002), (14, 0.012), (15, -0.015), (16, -0.1), (17, 0.116), (18, -0.028), (19, 0.05), (20, 0.052), (21, -0.032), (22, 0.028), (23, -0.062), (24, -0.002), (25, -0.017), (26, 0.007), (27, -0.006), (28, -0.015), (29, -0.04), (30, -0.013), (31, 0.066), (32, -0.038), (33, -0.028), (34, -0.0), (35, -0.069), (36, 0.029), (37, 0.067), (38, 0.02), (39, -0.032), (40, -0.015), (41, 0.024), (42, 0.014), (43, 0.087), (44, -0.056), (45, -0.048), (46, 0.017), (47, 0.025), (48, -0.116), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95236111 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
2 0.92002398 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
3 0.85382402 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
4 0.4533681 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
5 0.29291263 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
6 0.25133345 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
7 0.23316638 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
8 0.23298864 84 jmlr-2010-On Spectral Learning
9 0.22176732 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
10 0.20392899 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
11 0.19707233 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
12 0.19687758 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.19146995 25 jmlr-2010-Composite Binary Losses
14 0.1909034 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
15 0.19002491 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
16 0.16464683 72 jmlr-2010-Matrix Completion from Noisy Entries
17 0.16369344 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
18 0.1608703 44 jmlr-2010-Graph Kernels
19 0.1554832 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
20 0.14907898 60 jmlr-2010-Learnability, Stability and Uniform Convergence
topicId topicWeight
[(1, 0.014), (3, 0.036), (4, 0.016), (8, 0.014), (21, 0.031), (24, 0.015), (32, 0.053), (36, 0.034), (37, 0.045), (75, 0.15), (81, 0.01), (82, 0.327), (85, 0.151), (96, 0.013)]
simIndex simValue paperId paperTitle
1 0.80806518 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
same-paper 2 0.71416879 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
3 0.5725857 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
4 0.56329292 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen
Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation
5 0.5565685 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
Author: Gavin C. Cawley, Nicola L. C. Talbot
Abstract: Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a nonnegligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds. Keywords: model selection, performance evaluation, bias-variance trade-off, selection bias, overfitting
6 0.54165763 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.54159904 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.53830296 102 jmlr-2010-Semi-Supervised Novelty Detection
9 0.53667039 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
10 0.53536427 22 jmlr-2010-Classification Using Geometric Level Sets
11 0.53484952 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
12 0.53352696 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
14 0.532637 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
15 0.52920949 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
16 0.5287677 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.52844524 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.52754796 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.52372909 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.52316236 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM