jmlr jmlr2010 jmlr2010-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph
Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN
Reference: text
sentIndex sentText sentNum sentScore
1 To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. [sent-16, score-0.41]
2 We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. [sent-17, score-0.456]
3 Figure 1: Geometric illustration of the Wolfe conditions (4) and (5), marking the acceptable interval of step sizes determined by c1 ∇J(wt)⊤pt and c2 ∇J(wt)⊤pt. [sent-28, score-0.405]
4 Given an objective function J : Rd → R and a current iterate wt ∈ Rd, BFGS forms a local quadratic model of J: Qt(p) := J(wt) + (1/2) p⊤Bt^{-1} p + ∇J(wt)⊤p, (1) where Bt ≻ 0 is a positive-definite estimate of the inverse Hessian of J, and ∇J denotes the gradient. [sent-29, score-0.348]
5 Minimizing Qt(p) gives the quasi-Newton direction pt := −Bt ∇J(wt), (2) which is used for the parameter update wt+1 = wt + ηt pt. (3) [sent-30, score-0.618]
6 The step size ηt > 0 is normally determined by a line search obeying the Wolfe (1969) conditions: J(wt+1) ≤ J(wt) + c1 ηt ∇J(wt)⊤pt (sufficient decrease) (4) and ∇J(wt+1)⊤pt ≥ c2 ∇J(wt)⊤pt (curvature) (5), with 0 < c1 < c2 < 1. [sent-31, score-0.399]
7 Given a descent direction pt , the Wolfe conditions ensure that (∀t) st⊤yt > 0 and hence B0 ≻ 0 =⇒ (∀t) Bt ≻ 0. [sent-35, score-0.291]
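For readers who prefer code, here is a minimal sketch of the classical (smooth) BFGS iteration summarized above: the quasi-Newton direction (2), a step size satisfying the Wolfe conditions (4) and (5), and the standard inverse-Hessian update built from st = wt+1 − wt and yt = ∇J(wt+1) − ∇J(wt). The helper names and the simple bracketing line search are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def wolfe_step(J, gradJ, w, p, c1=1e-4, c2=0.9, eta=1.0, max_iter=50):
    """Find a step size satisfying the Wolfe conditions (4) and (5) by a
    simple expand/bisect scheme; an illustrative strategy, not the paper's."""
    lo, hi = 0.0, np.inf
    f0, g0p = J(w), gradJ(w) @ p
    for _ in range(max_iter):
        if J(w + eta * p) > f0 + c1 * eta * g0p:      # sufficient decrease (4) violated
            hi = eta
        elif gradJ(w + eta * p) @ p < c2 * g0p:        # curvature condition (5) violated
            lo = eta
        else:
            return eta
        eta = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return eta

def bfgs_step(J, gradJ, w, B):
    """One BFGS iteration: direction (2), update (3), and the inverse-Hessian update."""
    g = gradJ(w)
    p = -B @ g                          # quasi-Newton direction (2)
    eta = wolfe_step(J, gradJ, w, p)
    w_new = w + eta * p                 # parameter update (3)
    s, y = w_new - w, gradJ(w_new) - g
    rho = 1.0 / (s @ y)                 # well defined: the Wolfe conditions give s'y > 0
    I = np.eye(len(w))
    B_new = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    return w_new, B_new
```

This smooth baseline is what the subgradient generalizations described below modify.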
8 There have been some attempts to apply (L)BFGS directly to nonsmooth optimization problems, in the hope that they would perform well on nonsmooth functions that are convex and differentiable almost everywhere. [sent-38, score-0.44]
9 Therefore, subgradient-based approaches such as subgradient descent (Nedić and Bertsekas, 2000) or bundle methods (Joachims, 2006; Franc and Sonnenburg, 2008; Teo et al. [sent-43, score-0.303]
10 This results in subBFGS and its memory-limited variant subLBFGS, two new subgradient quasi-Newton methods that are applicable to nonsmooth convex optimization problems. [sent-57, score-0.394]
11 In particular, we apply our algorithms to a variety of machine learning problems, exploiting knowledge about the subdifferential of the binary hinge loss and its generalizations to the multiclass and multilabel settings. [sent-58, score-0.456]
12 We describe a new efficient algorithm to identify the nonsmooth points of a one-dimensional pointwise maximum of linear functions in Section 5, then use it to develop an exact line search that extends our optimization algorithms to the multiclass and multilabel settings (Section 6). [sent-61, score-0.646]
13 Motivation: The application of standard (L)BFGS to nonsmooth optimization is problematic since the quasi-Newton direction generated at a nonsmooth point is not necessarily a descent direction. [sent-65, score-0.558]
14 Even though nonsmooth convex functions are differentiable everywhere except on a set of Lebesgue measure zero, it is unwise to just use a smooth optimizer on a nonsmooth convex problem under the assumption that "it should work almost surely." [sent-75, score-0.502]
15 A Toy Example: The following simple example demonstrates the problems faced by BFGS when working with a nonsmooth objective function, and how our subgradient BFGS (subBFGS) method (to be introduced in Section 3) with exact line search overcomes these problems. [sent-79, score-0.582]
16 Figure 3: Left: the nonsmooth convex function (8); optimization trajectory of BFGS with inexact line search (center) and subBFGS (right) on this function. [sent-119, score-0.389]
17 A generally sensible strategy is to use an exact line search that finds the optimum along a given descent direction (cf. [sent-124, score-0.309]
18 If an arbitrary subgradient is supplied instead, the BFGS update (6) can produce a search direction which is not a descent direction, causing the next line search to fail. [sent-129, score-0.507]
19 As Figure 3 (right) shows, once the first iteration of subBFGS lands it on the hinge at x = 0, its direction-finding routine (Algorithm 2) finds a descent direction for the next step. [sent-132, score-0.329]
20 To show this, we apply LBFGS to L2 -regularized risk minimization problems (30) with binary hinge loss (31), a typical nonsmooth optimization problem encountered in machine learning. [sent-137, score-0.417]
21 Figure 4: Performance of subLBFGS (solid) and standard LBFGS with exact (dashed) and inexact (dotted) line search methods (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with the binary (left and center) and multiclass hinge losses (right). [sent-149, score-0.467]
22 This is because using an exact line search on a nonsmooth objective function increases the chance of landing on nonsmooth points, a situation that standard BFGS (resp. [sent-153, score-0.628]
23 Here we used an efficient inexact line search that uses a caching scheme specifically designed for L2 -regularized hinge loss (cf. [sent-157, score-0.311]
24 One way to get around this is to force LBFGS to take a unit step along its search direction to escape this nonsmooth point. [sent-163, score-0.346]
25 This would reduce the quasi-Newton direction pt = −Bt gt , gt ∈ ∂J(wt ) to simply a scaled subgradient direction. [sent-169, score-0.656]
26 Figure 5: Performance of subLBFGS, GD, and subGD (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel (right) hinge losses. [sent-177, score-0.447]
27 Here GD uses a random subgradient for its parameter update, while subGD is an improved subgradient descent method whose parameter is updated in the direction produced by our direction-finding routine (Algorithm 2) with Bt = I. [sent-212, score-0.504]
28 This is beneficial for nonsmooth problems, where the Hessian does not fully represent the overall curvature of the objective function. [sent-226, score-0.306]
29 Our modifications can be grouped into three areas, which we elaborate on in turn: generalizing the local quadratic model, finding a descent direction, and finding a step size that obeys a subgradient reformulation of the Wolfe conditions. [sent-229, score-0.29]
30 Generalizing the Local Quadratic Model: Recall that BFGS assumes that the objective function J is differentiable everywhere so that at the current iterate wt it can construct a local quadratic model (1) of J(wt). [sent-241, score-0.402]
31 To resolve the ambiguity, we could simply replace the gradient ∇J(wt ) in (1) with an arbitrary subgradient gt ∈ ∂J(wt ). [sent-243, score-0.299]
32 However, as will be discussed later, the resulting quasi-Newton direction pt := −Bt gt is not necessarily a descent direction. [sent-244, score-0.435]
33 The normalized steepest descent direction solves min_{p∈Rd} J′(wt, p) subject to |||p||| ≤ 1, (12) where J′(wt, p) := lim_{η↓0} [J(wt + ηp) − J(wt)]/η is the directional derivative of J at wt in direction p, and ||| · ||| is a norm defined on Rd. [sent-254, score-0.348]
34 In other words, the normalized steepest descent direction is the direction of bounded norm along which the maximum rate of decrease in the objective function value is achieved. [sent-255, score-0.365]
35 Therefore, our search direction is essentially an unnormalized steepest descent direction with respect to the weighted norm (14). [sent-264, score-0.359]
36 1), multiclass and multilabel hinge losses (Section 6), and L1 -regularized logistic loss (Section 8. [sent-273, score-0.418]
37 Finding a Descent Direction: A direction pt is a descent direction if and only if g⊤pt < 0 ∀g ∈ ∂J(wt) (Hiriart-Urruty and Lemaréchal, 1993, Theorem VIII. [sent-277, score-0.369]
38 Equivalently, pt is a descent direction if and only if sup_{g∈∂J(wt)} g⊤pt < 0. (16) For a smooth convex function, the quasi-Newton direction (2) is always a descent direction because ∇J(wt)⊤pt = −∇J(wt)⊤Bt ∇J(wt) < 0 holds due to the positivity of Bt. [sent-281, score-0.431]
39 For nonsmooth functions, however, the quasi-Newton direction pt := −Bt gt for a given gt ∈ ∂J(wt ) may not fulfill the descent condition (16), making it impossible to find a step size η > 0 that obeys the Wolfe conditions (4, 5), thus causing a failure of the line search. [sent-282, score-0.873]
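A simplified sketch of an iterative direction-finding loop in the spirit of Algorithm 2 is given below. It repeatedly asks a sup-oracle for the subgradient that most violates the descent condition (16), and blends it into a running aggregate ḡ by minimizing the quadratic (1/2) ḡ⊤Bt ḡ over the connecting line segment. The blending rule, the tolerance, and the iteration cap are illustrative choices of ours, not the paper's exact procedure; sup_oracle is an assumed helper.

```python
import numpy as np

def find_descent_direction(B, g0, sup_oracle, tol=1e-8, max_iter=100):
    """Return a direction p = -B @ g_bar intended to satisfy
    sup_{g in dJ(w)} g^T p < 0, where sup_oracle(p) returns
    argsup_{g in dJ(w)} g^T p.  Illustrative sketch only."""
    g_bar = g0.copy()
    for _ in range(max_iter):
        p = -B @ g_bar
        g_star = sup_oracle(p)              # worst-case subgradient for the current p
        if g_star @ p < -tol:               # descent condition (16) holds
            return p
        # Blend g_star into the aggregate: minimize 0.5 * g'^T B g' over the
        # segment g' = (1 - mu) * g_bar + mu * g_star (illustrative rule).
        d = g_star - g_bar
        dBd = d @ (B @ d)
        mu = 1.0 if dBd <= 0 else np.clip(-(d @ (B @ g_bar)) / dBd, 0.0, 1.0)
        g_bar = g_bar + mu * d
    return -B @ g_bar                       # best effort after max_iter iterations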
40 Subgradient Line Search: Given the current iterate wt and a search direction pt, the task of a line search is to find a step size η > 0 which reduces the objective function value along the line wt + ηpt, that is, to minimize Φ(η) := J(wt + ηpt). [sent-312, score-1.117]
41 The curvature condition (24), on the other hand, is always satisfied by η∗, as long as pt is a descent direction (16): sup_{g′∈∂J(wt+η∗pt)} g′⊤pt = sup_{g∈∂Φ(η∗)} g ≥ 0 > sup_{g∈∂J(wt)} g⊤pt, because 0 ∈ ∂Φ(η∗). [sent-327, score-0.8]
42 Figure 8: Geometric illustration of the subgradient Wolfe conditions (23) and (24), marking the acceptable interval of step sizes determined by c1 sup_{g∈∂J(wt)} g⊤pt and c2 sup_{g∈∂J(wt)} g⊤pt. [sent-328, score-0.902]
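The conditions illustrated in Figure 8 can be checked numerically. The sketch below encodes our reading of the subgradient Wolfe conditions (23) and (24): the gradient terms of (4) and (5) are replaced by sup_{g∈∂J(wt)} g⊤pt on the right-hand sides and by sup_{g′∈∂J(wt+ηpt)} g′⊤pt on the left of the curvature test. The exact inequality forms and the helper sup_oracle_at are assumptions based on the surrounding text.

```python
def subgradient_wolfe_ok(J, sup_oracle_at, w, p, eta, c1=1e-4, c2=0.9):
    """Check the subgradient Wolfe conditions (23)/(24) as read from the text.
    sup_oracle_at(x, p) should return sup_{g in dJ(x)} g^T p (a scalar)."""
    s0 = sup_oracle_at(w, p)                  # sup over dJ(w_t)
    s1 = sup_oracle_at(w + eta * p, p)        # sup over dJ(w_t + eta * p_t)
    sufficient_decrease = J(w + eta * p) <= J(w) + c1 * eta * s0   # (23)
    curvature = s1 >= c2 * s0                                      # (24)
    return sufficient_decrease and curvature
```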
43 Extending this condition to nonsmooth functions, we require (wt+1 − wt)⊤(gt+1 − gt) > 0, (25) where gt+1 ∈ ∂J(wt+1) and gt ∈ ∂J(wt). [sent-332, score-0.759]
44 If J is strongly convex and wt+1 ≠ wt, then (25) holds for any choice of gt+1 and gt. [sent-333, score-0.414]
45 To see this, we first use the fact that ηt pt = wt+1 − wt and ηt > 0 to rewrite (25) as pt⊤ gt+1 > pt⊤ gt , where gt+1 ∈ ∂J(wt+1 ) and gt ∈ ∂J(wt ). [sent-336, score-0.693]
46 The monotonic property of ∂Φ(η) given in Theorem 1 (below) ensures that pt⊤gt+1 is no less than pt⊤gt for any choice of gt+1 and gt, that is, inf_{g∈∂J(wt+1)} pt⊤g ≥ sup_{g∈∂J(wt)} pt⊤g. (27) [sent-338, score-0.357]
47 This means that the only case where inequality (26) is violated is when both terms of (27) are equal, and gt+1 = arg inf_{g∈∂J(wt+1)} g⊤pt and gt = arg sup_{g∈∂J(wt)} g⊤pt, that is, in this case pt⊤gt+1 = pt⊤gt. [sent-339, score-0.667]
48 Since BFGS “senses” the Hessian via (6) only through the parameter and gradient displacements st and yt, we can translate the bounds on the spectrum of Ht into conditions that only involve st and yt: (∀t) st⊤yt / (st⊤st) ≥ 1/H and yt⊤yt / (st⊤yt) ≤ 1/h, with 0 < h ≤ H < ∞. [sent-351, score-0.445]
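Conditions of this kind can be monitored directly from the pairs (st, yt). The small test below treats the two quotients as curvature estimates and rejects a pair that falls outside the stated bounds; the thresholds and the idea of skipping the update in that case are our illustrative reading, not a prescription from the paper.

```python
import numpy as np

def curvature_pair_ok(s, y, h=1e-8, H=1e8):
    """Accept a displacement pair (s, y) only if it respects the spectrum-style
    bounds discussed above; the thresholds are illustrative assumptions."""
    sy = s @ y
    if sy <= 0:                     # would destroy positive definiteness of Bt
        return False
    return (sy / (s @ s) >= 1.0 / H) and ((y @ y) / sy <= 1.0 / h)
```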
49 7), who applied the classical BFGS algorithm with a specially designed line search to nonsmooth functions. [sent-371, score-0.33]
50 L2 -regularized risk minimization with the binary hinge loss is a convex but nonsmooth optimization problem; in this section we show how subBFGS (Algorithm 1) can be applied to this problem. [sent-376, score-0.455]
51 Towards this end, we use (32) to obtain sup_{g∈∂J(wt)} g⊤p = sup_{βi, i∈Mt} (w̄t − (1/n) ∑_{i∈Mt} βi zi xi)⊤ p = w̄t⊤p − (1/n) ∑_{i∈Mt} inf_{βi∈[0,1]} (βi zi xi⊤p). (33) [sent-393, score-0.508]
52 Since for a given p the first term of the right-hand side of (33) is a constant, the supremum is attained when we set βi ∀i ∈ Mt via the following strategy: βi = 0 if zi xi⊤pt ≥ 0, and βi = 1 if zi xi⊤pt < 0. [sent-394, score-0.39]
53 Right: The subgradient of Φ(η) increases monotonically with η, and jumps discontinuously at subdifferentiable points. [sent-399, score-0.298]
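In code, the β-setting rule of (33) above yields a simple oracle for the binary hinge objective: margin violators always contribute their full subgradient, while points exactly on the margin (the set Mt) get βi chosen by the sign of zi xi⊤p. The sketch below is a minimal version for J(w) = (λ/2)‖w‖² + (1/n)∑ max(0, 1 − zi w⊤xi); the λw regularizer gradient and the tolerance used to detect margin points are assumptions.

```python
import numpy as np

def hinge_sup_subgradient(w, p, X, z, lam, tol=1e-12):
    """Return argsup_{g in dJ(w)} g^T p for the L2-regularized binary hinge loss.
    X: (n, d) data matrix, z: (n,) labels in {-1, +1}."""
    n = X.shape[0]
    margins = z * (X @ w)
    beta = np.zeros(n)
    beta[margins < 1.0 - tol] = 1.0               # margin violators: unique contribution
    on_margin = np.abs(margins - 1.0) <= tol      # the set M_t of points on the margin
    # beta-rule from the text: beta_i = 1 iff z_i x_i^T p < 0, else 0
    beta[on_margin & (z * (X @ p) < 0.0)] = 1.0
    return lam * w - (X * (beta * z)[:, None]).sum(axis=0) / n
```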
54 Figure 11: Nonsmooth convex function Φ of step size η along the search direction; the target segment is the one containing the optimal step size. [sent-411, score-0.328]
55 Exact Line Search: Given a direction p, exact line search finds the optimal step size η∗ := argmin_{η≥0} Φ(η) that satisfies 0 ∈ ∂Φ(η∗), or equivalently inf ∂Φ(η∗) ≤ 0 ≤ sup ∂Φ(η∗). [sent-415, score-0.3]
56 Segmenting the Pointwise Maximum of 1-D Linear Functions: The line search of Algorithm 3 requires a vector η listing the subdifferentiable points along the line w + ηp, and sorts it in non-descending order (Line 5). [sent-424, score-0.334]
57 For an objective function like (30) whose nonsmooth component is just a sum of hinge losses (31), this vector is very easy to compute (cf. [sent-425, score-0.409]
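For the binary hinge loss these subdifferentiable points are simply the step sizes at which individual margins cross 1, ηi = (1 − zi w⊤xi)/(zi xi⊤p). Between consecutive breakpoints Φ(η) is a smooth quadratic, so an exact line search can scan the sorted breakpoints for the segment on which the one-sided derivative changes sign and solve for the minimizer there. The sketch below is our reading of what Algorithm 3 computes for J(w) = (λ/2)‖w‖² + (1/n)∑ max(0, 1 − zi w⊤xi); it is not the paper's cached implementation.

```python
import numpy as np

def exact_line_search_hinge(w, p, X, z, lam):
    """Exact minimizer of Phi(eta) = J(w + eta * p) for the L2-regularized
    binary hinge loss, by scanning hinge breakpoints (illustrative sketch)."""
    a = z * (X @ w)                  # current margins
    b = z * (X @ p)                  # margin change rates along p
    n = len(z)
    wp, pp = w @ p, p @ p

    def right_deriv(eta, active_sum):
        return lam * (wp + eta * pp) - active_sum / n

    with np.errstate(divide="ignore", invalid="ignore"):
        etas = np.where(b != 0.0, (1.0 - a) / b, np.inf)   # breakpoints
    order = np.argsort(etas)

    active = (a < 1.0)               # hinge terms active just after eta = 0
    S = b[active].sum()
    prev = 0.0
    if right_deriv(0.0, S) >= 0.0:
        return 0.0
    for i in order:
        eta_i = etas[i]
        if eta_i <= 0.0 or np.isinf(eta_i):
            continue
        # stationary point of the quadratic on the current segment
        eta_star = (S / n - lam * wp) / (lam * pp)
        if prev <= eta_star <= eta_i:
            return eta_star
        # term i toggles in or out of the active set at its breakpoint
        S += -b[i] if active[i] else b[i]
        active[i] = not active[i]
        if right_deriv(eta_i, S) >= 0.0:   # 0 lies in the subdifferential here
            return eta_i
        prev = eta_i
    return (S / n - lam * wp) / (lam * pp)  # minimizer beyond the last breakpoint
```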
58 In Section 6 we will then use this technique (Algorithm 4) to perform efficient exact line search in the multiclass and multilabel settings. [sent-439, score-0.41]
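In the multiclass and multilabel settings, the one-dimensional objective along the search direction involves, for every training point, the pointwise maximum of several lines, so the subdifferentiable points are the breakpoints of that upper envelope. A standard way to segment the pointwise maximum of lines y = b + ηa is sketched below (sort by slope and sweep, discarding dominated lines); we believe this captures the spirit of Algorithm 4, but the paper's own algorithm and its complexity guarantees may differ in detail.

```python
def upper_envelope(lines):
    """Segment the pointwise maximum over eta >= 0 of lines given as
    (slope, intercept) pairs.  Returns (eta_start, slope, intercept) triples,
    one per linear segment of the envelope, in order of increasing eta_start."""
    lines = sorted(lines)
    dedup = []
    for a, b in lines:                    # among equal slopes keep the largest intercept
        if dedup and dedup[-1][0] == a:
            dedup[-1] = (a, b)
        else:
            dedup.append((a, b))
    hull, starts = [], []                 # envelope lines and where each one takes over
    for a, b in dedup:
        while hull:
            a0, b0 = hull[-1]
            x = (b0 - b) / (a - a0)       # the new (steeper) line overtakes hull[-1] here
            if x <= starts[-1]:
                hull.pop(); starts.pop()  # hull[-1] never attains the maximum on eta >= 0
            else:
                hull.append((a, b)); starts.append(x)
                break
        if not hull:
            hull.append((a, b)); starts.append(0.0)
    return [(s, a, b) for s, (a, b) in zip(starts, hull)]
```

The union of the returned eta_start values over all training examples gives the sorted vector of subdifferentiable points that the line search consumes.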
59 Multiclass Hinge Loss: A variety of multiclass hinge losses have been proposed in the literature that generalize the binary hinge loss, and enforce a margin of separation between the true label zi and every other label. [sent-475, score-0.423]
60 Adapting (30) to the multiclass hinge loss (42) we obtain J(w) := (λ/2) ‖w‖² + (1/n) ∑i max_z [∆(z, zi) + f(w, xi, z) − f(w, xi, zi)]. [sent-480, score-0.314]
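As a concrete reading of this objective, the sketch below evaluates it with a linear model f(w, x, z) = wz⊤x (one weight vector per class, an assumption about the feature map) and a uniform margin ∆(z, zi) = 1 for z ≠ zi.

```python
import numpy as np

def multiclass_hinge_objective(W, X, z, lam):
    """J(W) = (lam/2)||W||^2 + (1/n) sum_i max_z [Delta(z,z_i) + f(W,x_i,z) - f(W,x_i,z_i)]
    with f(W, x, z) = W[z] @ x and a uniform margin Delta = 1 (0 for z = z_i)."""
    n = X.shape[0]
    scores = X @ W.T                           # (n, num_classes) class scores
    delta = np.ones_like(scores)
    delta[np.arange(n), z] = 0.0               # no margin against the true label
    viol = delta + scores - scores[np.arange(n), z][:, None]
    return 0.5 * lam * np.sum(W * W) + viol.max(axis=1).mean()
```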
61 Efficient Multiclass Direction-Finding Oracle: For L2 -regularized risk minimization with multiclass hinge loss, we can use a similar scheme as described in Section 4. [sent-486, score-0.293]
62 Multilabel Hinge Loss: Recently, there has been interest in extending the concept of the hinge loss to multilabel problems. [sent-553, score-0.315]
63 For a uniform margin ∆(z′, z) = τ ∀z′ ≠ z, our multilabel hinge loss (55) reduces to the decoupled version (54), which in turn reduces to the multiclass hinge loss (42) if Zi := {zi} for all i. [sent-557, score-0.579]
64 The subdifferential of the multilabel analogue of the L2 -regularized multiclass objective (43) can then be written just as in (44), with coefficients βi,z := ∑_{z′:(z′,z)∈Zi∗} γ(i)_{z′,z} − ∑_{z′:(z,z′)∈Zi∗} γ(i)_{z,z′}, where (∀i) ∑_{(z,z′)∈Zi∗} γ(i)_{z,z′} = 1 and γ(i)_{z,z′} ≥ 0. [sent-559, score-0.368]
65 3 for the multiclass hinge loss can be extended in a straightforward manner to our multilabel setting. [sent-563, score-0.418]
66 Nonsmooth Convex Optimization: There are four main approaches to nonsmooth convex optimization: quasi-Newton methods, bundle methods, stochastic dual methods, and smooth approximation. [sent-574, score-0.366]
67 Nonsmooth Quasi-Newton Methods: These methods try to find a descent quasi-Newton direction at every iteration, and invoke a line search to minimize the one-dimensional convex function along that direction. [sent-578, score-0.323]
68 In contrast, given a current iterate wt , our direction-finding routine (Algorithm 2) samples subgradients from the set ∂J(wt ) via the oracle. [sent-583, score-0.392]
69 From the optimization viewpoint this objective is very similar to L2 -regularized hinge loss; the direction finding and line search methods that we discussed in Sections 3. [sent-586, score-0.415]
70 Experiments: We evaluated the performance of our subLBFGS algorithm, and compared it to other state-of-the-art nonsmooth optimization methods, on L2 -regularized binary, multiclass, and multilabel hinge loss minimization problems. [sent-634, score-0.547]
71 The subgradient for the construction of the subLBFGS search direction (cf. [sent-652, score-0.3]
72 CPU seconds (bottom row) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel (right) hinge losses. [sent-675, score-0.558]
73 In the case of multiclass and multilabel classification, where the structure of ∂J(w) is more complicated, we can see from Figure 13 (top center and right) that a better search direction can lead to faster convergence in terms of iteration numbers. [sent-680, score-0.423]
74 Figure 14: Performance of subLBFGS with varying buffer size (horizontal axis: CPU seconds) on sample L2 -regularized risk minimization problems with binary (left), multiclass (center), and multilabel hinge losses (right). [sent-693, score-0.466]
75 L2 -Regularized Binary Hinge Loss: For our first set of experiments, we applied subLBFGS with exact line search (Algorithm 3) to the task of L2 -regularized binary hinge loss minimization. [sent-706, score-0.314]
76 CPU seconds on L2 -regularized binary hinge loss minimization tasks. [sent-750, score-0.303]
77 CPU seconds taken to reduce the objective function to within 2% of the optimal value on L2 -regularized binary hinge loss minimization tasks. [sent-764, score-0.376]
78 An oracle that supplies arg sup_{g∈∂J(w)} g⊤p for this objective is easily constructed by noting that (58) is nonsmooth whenever at least one component of the parameter vector w is zero. [sent-775, score-0.318]
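A minimal sketch of such an oracle is given below for an objective of the form J(w) = loss(w) + λ‖w‖1, which is our reading of (58): wherever wj ≠ 0 the L1 term contributes a unique gradient λ sign(wj), and wherever wj = 0 it contributes the interval [−λ, λ], so the supremum of g⊤p is attained by pushing each such coordinate in the sign of pj. The argument grad_loss is an assumed helper holding the gradient of the smooth loss part.

```python
import numpy as np

def l1_arg_sup_subgradient(grad_loss, w, p, lam):
    """Return argsup_{g in dJ(w)} g^T p for J(w) = loss(w) + lam * ||w||_1."""
    g = grad_loss.copy()
    nonzero = w != 0.0
    g[nonzero] += lam * np.sign(w[nonzero])                 # unique subgradient coordinates
    zero = ~nonzero
    g[zero] += lam * np.where(p[zero] >= 0.0, 1.0, -1.0)    # maximize g_j * p_j at w_j = 0
    return g
```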
79 Using the variant of the multiclass and multilabel hinge loss which enforces a uniform margin of separation (∆(z, z′) = 1 ∀z ≠ z′), we experimentally evaluated both algorithms on a number of publicly available data sets (Table 3). [sent-818, score-0.418]
80 CPU seconds on L2 -regularized multiclass hinge loss minimization tasks. [sent-867, score-0.406]
81 CPU seconds in L2 -regularized multilabel hinge loss minimization tasks. [sent-896, score-0.457]
82 The primary reason for this is that the exact line search used by ls-BMRM and subLBFGS requires substantially more computational effort in the multilabel than in the multiclass setting. [sent-905, score-0.41]
83 We proposed subBFGS (and subLBFGS, its limited-memory variant) for handling nonsmooth convex optimization problems, and proved its global convergence in objective function value. [sent-911, score-0.333]
84 We applied our algorithm to a variety of machine learning problems employing the L2 -regularized binary hinge loss and its multiclass and multilabel generalizations, as well as L1 -regularized risk minimization with logistic loss. [sent-912, score-0.473]
85 Similarly, the line search, which is the expensive part of the computation on multiclass and multilabel problems, is easy to parallelize: The slaves run Algorithm 4 on subsets of the data; the results are fed back to the master which can then run Algorithm 5 to compute the step size. [sent-917, score-0.319]
86 We proposed such line searches for the multiclass hinge loss and its extension to the multilabel setting, based on a conceptually simple yet optimal algorithm to segment the pointwise maximum of lines. [sent-934, score-0.515]
87 This direction is a direction of descent as long as the last iterate p(i) fulfills the descent condition (16). [sent-1010, score-0.345]
88 Using ḡt ∈ ∂J(wt) and the definition (7) of subgradient, we have (∀w) J(w) ≥ J(wt) + (w − wt)⊤ḡt = J(wt+1) + (w − wt+1)⊤ḡt + J(wt) − J(wt+1) + (wt+1 − wt)⊤ḡt. [sent-1126, score-0.684]
89 Using wt+1 − wt = −ηt Bt ḡt and (108) gives (∀w) J(w) ≥ J(wt+1) + (w − wt+1)⊤ḡt + (c1 ηt/2) ḡt⊤Bt ḡt − ηt ḡt⊤Bt ḡt = J(wt+1) + (w − wt+1)⊤ḡt − εt′, where εt′ := ηt (1 − c1/2) ḡt⊤Bt ḡt. [sent-1127, score-0.846]
90 Under the third condition of Theorem 14, it follows from the first inequality in (84) in Corollary 4 that sup_{g∈∂J(wt)} g⊤pt ≤ −(1/2) ḡt⊤Bt ḡt, (109) where pt = −Bt ḡt, ḡt ∈ ∂J(wt) is the search direction returned by Algorithm 2. [sent-1143, score-0.781]
91 Finally, to show (107) we use wt+1 − wt = −ηt Bt ḡt, the definition of the matrix norm ‖B‖ := max_{x≠0} ‖Bx‖/‖x‖, and the upper bound on the spectrum of Bt to write ‖wt+1 − wt‖ = ηt ‖Bt ḡt‖ ≤ ηt ‖Bt‖ ‖ḡt‖ ≤ ηt H ‖ḡt‖. [sent-1149, score-0.855]
92 Recall that ḡt = gεt′, at = c1 ηt h/2, and β := c1 h/(2H). [sent-1150, score-0.432]
93 Counterexample for Steepest Descent: The first counterexample (112) is given by Wolfe (1975) to show the non-convergent behaviour of the steepest descent method with an exact line search (denoted GD): f(x, y) := 5√(9x² + 16y²) if x ≥ |y|, and 9x + 16|y| otherwise. [sent-1157, score-0.324]
94 In fact, the second step of subBFGS lands exactly on the hinge x ≤ 0, y = 0, where a subgradient pointing to the optimum is available. [sent-1165, score-0.29]
95 This example (113) shows that steepest subgradient descent with an exact line search (denoted subGD) may not converge to the optimum of a nonsmooth function. [sent-1170, score-0.645]
96 subGD updates its parameters along the steepest descent subgradient direction, which is obtained by solving the min-sup problem (13) with respect to the Euclidean norm. [sent-1173, score-0.291]
97 The zigzagging optimization trajectory of subGD does not allow it to land on any informative position such as the hinge y = 0, where the steepest subgradient descent direction points to the desired region (y < 0); Hiriart-Urruty and Lemaréchal (1993, Section VIII. [sent-1176, score-0.554]
98 Counterexample for BFGS: The final counterexample (114) is given by Lewis and Overton (2008b) to show that the standard BFGS algorithm with an exact line search can break down when encountering a nonsmooth point: f(x, y) := max{2|x| + y, 3y}. [sent-1181, score-0.389]
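For completeness, the two counterexample functions quoted in this appendix, (112) and (114), can be transcribed directly; the square root in (112) follows Wolfe's original example as reconstructed above, and is the only part of these definitions not visible verbatim in the text.

```python
import numpy as np

def wolfe_counterexample(x, y):
    """Equation (112), Wolfe's example; nonsmooth on the half-line y = 0, x <= 0."""
    if x >= abs(y):
        return 5.0 * np.sqrt(9.0 * x**2 + 16.0 * y**2)
    return 9.0 * x + 16.0 * abs(y)

def lewis_overton_counterexample(x, y):
    """Equation (114) from Lewis and Overton (2008b)."""
    return max(2.0 * abs(x) + y, 3.0 * y)
```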
99 This is not surprising because at a nonsmooth point w the quasi-Newton direction p := −Bg for a given subgradient g ∈ ∂J(w) is not necessarily a direction of descent. [sent-1184, score-0.512]
100 Behavior of BFGS with an exact line search on nonsmooth examples. [sent-1321, score-0.354]
wordName wordTfidf (topN-words)
[('sublbfgs', 0.46), ('wt', 0.27), ('subbfgs', 0.241), ('bfgs', 0.222), ('nonsmooth', 0.201), ('bmrm', 0.194), ('onsmooth', 0.159), ('subgradient', 0.155), ('multilabel', 0.154), ('unter', 0.153), ('gt', 0.144), ('subdifferentiable', 0.143), ('pt', 0.135), ('hinge', 0.135), ('cpu', 0.135), ('bt', 0.134), ('ewton', 0.131), ('uasi', 0.131), ('chraudolph', 0.131), ('supg', 0.12), ('ishwanathan', 0.118), ('onvex', 0.114), ('seconds', 0.111), ('ptimization', 0.109), ('pproach', 0.105), ('wolfe', 0.103), ('multiclass', 0.103), ('ocas', 0.093), ('yt', 0.079), ('direction', 0.078), ('descent', 0.078), ('objective', 0.073), ('lbfgs', 0.073), ('ccat', 0.071), ('bundle', 0.07), ('sup', 0.069), ('search', 0.067), ('line', 0.062), ('inex', 0.061), ('steepest', 0.058), ('subgd', 0.055), ('lemar', 0.054), ('mt', 0.052), ('subgradients', 0.051), ('st', 0.051), ('gd', 0.051), ('chal', 0.05), ('zi', 0.05), ('birge', 0.044), ('covertype', 0.042), ('hessian', 0.04), ('piecewise', 0.039), ('convex', 0.038), ('routine', 0.038), ('subdifferential', 0.038), ('pointwise', 0.035), ('counterexample', 0.035), ('hinges', 0.035), ('lines', 0.033), ('iterate', 0.033), ('dual', 0.033), ('curvature', 0.032), ('obeys', 0.031), ('rd', 0.031), ('minimization', 0.031), ('tolerance', 0.03), ('nocedal', 0.029), ('kmax', 0.028), ('sub', 0.028), ('stack', 0.028), ('nding', 0.027), ('primal', 0.027), ('spectrum', 0.027), ('quadratic', 0.026), ('loss', 0.026), ('gao', 0.025), ('sorted', 0.025), ('risk', 0.024), ('exact', 0.024), ('franc', 0.024), ('oracle', 0.024), ('smooth', 0.024), ('overton', 0.022), ('secant', 0.022), ('teo', 0.022), ('wright', 0.022), ('convergence', 0.021), ('iterations', 0.021), ('slopes', 0.021), ('tsochantaridis', 0.021), ('fi', 0.021), ('qk', 0.021), ('inexact', 0.021), ('solid', 0.02), ('supremum', 0.02), ('arg', 0.02), ('inverse', 0.019), ('ow', 0.019), ('buffer', 0.019), ('minp', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph
Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN
2 0.33278844 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le
Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization
3 0.22129446 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
4 0.082962736 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
5 0.064932592 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith
Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields
6 0.061791591 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
7 0.059251495 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
8 0.058461756 44 jmlr-2010-Graph Kernels
9 0.053807184 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
10 0.053299814 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.050688554 72 jmlr-2010-Matrix Completion from Noisy Entries
12 0.050373856 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
13 0.044569433 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.044321802 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
15 0.04347533 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.043382637 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
17 0.038175162 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.037881609 25 jmlr-2010-Composite Binary Losses
19 0.037844293 84 jmlr-2010-On Spectral Learning
20 0.037566043 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
topicId topicWeight
[(0, -0.199), (1, -0.19), (2, 0.24), (3, -0.038), (4, 0.474), (5, 0.04), (6, -0.036), (7, -0.031), (8, -0.07), (9, -0.005), (10, -0.176), (11, -0.048), (12, -0.095), (13, 0.002), (14, 0.012), (15, -0.015), (16, -0.1), (17, 0.116), (18, -0.028), (19, 0.05), (20, 0.052), (21, -0.032), (22, 0.028), (23, -0.062), (24, -0.002), (25, -0.017), (26, 0.007), (27, -0.006), (28, -0.015), (29, -0.04), (30, -0.013), (31, 0.066), (32, -0.038), (33, -0.028), (34, -0.0), (35, -0.069), (36, 0.029), (37, 0.067), (38, 0.02), (39, -0.032), (40, -0.015), (41, 0.024), (42, 0.014), (43, 0.087), (44, -0.056), (45, -0.048), (46, 0.017), (47, 0.025), (48, -0.116), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95236111 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
2 0.92002398 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
3 0.85382402 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
4 0.4533681 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
5 0.29291263 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
6 0.25133345 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
7 0.23316638 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
8 0.23298864 84 jmlr-2010-On Spectral Learning
9 0.22176732 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
10 0.20392899 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
11 0.19707233 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
12 0.19687758 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.19146995 25 jmlr-2010-Composite Binary Losses
14 0.1909034 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
15 0.19002491 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
16 0.16464683 72 jmlr-2010-Matrix Completion from Noisy Entries
17 0.16369344 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
18 0.1608703 44 jmlr-2010-Graph Kernels
19 0.1554832 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
20 0.14907898 60 jmlr-2010-Learnability, Stability and Uniform Convergence
topicId topicWeight
[(1, 0.014), (3, 0.036), (4, 0.016), (8, 0.014), (21, 0.031), (24, 0.015), (32, 0.053), (36, 0.034), (37, 0.045), (75, 0.15), (81, 0.01), (82, 0.327), (85, 0.151), (96, 0.013)]
simIndex simValue paperId paperTitle
1 0.80806518 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
same-paper 2 0.71416879 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
3 0.5725857 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
4 0.56329292 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen
Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation
5 0.5565685 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
Author: Gavin C. Cawley, Nicola L. C. Talbot
Abstract: Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a nonnegligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds. Keywords: model selection, performance evaluation, bias-variance trade-off, selection bias, overfitting
6 0.54165763 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.54159904 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.53830296 102 jmlr-2010-Semi-Supervised Novelty Detection
9 0.53667039 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
10 0.53536427 22 jmlr-2010-Classification Using Geometric Level Sets
11 0.53484952 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
12 0.53352696 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
14 0.532637 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
15 0.52920949 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
16 0.5287677 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.52844524 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.52754796 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.52372909 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.52316236 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM