jmlr jmlr2009 jmlr2009-27 knowledge-graph by maker-knowledge-mining

27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting


Source: pdf

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: subgradient methods, group sparsity, online learning, convex optimization 1. [sent-14, score-0.186]

2 A more general method than the above is the projected gradient method, which iterates w_{t+1} = Π_Ω(w_t − η_t g_t^f) = argmin_{w∈Ω} ‖w − (w_t − η_t g_t^f)‖²₂, where Π_Ω(w) is the Euclidean projection of w onto the set Ω. [sent-32, score-1.282]
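
For reference, a minimal sketch of one such projected-gradient iterate; the choice of Ω as an ℓ2 ball of radius R (and all names here) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection of w onto the set {u : ||u||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_gradient_step(w, g_f, eta, radius):
    """One iterate: w_{t+1} = Pi_Omega(w_t - eta_t * g_t^f)."""
    return project_l2_ball(w - eta * g_f, radius)
```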

3 (1), the iterates are simply w_{t+1} = w_t − η_t g_t^f − η_t g_t^r, where g_t^r ∈ ∂r(w_t). [sent-35, score-0.842]

4 A common problem in subgradient methods is that if r or f is non-differentiable, the iterates of the subgradient method are very rarely at the points of non-differentiability. [sent-36, score-0.513]

5 There is also a body of literature on regret analysis for online learning and online convex programming with convex constraints, which we build upon here (Zinkevich, 2003; Hazan et al. [sent-62, score-0.217]

6 We then extend the results to stochastic gradient descent and provide regret bounds for online convex programming in Sec. [sent-75, score-0.293]

7 Forward-Looking Subgradients and Forward-Backward Splitting: Our approach to Forward-Backward Splitting is motivated by the desire to have the iterates w_t attain points of non-differentiability of the function r. [sent-84, score-0.608]

8 Put informally, F OBOS can be viewed as analogous to the projected subgradient method while replacing or augmenting the projection step with an instantaneous minimization problem for which it is possible to derive a closed form solution. [sent-86, score-0.225]

9 FOBOS is succinct as each iteration consists of the following two steps: w_{t+1/2} = w_t − η_t g_t^f (2), and w_{t+1} = argmin_w { (1/2)‖w − w_{t+1/2}‖²₂ + η_{t+1/2} r(w) } (3). [sent-87, score-0.838]

10 In the above, g_t^f is a vector in ∂f(w_t) and η_t is the step size at time step t of the algorithm. [sent-88, score-0.296]

11 The first step thus simply amounts to an unconstrained subgradient step with respect to the function f . [sent-90, score-0.178]
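
Taken together, the two steps above map directly onto code. The sketch below is a hypothetical rendering of one FOBOS iteration with the second step abstracted behind a user-supplied prox operator; all names are illustrative, and concrete prox operators for specific regularizers appear further below.

```python
import numpy as np

def fobos_step(w, g_f, eta, eta_half, prox_r):
    """One FOBOS iteration.

    g_f    : a subgradient of f at w (used in Eq. 2).
    prox_r : callable returning argmin_u 0.5*||u - v||^2 + scale*r(u) (Eq. 3).
    """
    w_half = w - eta * g_f            # forward (unconstrained subgradient) step
    return prox_r(w_half, eta_half)   # backward (instantaneous minimization) step
```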

12 0 ∈ ∂{ (1/2)‖w − w_{t+1/2}‖² + η_{t+1/2} r(w) }|_{w=w_{t+1}}. Since w_{t+1/2} = w_t − η_t g_t^f, the above property amounts to 0 ∈ w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} ∂r(w_{t+1}). [sent-97, score-1.647]

13 The property 0 ∈ w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} ∂r(w_{t+1}) implies that so long as we choose w_{t+1} to be the minimizer of Eq. [sent-101, score-0.815]

14 (3), we are guaranteed to obtain a vector g_{t+1}^r ∈ ∂r(w_{t+1}) such that 0 = w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} g_{t+1}^r. [sent-102, score-0.815]

15 To recap, we can write w_{t+1} as w_{t+1} = w_t − η_t g_t^f − η_{t+1/2} g_{t+1}^r, (5) where g_t^f ∈ ∂f(w_t) and g_{t+1}^r ∈ ∂r(w_{t+1}). [sent-104, score-1.067]

16 Second, the forward-looking gradient allows us to build on existing analyses and show that the resulting framework enjoys the formal convergence properties of many existing gradient-based and online convex programming algorithms. [sent-108, score-0.196]

17 However, the fact that F OBOS actually employs a forward-looking subgradient of the regularization function lets us build nicely on existing analyses. [sent-111, score-0.189]

18 As we show in the sequel, it is sufficient to set η_{t+1/2} to η_t or η_{t+1}, depending on whether we are doing online or batch optimization, in order to obtain convergence and low regret bounds. [sent-114, score-0.169]

19 In this setting we use the subgradient of f, set η_{t+1/2} = η_{t+1}, and update w_t to w_{t+1} as prescribed by Eq. [sent-116, score-0.721]

20 (2) and (3), then for a constant c ≤ 4 and any vector w⋆, 2η_t(1 − cAη_t) f(w_t) + 2η_{t+1/2}(1 − cAη_{t+1/2}) r(w_{t+1}) ≤ 2η_t f(w⋆) + 2η_{t+1/2} r(w⋆) + ‖w_t − w⋆‖² − ‖w_{t+1} − w⋆‖² + 7η_t η_{t+1/2} G². [sent-133, score-0.563]

21 Note first that for some g_t^f ∈ ∂f(w_t) and g_{t+1}^r ∈ ∂r(w_{t+1}), we have as in Eq. [sent-135, score-0.252]

22 (7) The definition of a subgradient implies that for any g_{t+1}^r ∈ ∂r(w_{t+1}) (and similarly for any g_t^f ∈ ∂f(w_t) with f(w_t) and f(w⋆)) r(w⋆) ≥ r(w_{t+1}) + ⟨g_{t+1}^r, w⋆ − w_{t+1}⟩ ⇒ −⟨g_{t+1}^r, w_{t+1} − w⋆⟩ ≤ r(w⋆) − r(w_{t+1}). [sent-137, score-0.369]

23 (7), we obtain ⟨g_{t+1}^r, w_{t+1} − w_t⟩ = ⟨g_{t+1}^r, −η_t g_t^f − η_{t+1/2} g_{t+1}^r⟩ ≤ ‖η_{t+1/2} g_{t+1}^r + η_t g_t^f‖ ‖g_{t+1}^r‖ ≤ η_{t+1/2} ‖g_{t+1}^r‖² + η_t ‖g_{t+1}^r‖ ‖g_t^f‖ ≤ η_{t+1/2} (A r(w_{t+1}) + G²) + η_t (A max{ f(w_t), r(w_{t+1}) } + G²). [sent-139, score-1.319]

24 (10) We can bound the third term above by noting that ‖η_t g_t^f + η_{t+1/2} g_{t+1}^r‖² = η_t² ‖g_t^f‖² + 2η_t η_{t+1/2} ⟨g_t^f, g_{t+1}^r⟩ + η_{t+1/2}² ‖g_{t+1}^r‖² ≤ η_t² A f(w_t) + 2η_t η_{t+1/2} A max{ f(w_t), r(w_{t+1}) } + η_{t+1/2}² A r(w_{t+1}) + 4η_t² G². [sent-142, score-0.756]

25 Theorem 2 Assume the following hold: (i) the norm of any subgradient from ∂ f and the norm of any subgradient from ∂r are bounded as in Eq. [sent-154, score-0.282]

26 2, assume that the norm of any subgradient from ∂ f and the norm of any subgradient from ∂r are bounded by G. [sent-182, score-0.282]

27 In particular, we can replace g_t^f in the iterates of FOBOS with a stochastic estimate of the gradient, g̃_t^f, where E[g̃_t^f] ∈ ∂f(w_t). [sent-200, score-0.646]

28 The lingering question is thus whether we can guarantee that such a set Ω exists and that our iterates wt remain in Ω. [sent-207, score-0.59]

29 We can easily project onto this box by truncating elements of wt lying outside it at any iteration without affecting the bounds in Eq. [sent-212, score-0.563]
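
A minimal sketch of this truncation, assuming a hypothetical symmetric box [−B, B]^d:

```python
import numpy as np

def truncate_to_box(w, bound):
    """Project w onto the box {u : |u_i| <= bound} by coordinate-wise clipping."""
    return np.clip(w, -bound, bound)
```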

30 The learning goal is for the sequence of predictions wt to attain low regret when compared to a single optimal predictor w⋆ . [sent-226, score-0.66]

31 a fixed predictor w⋆ while using a regularization function r is R_{f+r}(T) = Σ_{t=1}^T [ f_t(w_t) + r(w_t) − ( f_t(w⋆) + r(w⋆) ) ]. [sent-237, score-0.212]

32 Theorem 6 Assume that ‖w_t − w⋆‖ ≤ D for all iterations and the norms of the subgradient sets ∂f_t and ∂r are bounded above by G. [sent-243, score-0.774]

33 (2006), with the projected gradient method and strongly convex functions, it is possible to achieve regret on the order of O(log T ) by using the curvature of the sequence of functions ft rather than simply using convexity and linearity as in Theorems 2 and 6. [sent-251, score-0.346]

34 A function f is called H-strongly convex if f(w) ≥ f(w_t) + ⟨∇f(w_t), w − w_t⟩ + (H/2)‖w − w_t‖². [sent-254, score-1.159]

35 Thus, if r and the sequence of functions f_t are strongly convex with constants H_f ≥ 0 and H_r ≥ 0, we have H = H_f + H_r, and H-strong convexity gives f_t(w_t) − f_t(w⋆) + r(w_t) − r(w⋆) ≤ ⟨g_t^f + g_t^r, w_t − w⋆⟩ − (H/2)‖w_t − w⋆‖². [sent-255, score-1.566]

36 Theorem 8 Assume as in Theorem 6 that ‖w_t − w⋆‖ ≤ D and that ∂f_t and ∂r are bounded above by G. [sent-264, score-0.633]

37 Let r(w) = (λ/2)‖w‖², and the iterates w_t generated by FOBOS satisfy ‖w_t‖ ≤ G/λ. [sent-271, score-1.153]

38 Then ‖w⋆‖ ≤ G/λ. Using the regret analysis for online learning, we are able to return to learning in a batch setting and give stochastic convergence rates for FOBOS. [sent-273, score-0.205]

39 Indeed, suppose that on each step of FOBOS, we choose, instead of some g_t^f ∈ ∂f(w_t), a stochastic estimate of the gradient g̃_t^f where E[g̃_t^f] ∈ ∂f(w_t). [sent-277, score-0.641]

40 To simplify our derivations, we denote by v the vector w_{t+1/2} = w_t − η_t g_t^f and let λ̃ denote η_{t+1/2} · λ. [sent-295, score-0.563]

41 This update can also be implemented very efficiently when the support of g_t^f is small (Langford et al. [sent-328, score-0.293]

42 Differentiating the above objective and setting the result equal to zero, we have w⋆ − v + λ̃w⋆ = 0, which, using the original notation, yields the update w_{t+1} = (w_t − η_t g_t^f) / (1 + λ̃). [sent-333, score-0.878]
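
A one-line sketch of this closed-form ℓ2²-regularized step (the function and argument names are illustrative):

```python
import numpy as np

def fobos_step_l2_squared(w, g_f, eta, lam_tilde):
    """FOBOS step with r(w) = (lambda/2)*||w||_2^2: shrink the gradient step by 1/(1 + lam_tilde)."""
    return (w - eta * g_f) / (1.0 + lam_tilde)
```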

43 7 we briefly compare the resulting F OBOS update to modern stochastic gradient techniques and show that the F OBOS update exhibits similar empirical behavior. [sent-336, score-0.197]

44 Summarizing, the second step of the FOBOS update with ℓ2 regularization amounts to w_{t+1} = [1 − λ̃/‖w_{t+1/2}‖]_+ w_{t+1/2} = [1 − λ̃/‖w_t − η_t g_t^f‖]_+ (w_t − η_t g_t^f). [sent-354, score-1.219]

45 Thus, the ℓ2 regularization results in a zero weight vector under the condition that ‖w_t − η_t g_t^f‖ ≤ λ̃. [sent-355, score-0.904]

46 This condition is rather more stringent for sparsity than the condition for ℓ1 (where a weight is set to zero based only on its own value, while here sparsity happens only if the entire weight vector has ℓ2-norm less than λ̃), so it is unlikely to hold in high dimensions. [sent-356, score-0.272]
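
A sketch of the two proximal steps being contrasted here, assuming v = w_{t+1/2}; the ℓ2 case shrinks (or zeroes) the whole vector at once, while the ℓ1 case soft-thresholds each coordinate independently:

```python
import numpy as np

def prox_l2_norm(v, lam_tilde):
    """Block shrinkage [1 - lam_tilde/||v||_2]_+ * v; the result is zero iff ||v||_2 <= lam_tilde."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= lam_tilde else (1.0 - lam_tilde / norm) * v

def prox_l1(v, lam_tilde):
    """Soft-thresholding sign(v_i) * [|v_i| - lam_tilde]_+ applied coordinate-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - lam_tilde, 0.0)
```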

47 In words, the components of the vector w_{t+1} are the result of capping all of the components of w_{t+1/2} at θ, where θ is zero when the 1-norm of w_{t+1/2} is smaller than λ̃. [sent-395, score-0.563]
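
One way to realize this capping is via the Moreau decomposition: the proximal step for λ̃‖·‖∞ equals v minus the Euclidean projection of v onto the ℓ1 ball of radius λ̃. The sketch below uses the standard sort-based ℓ1-ball projection and is an illustrative implementation, not the paper's code.

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto {u : ||u||_1 <= z} (sort-and-threshold)."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                              # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv - z)[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, lam_tilde):
    """Prox of lam_tilde*||.||_inf: caps each |v_i| at a data-dependent threshold theta."""
    return v - project_l1_ball(v, lam_tilde)
```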

48 Similar to ℓ1 regularization, Berhu (for inverse Huber) regularization results in sparse solutions, but its hybridization with ℓ2 regularization prevents the weights from being excessively large. [sent-402, score-0.168]

49 Efficient Implementation in High Dimensions: In many settings, especially online learning, the weight vector w_t and the gradients g_t^f reside in a very high-dimensional space, but only a relatively small number of the components of g_t^f are non-zero. [sent-450, score-1.12]

50 The need to cope with gradient sparsity becomes further pronounced in mixed-norm problems, as a single component of the gradient may correspond to an entire row of W . [sent-452, score-0.265]
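
For the ℓ1/ℓ2 mixed norm the proximal step decomposes over the rows of W, each row undergoing the same block shrinkage as in the plain ℓ2 case. A sketch, assuming one group per row:

```python
import numpy as np

def prox_l1_l2(W, lam_tilde):
    """Row-wise group shrinkage for the l1/l2 mixed norm (sum of l2 norms of the rows of W)."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam_tilde / np.maximum(row_norms, 1e-12), 0.0)
    return scale * W
```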

51 Updating the entire matrix because a few entries of g_t^f are non-zero is clearly undesirable. [sent-453, score-0.252]

52 The upshot of the following proposition is that when g_t^f is sparse, we can use lazy evaluation of the weight vectors and defer to later rounds the update of components of w_t whose corresponding gradient entries are zero. [sent-456, score-0.987]

53 The algorithmic consequence of Proposition 11 is that it is possible to perform a lazy update on each iteration by omitting the terms of w_t (or whole rows of the matrix W_t when using mixed-norms) that are outside the support of g_t^f, the gradient of the loss at iteration t. [sent-585, score-0.97]

54 The advantage of the lazy evaluation is pronounced when using mixed-norm regularization, as it lets us avoid updating entire rows so long as the row index corresponds to a zero entry of the gradient g_t^f. [sent-589, score-0.421]

55 In this way, we can get O(s) updates of w when the gradient g_t^f is s-sparse, that is, it has only s non-zero components. [sent-595, score-0.352]
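
A sketch of the lazy-update idea for the ℓ1 case, relying on the fact that consecutive soft-thresholding steps applied to an untouched coordinate compose into a single soft-threshold with the accumulated threshold. The bookkeeping below (a running cumulative threshold plus a per-coordinate snapshot taken at its last touch) is illustrative, not the paper's pseudocode.

```python
import numpy as np

class LazyL1Fobos:
    """FOBOS with l1 regularization and deferred shrinkage for sparse gradients."""

    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.cum_thr = 0.0                  # sum of all thresholds eta_t * lam applied so far
        self.seen_thr = np.zeros(dim)       # value of cum_thr when each coordinate was last touched

    def _catch_up(self, idx):
        # Apply, in one shot, all shrinkage these coordinates missed since their last update.
        pending = self.cum_thr - self.seen_thr[idx]
        self.w[idx] = np.sign(self.w[idx]) * np.maximum(np.abs(self.w[idx]) - pending, 0.0)
        self.seen_thr[idx] = self.cum_thr

    def step(self, idx, grad_vals, eta, lam):
        """idx: indices where g_t^f is non-zero; grad_vals: the corresponding gradient entries."""
        self._catch_up(idx)
        self.w[idx] -= eta * grad_vals      # forward step, restricted to the support
        thr = eta * lam
        self.w[idx] = np.sign(self.w[idx]) * np.maximum(np.abs(self.w[idx]) - thr, 0.0)
        self.cum_thr += thr                 # untouched coordinates will catch up lazily later
        self.seen_thr[idx] = self.cum_thr

    def finalize(self):
        self._catch_up(np.arange(len(self.w)))
        return self.w
```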

56 To that end, we also evaluate specific instantiations of F OBOS with respect to several state-of-the-art optimizers and projected subgradient methods on different learning problems. [sent-598, score-0.187]

57 We compared different stochastic gradient estimators that were based on varying sample sizes: a single example, 10% of the training set, 20% of the training set, and deterministic gradient using the entire data set. [sent-624, score-0.22]
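
A sketch of such an estimator, assuming a binary logistic loss; the loss choice and all names are illustrative, and frac=1.0 recovers the deterministic full gradient.

```python
import numpy as np

def logistic_gradient(w, X, y):
    """Average gradient of log(1 + exp(-y_i * x_i.w)) over the rows of X (labels y_i in {-1, +1})."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))
    return (X * coeff[:, None]).mean(axis=0)

def stochastic_gradient_estimate(w, X, y, frac, rng):
    """Gradient estimate computed from a uniformly random fraction of the training set."""
    n = X.shape[0]
    m = max(1, int(frac * n))
    idx = rng.choice(n, size=m, replace=False)
    return logistic_gradient(w, X[idx], y[idx])
```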

58 In all experiments with a regularization of the form (λ/2)‖w‖², we used step sizes of the form η_t = 1/(λt) to achieve the logarithmic regret bound of Sec. [sent-714, score-0.213]
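
A sketch of this schedule driving the ℓ2²-regularized FOBOS step (the loop structure and names are placeholders):

```python
import numpy as np

def run_fobos_l2_squared(grad_f, dim, lam, num_steps):
    """FOBOS with r(w) = (lam/2)*||w||^2 and the step sizes eta_t = 1/(lam*t)."""
    w = np.zeros(dim)
    for t in range(1, num_steps + 1):
        eta = 1.0 / (lam * t)
        w = (w - eta * grad_f(w)) / (1.0 + lam * eta)   # gradient step followed by the l2^2 prox
    return w
```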

59 In our experiments, S PA RSA seemed to outperform F OBOS and the projected gradient methods when using full (deterministic) gradient information. [sent-749, score-0.228]

60 We therefore do not report results for the deterministic versions of F OBOS and projected gradient methods to avoid clutter in the figures. [sent-751, score-0.175]

61 From the plots, we see that stochastic F OBOS and projected gradient actually perform very well on the more correlated data, very quickly getting to within 10−2 of the optimal value, though after this they essentially jam. [sent-789, score-0.216]

62 From the figures, it is apparent that the stochastic methods, both F OBOS and projected gradient, exhibit very good initial performance but eventually lose to the coordinate descent method in terms of optimization speed. [sent-826, score-0.187]

63 As before, if one is willing to use full gradient information, S PA RSA seems a better choice than the deterministic counterpart of F OBOS and projected gradient algorithms. [sent-827, score-0.254]

64 We used ℓ1 /ℓ2 regularization and compared F OBOS, S PA RSA, coordinate descent, and projected gradient methods on this test (as well as stochastic gradient versions of F OBOS and projected gradient). [sent-852, score-0.451]

65 The results for deterministic F OBOS and projected gradient were similar to S PA RSA, so we do not present them. [sent-853, score-0.175]

66 As before, we used the ℓ1 /ℓ2 norm of the solution vectors for F OBOS, S PA RSA, and coordinate descent as the constrained value for the projected gradient method. [sent-855, score-0.232]

67 For each of the gradient methods, we estimated the diameter D and maximum gradient G as in the synthetic experiments, which led us to use a step size of η_t = 30/√t. [sent-856, score-0.198]

68 8, it is clear that the stochastic gradient methods (F OBOS and projected gradient) were significantly faster than any of the other methods. [sent-860, score-0.185]

69 Before the coordinate descent method has even visited every coordinate once, stochastic F OBOS (and similarly stochastic projected gradient) have attained the minimal test-set error. [sent-861, score-0.245]

70 We now change our focus from training time to the attained sparsity levels for multiclass classification with ℓ1 , ℓ1 /ℓ2 , and ℓ1 /ℓ∞ regularization on MNIST and the StatLog LandSat data set. [sent-906, score-0.212]

71 10, we show the test set error and sparsity level of W as a function of training time (100 times the number of single-example gradient calculations) for the ℓ1 -regularized multiclass logistic loss with 720 training examples. [sent-909, score-0.282]

72 However, the deterministic version increases the level of sparsity throughout its run, while the stochastic-gradient version has highly variable sparsity levels and does not give solutions as sparse as the deterministic counterpart. [sent-915, score-0.29]

73 We saw similar behavior when using stochastic versus deterministic projected gradient methods. [sent-916, score-0.211]

74 As yet, we do not have a compelling justification for the difference in sparsity levels between stochastic and deterministic gradient F OBOS. [sent-918, score-0.248]

75 Intuitively, then, we expect that stochastic gradient descent will result in more non-zero coefficients than the deterministic variant of F OBOS. [sent-930, score-0.171]

76 We simulated updates to a weight vector w with a sparse gradient g_t^f for different dimensions d of w ∈ R^d and different sparsity levels s for g_t^f with card(g_t^f) = s. [sent-939, score-0.752]

77 Figure 11 (series: True, ℓ1/ℓ2, ℓ1, ℓ1/ℓ∞; x-axis: Fobos Steps, 0–100): The sparsity patterns attained by FOBOS using mixed-norm and ℓ1 regularization for a multiclass logistic regression problem. [sent-994, score-0.257]

78 11 we illustrate the sparsity pattern (far left) of the weight vector that generated the data and color-coded maps of the sparsity patterns learned using F OBOS with the different regularization schemes. [sent-1010, score-0.303]

79 11, we see that both ℓ1 /ℓ2 and ℓ1 /ℓ∞ were capable of zeroing entire rows of parameters and often learned a sparsity pattern that was close to the sparsity pattern of the matrix that was used to generate the examples. [sent-1015, score-0.214]

80 The results demonstrate that F OBOS quickly selects the sparsity pattern of w⋆ , and the level of sparsity persists throughout its execution. [sent-1025, score-0.214]

81 Table 3: MNIST classification error versus sparsity. For comparison of the different regularization approaches, we report in Table 2 and Table 3 the test set error as a function of row sparsity of the learned matrix W. [sent-1053, score-0.302]

82 Specifically, we described both offline convergence rates for arbitrary convex functions with regularization as well as regret bounds for online convex programming of regularized losses. [sent-1086, score-0.286]

83 Our approach provides a simple mechanism for solving online convex programs with many regularization functions, giving sparsity in parameters and different types of block or group regularization straightforwardly. [sent-1089, score-0.32]

84 We assume the generalized update would, loosely speaking, be analogous to nonlinear projected subgradient methods and the mirror descent (see, e. [sent-1097, score-0.258]

85 We believe the attainment and rate of sparsity when using stochastic gradient information, as suggested by the discussion of Fig. [sent-1101, score-0.222]

86 Online Regret Proofs, Proof of Theorem 6: Looking at Lemma 1, we immediately see that if ∂f and ∂r are bounded by G, f_t(w_t) − f_t(w⋆) + r(w_{t+1}) − r(w⋆) ≤ (1/(2η_t)) [ ‖w_t − w⋆‖² − ‖w_{t+1} − w⋆‖² ] + (7/2) G² η_t. [sent-1107, score-0.703]

87 (36) to obtain that R_{f+r}(T) = Σ_{t=1}^T ( f_t(w_t) − f_t(w⋆) + r(w_{t+1}) − r(w⋆) ) + r(w_1) − r(w_{T+1}) ≤ GD + Σ_{t=1}^T (1/(2η_t)) [ ‖w_t − w⋆‖² − ‖w_{t+1} − w⋆‖² ] + (7G²/2) Σ_{t=1}^T η_t, since r(w) ≤ r(0) + G‖w‖ ≤ GD. [sent-1109, score-0.703]

88 We can rewrite the above bound and see R_{f+r}(T) ≤ GD + (1/(2η_1))‖w_1 − w⋆‖² + (1/2) Σ_{t=2}^T ‖w_t − w⋆‖² (1/η_t − 1/η_{t−1}) + (7G²/2) Σ_{t=1}^T η_t ≤ GD + D²/(2η_1) + (D²/2) Σ_{t=2}^T (1/η_t − 1/η_{t−1}) + (7G²/2) Σ_{t=1}^T η_t, where we used again the bound on the distance of each w_t to w⋆ for the last inequality. [sent-1110, score-1.126]

89 (17) from t = 1 to T and get R_{f+r}(T) ≤ Σ_{t=1}^T [ ⟨g_t^f + g_t^r, w_t − w⋆⟩ − (H/2)‖w_t − w⋆‖² ] ≤ 2GD + (1/2) Σ_{t=1}^{T−1} [ (1/η_t)( ‖w_t − w⋆‖² − ‖w_{t+1} − w⋆‖² ) − H‖w_t − w⋆‖² ] + (7G²/2) Σ_{t=1}^{T−1} η_t ≤ 2GD + (1/2) Σ_{t=2}^{T−1} ‖w_t − w⋆‖² (1/η_t − 1/η_{t−1} − H) + (1/(2η_1))‖w_1 − w⋆‖² + (7G²/2) Σ_{t=1}^{T−1} η_t. [sent-1116, score-3.067]

90 For the second part, assume that w_0 = 0, and for our induction that w_t satisfies ‖w_t‖ ≤ G/λ. [sent-1122, score-0.563]

91 (20), ‖w_{t+1}‖ = ‖(w_t − η_t g_t^f)/(1 + λη_t)‖ ≤ (‖w_t‖ + η_t‖g_t^f‖)/(1 + λη_t) ≤ (G/λ + η_t G)/(1 + λη_t) = G(1 + λη_t)/(λ(1 + λη_t)) = G/λ. [sent-1124, score-0.878]

92 (38) Before proceeding with the proof of fast convergence rate, we would like to underscore the equivalence of the F OBOS update and the composite gradient mapping. [sent-1155, score-0.169]

93 Formally, minimizing m(v, w) with respect to w is completely equivalent to taking a F OBOS step with η = 1/L and v = wt . [sent-1156, score-0.585]

94 (38) we simply need to divide m(v, w) by L = 1/η, omit terms that solely depend on v = w_t, and use the fact that w_{t+1/2} = w_t − η_t g_t^f = v − ∇f(v)/L. [sent-1158, score-1.378]

95 (40) Now we consider the change in function value from wt to wt+1 for the F OBOS update with η = 1/L. [sent-1164, score-0.604]

96 To do this, we take an arbitrary optimal point w⋆ and restrict wt+1 to lie on the line between wt and w⋆ , which constrains the set of infimum values above and allows us to carefully control them. [sent-1165, score-0.563]

97 (39) and (40) we get that φ(w_{t+1}) ≤ inf_w { φ(w) + (L/2)‖w − w_t‖² } ≤ inf_{α∈[0,1]} { φ(αw⋆ + (1 − α)w_t) + (L/2)‖αw⋆ + (1 − α)w_t − w_t‖² } ≤ inf_{α∈[0,1]} { αφ(w⋆) + (1 − α)φ(w_t) + (α²L/2)‖w⋆ − w_t‖² }. [sent-1167, score-1.734]

98 Thus, all iterates of the method satisfy ‖w_t − w⋆‖ ≤ D. [sent-1172, score-0.59]

99 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-1189, score-0.25]

100 Accelerated projected gradient method for linear inverse problems with sparsity constraints. [sent-1228, score-0.256]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('obos', 0.651), ('wt', 0.563), ('gt', 0.252), ('subgradient', 0.117), ('sparsity', 0.107), ('inger', 0.101), ('uchi', 0.101), ('orward', 0.095), ('plitting', 0.095), ('rsa', 0.095), ('regret', 0.079), ('gradient', 0.079), ('regularization', 0.072), ('ft', 0.07), ('projected', 0.07), ('fficient', 0.067), ('duchi', 0.064), ('fobos', 0.062), ('berhu', 0.056), ('backward', 0.053), ('nesterov', 0.045), ('logistic', 0.045), ('update', 0.041), ('stochastic', 0.036), ('online', 0.036), ('operations', 0.035), ('subgradients', 0.034), ('pa', 0.034), ('sparsa', 0.034), ('stoch', 0.034), ('convergence', 0.033), ('convex', 0.033), ('mnist', 0.032), ('dictionary', 0.031), ('correlated', 0.031), ('lipschitz', 0.031), ('landsat', 0.03), ('descent', 0.03), ('coordinate', 0.029), ('hazan', 0.029), ('pegasos', 0.028), ('iterates', 0.027), ('ws', 0.027), ('tseng', 0.026), ('minimize', 0.026), ('deterministic', 0.026), ('earning', 0.024), ('norms', 0.024), ('sparse', 0.024), ('norm', 0.024), ('splitting', 0.023), ('argmin', 0.023), ('coord', 0.022), ('obozinski', 0.022), ('sentiment', 0.022), ('slackness', 0.022), ('step', 0.022), ('initial', 0.022), ('objective', 0.022), ('merits', 0.021), ('daubechies', 0.021), ('updates', 0.021), ('dual', 0.021), ('batch', 0.021), ('langford', 0.021), ('sizes', 0.02), ('experiments', 0.02), ('sequel', 0.019), ('wright', 0.019), ('sign', 0.019), ('complimentary', 0.019), ('patches', 0.019), ('synthetic', 0.018), ('reconstruction', 0.018), ('plot', 0.018), ('lazy', 0.018), ('attain', 0.018), ('multiclass', 0.018), ('hinge', 0.018), ('regularizer', 0.018), ('nonetheless', 0.018), ('defer', 0.017), ('zhao', 0.017), ('amounts', 0.017), ('weight', 0.017), ('meier', 0.017), ('proj', 0.017), ('loss', 0.017), ('qualitatively', 0.016), ('moderately', 0.016), ('test', 0.016), ('projection', 0.016), ('ht', 0.016), ('composite', 0.016), ('sensitivity', 0.016), ('uncorrelated', 0.015), ('inf', 0.015), ('convexity', 0.015), ('attained', 0.015), ('analyses', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

2 0.4057714 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari

Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent

3 0.16705456 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

4 0.13590513 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

Author: Vojtěch Franc, Sören Sonnenburg

Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization

5 0.11612451 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

6 0.10651518 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

7 0.10077687 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

8 0.10017141 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

9 0.077604167 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

10 0.073927164 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

11 0.064580671 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

12 0.059574205 90 jmlr-2009-Structure Spaces

13 0.056274295 13 jmlr-2009-Bounded Kernel-Based Online Learning

14 0.054157402 49 jmlr-2009-Learning Permutations with Exponential Weights

15 0.053099103 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

16 0.044573151 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

17 0.042695664 67 jmlr-2009-Online Learning with Sample Path Constraints

18 0.035597149 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

19 0.035047948 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

20 0.03439831 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.276), (1, 0.215), (2, 0.015), (3, -0.208), (4, 0.464), (5, 0.325), (6, -0.092), (7, -0.145), (8, 0.039), (9, 0.067), (10, -0.108), (11, -0.093), (12, -0.089), (13, 0.101), (14, 0.043), (15, 0.006), (16, -0.019), (17, 0.031), (18, -0.061), (19, 0.043), (20, -0.014), (21, -0.042), (22, -0.021), (23, 0.031), (24, -0.056), (25, -0.04), (26, -0.011), (27, -0.008), (28, 0.009), (29, -0.014), (30, -0.012), (31, -0.021), (32, -0.021), (33, 0.005), (34, 0.039), (35, -0.008), (36, 0.038), (37, -0.017), (38, 0.007), (39, -0.006), (40, 0.005), (41, -0.051), (42, -0.013), (43, -0.01), (44, 0.011), (45, -0.026), (46, -0.054), (47, -0.058), (48, -0.033), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97520584 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

2 0.91758019 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari

Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent

3 0.67625111 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

Author: Vojtěch Franc, Sören Sonnenburg

Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization

4 0.54152727 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

5 0.42567644 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér

Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics

6 0.32628697 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

7 0.27037206 23 jmlr-2009-Discriminative Learning Under Covariate Shift

8 0.25778407 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

9 0.23013304 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

10 0.22832911 49 jmlr-2009-Learning Permutations with Exponential Weights

11 0.21687014 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

12 0.18760733 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

13 0.16948389 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

14 0.16281535 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

15 0.16200712 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

16 0.154888 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

17 0.15389207 67 jmlr-2009-Online Learning with Sample Path Constraints

18 0.15016906 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

19 0.14239004 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

20 0.14087702 13 jmlr-2009-Bounded Kernel-Based Online Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (11, 0.037), (19, 0.021), (21, 0.013), (26, 0.029), (38, 0.036), (42, 0.011), (47, 0.023), (52, 0.053), (55, 0.054), (58, 0.025), (66, 0.146), (68, 0.335), (90, 0.074), (96, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96904963 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

Author: Holger Höfling, Robert Tibshirani

Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iterations moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables

2 0.89253116 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, Lawrence K. Saul

Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines

same-paper 3 0.83412606 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

4 0.64563662 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection

Author: Jie Chen, Haw-ren Fang, Yousef Saad

Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn2 ) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dnt ) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method

5 0.63735223 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

6 0.61036175 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

7 0.58745801 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

8 0.58724457 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

9 0.56622469 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

10 0.5652622 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

11 0.55958843 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

12 0.55526078 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

13 0.55440104 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

14 0.55348825 38 jmlr-2009-Hash Kernels for Structured Data

15 0.55237633 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

16 0.5420441 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

17 0.54182255 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

18 0.53831315 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

19 0.53725421 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

20 0.53602988 13 jmlr-2009-Bounded Kernel-Based Online Learning