jmlr jmlr2010 jmlr2010-31 knowledge-graph by maker-knowledge-mining

31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization


Source: pdf

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. [sent-7, score-0.44]

2 Keywords: stochastic learning, online optimization, ℓ1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods. [sent-9, score-0.413]

3 In this paper, we develop a new class of online algorithms, the regularized dual averaging (RDA) methods, that can exploit the regularization structure more effectively in an online setting. [sent-15, score-0.365]

4 We also assume that f (w, z) is convex in w for each z, and it is subdifferentiable (a subgradient always exists) on dom Ψ. [sent-21, score-0.323]

5 Suppose at time t, we have the most up-to-date weight vector wt . [sent-55, score-0.512]

6 Whenever zt is available, we can evaluate the loss f (wt , zt ), and also a subgradient gt ∈ ∂ f (wt , zt ) (here ∂ f (w, z) denotes the subdifferential of f (w, z) with respect to w). [sent-56, score-0.447]

7 The SGD method takes the form wt+1 = ΠC (wt − αt (gt + ξt)), (3) where αt is an appropriate stepsize, ξt is a subgradient of Ψ at wt, and ΠC (·) denotes Euclidean projection onto the set C. [sent-60, score-1.137]
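
To make the update concrete, here is a minimal Python sketch of one such step; the projection operator and the constraint set C are placeholders, and the Euclidean-ball projection is used purely for illustration:

    import numpy as np

    def sgd_step(w, g, xi, alpha, project):
        """One step of (3): w_{t+1} = Pi_C(w_t - alpha_t * (g_t + xi_t))."""
        return project(w - alpha * (g + xi))

    def project_ball(w, radius=10.0):
        """Euclidean projection onto a ball of the given radius (one possible C)."""
        norm = np.linalg.norm(w)
        return w if norm <= radius else (radius / norm) * w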

8 In a stochastic online setting, each weight vector wt is a random variable that depends on {z1 , . [sent-63, score-0.654]

9 In this paper, we develop regularized dual averaging (RDA) methods that can exploit the structure of (1) more effectively in a stochastic online setting. [sent-95, score-0.295]

10 Essentially, at each iteration, this method minimizes the sum of three terms: a linear function obtained by averaging all previous subgradients (the dual average), the original regularization function Ψ(w), and an additional strongly convex regularization term (βt /t)h(w). [sent-97, score-0.419]

11 2 Regularized Online Optimization In online optimization, we use an online algorithm to generate a sequence of decisions wt, one at a time, for t = 1, 2, 3, . . . [sent-109, score-0.69]

12 The regret with respect to any fixed decision w ∈ dom Ψ is Rt(w) = Σ_{τ=1}^t (fτ(wτ) + Ψ(wτ)) − Σ_{τ=1}^t (fτ(w) + Ψ(w)). (5) [sent-121, score-0.305]

13 As in the stochastic setting, the online algorithm can query a subgradient gt ∈ ∂ft(wt) at each step, and possibly use all previous information, to compute the next decision wt+1. [sent-122, score-0.502]

14 It turns out that the simple subgradient method (3) is well suited for online optimization: with a stepsize αt = Θ(1/√t), it has a regret Rt(w) ≤ O(√t) for all w ∈ dom Ψ (Zinkevich, 2003). [sent-123, score-0.512]

15 However, if the cost functions are strongly convex, say with convexity parameter σ, then the same algorithm with stepsize αt = 1/(σt) gives an O(ln t) regret bound (e. [sent-125, score-0.325]

16 Similar to the discussions on regularized stochastic learning, the online subgradient method (3) in general lacks the capability of exploiting the regularization structure. [sent-130, score-0.355]

17 In this paper, we show that the same RDA method (4) can effectively exploit such structure in an online setting, and ensure the O(√t) regret bound with βt = Θ(√t). [sent-131, score-0.287]

18 For strongly convex regularizations, setting βt = O(ln t) yields the improved regret bound O(ln t). [sent-132, score-0.312]

19 In this paper, we will first establish regret bounds of the RDA method for solving online optimization problems, then use them to derive convergence rates for solving stochastic learning problems. [sent-134, score-0.456]

20 In Section 3, we present the precise regret bounds of the RDA method for solving regularized online optimization problems. [sent-140, score-0.335]

21 If h is strongly convex with modulus σ, then for any w ∈ dom h and u ∈ rint(dom h), h(w) ≥ h(u) + ⟨s, w − u⟩ + (σ/2)‖w − u‖², ∀ s ∈ ∂h(u). [sent-159, score-0.326]

22 Then the negative entropy function h(w) = Σ_{i=1}^n w^(i) ln w^(i) + ln n, (6) with dom h = Sn, is strongly convex with respect to ‖·‖₁ with modulus 1 (see, e. [sent-166, score-0.365]
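
For intuition, when Ψ is the indicator of the simplex Sn and h is this negative entropy, the RDA minimization (4) has the familiar exponentiated-weights closed form w^(i) ∝ exp(−(t/βt) ḡt^(i)). A sketch of this standard special case (a derivation assumed here, not a verbatim excerpt from the paper):

    import numpy as np

    def rda_entropy_step(g_bar, t, beta_t):
        """argmin over the simplex of <g_bar, w> + (beta_t/t) * h(w),
        with h the negative entropy: the solution has
        w_i proportional to exp(-(t/beta_t) * g_bar_i)."""
        z = -(t / beta_t) * g_bar
        z -= z.max()          # shift by the max for numerical stability
        w = np.exp(z)
        return w / w.sum()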

23 Algorithm 1 Regularized dual averaging (RDA) method input: • an auxiliary function h(w) that is strongly convex on dom Ψ and also satisfies arg min_w h(w) ∈ Arg min_w Ψ(w). [sent-175, score-0.469]

24 Given the function ft , compute a subgradient gt ∈ ∂ ft (wt ). [sent-182, score-0.405]

25 Update the average subgradient: ḡt = ((t − 1)/t) ḡt−1 + (1/t) gt. [sent-184, score-0.446]

26 At the input to the RDA method, we need an auxiliary function h that is strongly convex on dom Ψ. [sent-191, score-0.286]

27 Step 1 is to compute a subgradient of ft at wt , which is standard for all subgradient or gradient based methods. [sent-198, score-0.767]

28 Step 2 is the online version of computing the average subgradient: ḡt = (1/t) Σ_{τ=1}^t gτ. [sent-199, score-0.303]
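
Assembling the steps, a minimal sketch of the full RDA loop; the step-3 minimization is left to a caller-supplied oracle because its closed form depends on the particular Ψ and h (the oracle names here are hypothetical):

    import numpy as np

    def rda(subgrad, rda_min, w0, T, beta):
        """Schematic RDA loop (Algorithm 1).
        subgrad(w, t) returns a subgradient g_t of f_t at w;
        rda_min(g_bar, b, t) solves argmin_w <g_bar, w> + Psi(w) + (b/t)*h(w)."""
        w = w0
        g_bar = np.zeros_like(w0)
        for t in range(1, T + 1):
            g = subgrad(w, t)                      # step 1: subgradient
            g_bar = ((t - 1) * g_bar + g) / t      # step 2: dual average
            w = rda_min(g_bar, beta(t), t)         # step 3: next iterate
        return w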

29 1 RDA Methods with General Convex Regularization For a general convex regularization Ψ, we can choose any positive sequence {βt}t≥1 that is of order exactly √t, to obtain an O(1/√t) convergence rate for stochastic learning, or an O(√t) regret bound for online optimization. [sent-208, score-0.495]

30 Regret Bounds for Online Optimization In this section, we give the precise regret bounds of the RDA method for solving regularized online optimization problems. [sent-265, score-0.335]

31 The convergence rates for stochastic learning problems can be established based on these regret bounds, and will be given in the next section. [sent-266, score-0.293]

32 • The function h(w) is strongly convex on dom Ψ, and subdifferentiable on rint (dom Ψ). [sent-271, score-0.319]

33 (12) This is true, for example, if dom Ψ is compact and each ft has a Lipschitz-continuous gradient on dom Ψ. [sent-280, score-0.355]

34 In Theorem 1, the bounds on ‖wt+1 − w‖² and ‖ḡt‖∗ depend on the regret Rt(w). [sent-293, score-0.519]

35 A smaller slack is equivalent to a larger regret Rt(w), which means w is a better fixed solution for the online optimization problem (the best one gives the largest regret); correspondingly, the inequality (16) gives a tighter bound on ‖wt+1 − w‖². [sent-295, score-0.325]

36 • Consider the function Ψ(w) = σ DKL(w ‖ p) with dom Ψ = Sn (assuming p ∈ rint Sn). [sent-302, score-0.304]

37 1 Regret Bound with General Convex Regularization For a general convex regularization term Ψ, any nonnegative and nondecreasing sequence βt = Θ(√t) gives an O(√t) regret bound. [sent-312, score-0.337]

38 Proof To simplify regret analysis, let γ ≥ σ. [sent-321, score-0.296]

39 The regret bound in Corollary 2 is essentially the same as that of the online gradient descent method of Zinkevich (2003), which has the form (3) with the stepsize αt = 1/(γ√t). [sent-326, score-0.348]

40 This result is still interesting because there is no special caution taken in the RDA method, more specifically in (8), to ensure the boundedness of the sequence wt . [sent-339, score-0.53]

41 If Ψ(w) is the indicator function of a closed convex set, then ΓD = 0 and part (c) shows that ḡt actually converges to zero if there exists an interior w in FD such that Rt(w) ≥ 0. [sent-342, score-0.312]

42 2 Regret Bounds with Strongly Convex Regularization If the regularization function Ψ(w) is strongly convex, that is, with a convexity parameter σ > 0, then any nonnegative, nondecreasing sequence that satisfies βt ≤ O(ln t) will give an O(ln t) regret bound. [sent-346, score-0.358]

43 To see this, we notice that if Rt(u) and Rt(v) are nonnegative simultaneously for some t, then part (b) of Theorem 1 implies ‖wt+1 − u‖² ≤ O(ln t / t) and ‖wt+1 − v‖² ≤ O(ln t / t). [sent-359, score-0.52]

44 These rates and bounds are established not for the individual wt’s generated by the RDA method, but rather for the primal average w̄t = (1/t) Σ_{τ=1}^t wτ, t ≥ 1. [sent-365, score-1.082]

45 Then for any t ≥ 1, we have: (a) The expected cost associated with the random variable w̄t is bounded as E φ(w̄t) − φ⋆ ≤ Δt/t. [sent-369, score-0.512]

46 Proof First, we substitute all fτ(·) by f(·, zτ) in the definition of the regret: Rt(w⋆) = Σ_{τ=1}^t (f(wτ, zτ) + Ψ(wτ)) − Σ_{τ=1}^t (f(w⋆, zτ) + Ψ(w⋆)). [sent-372, score-0.296]

47 Therefore, φ(wt) ≥ φ(w⋆) + ⟨s, wt − w⋆⟩ + (σ/2)‖wt − w⋆‖², ∀ s ∈ ∂φ(w⋆). [sent-401, score-1.024]

48 Setting s = 0 in the above inequality and rearranging terms, we have ‖wt − w⋆‖² ≤ (2/σ)(φ(wt) − φ⋆). [sent-403, score-0.541]

49 Taking expectation of both sides of the above inequality leads to E‖wt − w⋆‖² ≤ (2/σ)(E φ(wt) − φ⋆) ≤ (2/(σt)) Δt, (22) where in the last step we used part (a) of Theorem 3. [sent-404, score-0.541]

50 Next we take a closer look at the quantity E‖w̄t − w⋆‖². [sent-406, score-0.512]

51 By convexity of ‖·‖², we have E‖w̄t − w⋆‖² ≤ (1/t) Σ_{τ=1}^t E‖wτ − w⋆‖². (23) If σ = 0, then it is simply bounded by a constant, because each E‖wτ − w⋆‖² for 1 ≤ τ ≤ t is bounded by a constant. [sent-407, score-0.557]

52 When σ > 0, the optimal solution w⋆ is unique, and we have: Corollary 4 If Ψ is strongly convex with convexity parameter σ > 0 and βt = O(ln t), then E‖w̄t − w⋆‖² ≤ O((ln t)²/t). [sent-408, score-0.683]

53 Substituting the bound on Δt in (20) into the inequality (22) gives E‖wt − w⋆‖² ≤ (6 + ln t) G²/(σ² t), ∀ t ≥ 1. [sent-410, score-0.564]

54 Then by (23), E‖w̄t − w⋆‖² ≤ (G²/σ²)(1/t) Σ_{τ=1}^t (6/τ + (ln τ)/τ) ≤ (6(1 + ln t) + (1/2)(ln t)²) G²/(σ² t). [sent-411, score-0.546]

55 In other words, E‖w̄t − w⋆‖² converges to zero with rate O((ln t)²/t). [sent-412, score-0.512]

56 It follows that for any chosen accuracy ε and 0 < δ ≤ 1/e, the sample size t ≥ (10GD)² ln(1/δ)/ε² guarantees that, with probability at least 1 − δ, w̄t is an ε-optimal solution of the original stochastic optimization problem (1). [sent-435, score-0.604]
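
As a quick numeric illustration of this sample-size formula, with made-up values of G, D, ε, and δ chosen only to show the scaling:

    import math

    def sample_size(G, D, eps, delta):
        """t >= (10*G*D)^2 * ln(1/delta) / eps^2 from the high-probability bound."""
        return math.ceil((10 * G * D) ** 2 * math.log(1 / delta) / eps ** 2)

    print(sample_size(G=1.0, D=1.0, eps=0.1, delta=0.01))  # 46052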

57 However, with the addition of a general regularization term Ψ(w) as in (4), the convergence analysis and O(√t) regret bound of the RDA method do not follow directly as corollaries of either Nesterov (2009) or Shalev-Shwartz and Singer (2006). [sent-454, score-0.289]

58 Shalev-Shwartz and Kakade (2009) extended the primal-dual framework of Shalev-Shwartz and Singer (2006) to strongly convex functions and obtained an O(ln t) regret bound. [sent-456, score-0.289]

59 In an online setting, each iteration of the FOBOS method consists of the following two steps: wt+1/2 = wt − αt gt, and wt+1 = arg min_w { (1/2)‖w − wt+1/2‖² + αt Ψ(w) }. [sent-473, score-0.887]

60 For easy comparison with the RDA method, we rewrite the FOBOS method in the equivalent form wt+1 = arg min_w { ⟨gt, w⟩ + Ψ(w) + (1/(2αt))‖w − wt‖² }. [sent-476, score-0.79]
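
For Ψ(w) = λ‖w‖₁, this proximal step reduces to componentwise soft-thresholding of the plain gradient step; a minimal sketch:

    import numpy as np

    def fobos_l1_step(w, g, alpha, lam):
        """FOBOS step for Psi = lam * ||.||_1: a gradient step followed by
        soft-thresholding at level alpha * lam."""
        v = w - alpha * g
        return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)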

61 Given the function ft, compute a subgradient gt ∈ ∂ft(wt). [sent-496, score-0.405]

62 Compute the dual average: ḡt = ((t − 1)/t) ḡt−1 + (1/t) gt. [sent-498, score-0.492]

63 In this case, Equation (8) becomes wt+1 = arg min_w { ⟨ḡt, w⟩ + λ‖w‖₁ + (γ/√t)((1/2)‖w‖₂² + ρ‖w‖₁) } = arg min_w { ⟨ḡt, w⟩ + λtRDA ‖w‖₁ + (γ/(2√t))‖w‖₂² }, where λtRDA = λ + γρ/√t. [sent-514, score-0.514]
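
This minimization has a componentwise closed form: entries of the dual average with magnitude at most λtRDA are truncated to zero, and the rest are soft-thresholded and rescaled. A sketch of the enhanced ℓ1-RDA update:

    import numpy as np

    def l1_rda_step(g_bar, t, lam, gamma, rho=0.0):
        """Enhanced l1-RDA closed form: with lam_t = lam + gamma*rho/sqrt(t),
        w_{t+1}^(i) = 0 if |g_bar_i| <= lam_t, and otherwise
        -(sqrt(t)/gamma) * (g_bar_i - lam_t * sgn(g_bar_i))."""
        lam_t = lam + gamma * rho / np.sqrt(t)
        shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam_t, 0.0)
        return -(np.sqrt(t) / gamma) * shrunk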

64 In the experiments, we compare the (enhanced) ℓ1-RDA method (Algorithm 2) with the SGD method wt+1^(i) = wt^(i) − αt (gt^(i) + λ sgn(wt^(i))), i = 1, . . . , n. [sent-533, score-0.777]
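
For reference, the corresponding baseline step, which only uses a subgradient of the ℓ1 penalty (taking sgn(0) = 0 as the subgradient at zero):

    import numpy as np

    def sgd_l1_step(w, g, alpha, lam):
        """SGD baseline on f + lam*||.||_1, using lam*sgn(w) as the penalty subgradient."""
        return w - alpha * (g + lam * np.sign(w))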

65 Figure: NNZs in wt (λ = 0. . . ). Left panel: K = 1 for TG, ρ = 0 for RDA; right panel: K = 10 for TG, γρ = 25 for RDA. [sent-556, score-0.512]

66 To have a better understanding of the behaviors of the algorithms, we plot the number of nonzeros (NNZs) in wt in Figure 3. [sent-605, score-0.512]

67 It can be seen that the RDA method maintains a much sparser wt than the other online algorithms. [sent-610, score-0.632]

68 While the TG method generates sparser solutions than the SGD method when λ is large, the NNZs in wt oscillate over a very wide range. [sent-611, score-0.59]

69 Consider the following problem with two separate parts in the objective function: minimize_w f(w) + Ψ(w), (32) where the function f is convex and differentiable on dom Ψ, its gradient ∇f(w) is Lipschitz continuous with constant L, and the function Ψ is a closed proper convex function. [sent-724, score-0.302]

70 For online convex optimization problems, the regret bound can be improved to O(ln t) (Hazan et al. [sent-740, score-0.363]

71 At each iteration t ≥ 1, it computes three primal vectors ut, vt, wt, and a dual vector g̃t. [sent-772, score-1.166]

72 Among them, ut is the point for querying a stochastic gradient, g̃t is a weighted average of all past stochastic gradients, vt is the solution of an auxiliary minimization problem that involves g̃t and the regularization term Ψ(w), and wt is the output vector. [sent-773, score-1.5]

73 Algorithm 3 Accelerated RDA method Input: • a strongly convex function h(w) with modulus 1 on dom Ψ. [sent-774, score-0.318]

74 Query the stochastic gradient gt = g(ut, zt), and update the weighted average g̃t: g̃t = (1 − θt) g̃t−1 + θt gt. [sent-785, score-1.024]

75 Solve for the exploration point vt = arg min_w { ⟨g̃t, w⟩ + Ψ(w) + ((L + βt)/At) h(w) }. [sent-787, score-0.5]

76 Compute wt by interpolation wt = (1 − θt )wt−1 + θt vt . [sent-788, score-1.267]
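
A schematic sketch of Algorithm 3; the minimization for vt is abstracted into an oracle since its closed form depends on Ψ and h, and the weight sequences αt and βt are supplied by the caller:

    import numpy as np

    def accelerated_rda(grad, acc_min, w0, T, alpha, beta, L):
        """Schematic accelerated RDA (Algorithm 3).
        grad(u, t) returns the stochastic gradient g(u_t, z_t);
        acc_min(g_tilde, c) solves argmin_w <g_tilde, w> + Psi(w) + c*h(w)."""
        w = v = w0
        g_tilde = np.zeros_like(w0)
        A = 0.0
        for t in range(1, T + 1):
            A += alpha(t)                              # A_t = sum of alpha_1..alpha_t
            theta = alpha(t) / A
            u = (1 - theta) * w + theta * v            # query point u_t
            g_tilde = (1 - theta) * g_tilde + theta * grad(u, t)
            v = acc_min(g_tilde, (L + beta(t)) / A)    # exploration point v_t
            w = (1 - theta) * w + theta * v            # output w_t by interpolation
        return w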

77 The additional costs are mainly the two vector interpolations (convex combinations) for computing ut and wt . [sent-790, score-0.617]

78 If we choose the two input sequences as αt = 1 for all t ≥ 1, and βt = γ√(t + 1) for all t ≥ 0, then At = t, θt = 1/t, and g̃t = ḡt is the uniform average of all past gradients. [sent-795, score-0.466]

79 Theorem 8 Suppose dom Ψ is compact, say h(w) ≤ D2 for all w ∈ dom Ψ, and let the assumptions (34) and (39) hold. [sent-810, score-0.284]

80 More specifically, at each iteration of Algorithm 3, let gt itself be the average of the stochastic gradients over a small batch of samples computed at ut. [sent-818, score-0.484]

81 The optimality condition of the above minimization problem (Rockafellar, 1970, Section 27) states that there exists a subgradient s ∈ ∂‖wt+1‖₁ such that ḡt + λs + (γ/((p − 1)√t)) ϑ(wt+1) = 0. [sent-843, score-0.315]

82 Following arguments similar to those in Appendix A, we find that it has the closed-form solution wt+1 = ϑ⁻¹(ĝt), where the elements of ĝt are given as ĝt^(i) = 0 if |ḡt^(i)| ≤ λ, and ĝt^(i) = −((p − 1)√t/γ)(ḡt^(i) − λ sgn(ḡt^(i))) otherwise, for i = 1, . . . , n. [sent-844, score-0.7]
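
A hedged sketch of this two-stage update: threshold the dual average to obtain ĝt, then map back through ϑ⁻¹. Here ϑ⁻¹ is taken to be the gradient of the conjugate (1/2)‖·‖q² with 1/p + 1/q = 1; that identification is a standard duality fact assumed for the sketch, not quoted from the paper:

    import numpy as np

    def lp_rda_step(g_bar, t, lam, gamma, p):
        """l1-RDA with the p-norm prox function (sketch).
        First form g_hat by truncation/shrinkage of g_bar, then apply
        theta^{-1} = grad of (1/2)||.||_q^2 with q = p/(p-1)."""
        shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
        g_hat = -((p - 1) * np.sqrt(t) / gamma) * shrunk
        q = p / (p - 1)
        nq = np.linalg.norm(g_hat, ord=q)
        if nq == 0.0:
            return np.zeros_like(g_hat)
        # componentwise: sgn(g_i) * |g_i|^(q-1) * ||g||_q^(2-q)
        return np.sign(g_hat) * np.abs(g_hat) ** (q - 1) * nq ** (2 - q)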

83 More specifically, for stochastic learning problems, let D² = (1/(2(p − 1)))‖w⋆‖p², and let Gq be an upper bound on ‖gt‖q for all t ≥ 1. [sent-854, score-0.308]

84 Let G∞ be an upper bound on ‖gt‖∞, that is, |gt^(i)| ≤ G∞ for all i = 1, . . . , n. [sent-858, score-0.469]

85 If we choose q = ln n (assuming n ≥ e² so that q ≥ 2), then ‖gt‖q ≤ G∞ n^(1/ln n) = G∞ e. [sent-863, score-0.291]
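
A quick numeric check of the identity n^(1/ln n) = e, which is what makes this choice of q dimension-independent up to the constant factor e:

    import math

    for n in (100, 10**4, 10**6):
        print(n ** (1 / math.log(n)))   # prints e = 2.718281828... for every n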

86 Acknowledgments The author is grateful to John Duchi for careful reading of a previous draft and pointing out that the regret analysis for general convex regularizations and for strongly convex regularizations can be unified. [sent-897, score-0.414]

87 First, let st denote the sum of the subgradients obtained up to time t in the RDA method, that is, st = Σ_{τ=1}^t gτ = t ḡt, (42) with the initialization s0 = 0. [sent-919, score-0.449]

88 For the sequence of decisions w1, . . . , wt, we define the following gap sequence: δt = max_{w∈FD} { Σ_{τ=1}^t (⟨gτ, wτ − w⟩ + Ψ(wτ)) − tΨ(w) }, t = 1, 2, 3, . . . [sent-975, score-0.512]

89 By convexity of fτ for τ ≥ 1, we have δt ≥ Σ_{τ=1}^t (fτ(wτ) − fτ(w) + Ψ(wτ)) + max_{u∈FD} { ⟨st, w − u⟩ − tΨ(u) } = Σ_{τ=1}^t (fτ(wτ) + Ψ(wτ) − fτ(w) − Ψ(w)) + max_{u∈FD} { ⟨st, w − u⟩ − t(Ψ(u) − Ψ(w)) } = Rt(w) + max_{u∈FD} { ⟨st, w − u⟩ − t(Ψ(u) − Ψ(w)) }. [sent-1014, score-0.324]

90 Let ĝt be the conditional expectation of gt given wt, that is, ĝt = E[gt | wt] = E[gt | z[t − 1]]. [sent-1026, score-1.916]

91 Since gt ∈ ∂f(wt, zt), we have ĝt ∈ ∂ϕ(wt) (e. [sent-1027, score-0.49]

92 Since wt is a deterministic function of z[t − 1] and ĝt = E[gt | wt], we have E[ξτ | z[τ − 1]] = 0. [sent-1034, score-1.247]

93 Lemma 13 Let ψ be a closed proper convex function, and h be strongly convex on dom ψ with convexity parameter σh . [sent-1045, score-0.38]

94 For every t ≥ 1, define the following two functions: ℓt(w) = ϕ(ut) + ⟨∇ϕ(ut), w − ut⟩ + Ψ(w), and ℓ̂t(w) = ϕ(ut) + ⟨gt, w − ut⟩ + Ψ(w). [sent-1055, score-0.433]

95 Let qt = gt − ∇ϕ(ut); then ℓ̂t(w) = ℓt(w) + ⟨qt, w − ut⟩. (66) [sent-1057, score-0.49]

96 Since ∇ϕ is Lipschitz continuous with a constant L (see the discussion following (34)), we have ϕ(wt) ≤ ϕ(ut) + ⟨∇ϕ(ut), wt − ut⟩ + (L/2)‖wt − ut‖². [sent-1061, score-1.234]

97 In the second inequality above, we used convexity of ℓt and ut = (1 − θt)wt−1 + θt vt−1, and in the last inequality above, we used ℓt(w) ≤ φ(w) and the assumption αt² ≤ At. [sent-1063, score-0.451]

98 Now using the inequality bc − (a/2)c² ≤ b²/(2a), ∀ a > 0, with a = βt−1, b = αt‖qt‖∗, and c = ‖vt − vt−1‖, we have At φ(wt) ≤ At−1 φ(wt−1) + αt ℓ̂t(vt) + ((L + βt−1)/2)‖vt − vt−1‖² + αt²‖qt‖∗²/(2βt−1) − αt ⟨qt, vt−1 − ut⟩. [sent-1065, score-0.863]

99 Then by Lemma 13, we have ψt−1(vt) + (L + βt−1)h(vt) ≥ ψt−1(vt−1) + (L + βt−1)h(vt−1) + ((L + βt−1)/2)‖vt − vt−1‖²; therefore, At φ(wt) − ψt(vt) − (L + βt−1)h(vt) ≤ At−1 φ(wt−1) − ψt−1(vt−1) − (L + βt−1)h(vt−1) + αt²‖qt‖∗²/(2βt−1) − αt ⟨qt, vt−1 − ut⟩. [sent-1067, score-0.51]

100 By the assumption (1/2)‖w‖² ≤ h(w) ≤ D² for all w ∈ dom Ψ, we have ‖w‖ ≤ √2 D for all w ∈ dom Ψ, and therefore |ξt| ≤ αt‖qt‖∗ ‖w⋆ − vt−1‖ ≤ αt‖qt‖∗ (‖w⋆‖ + ‖vt−1‖) ≤ 2√2 D αt‖qt‖∗. Using the assumption E exp(‖qt‖∗²/Q²) ≤ exp(1), we can bound E exp(ξt²/(8αt² Q² D²)). [sent-1101, score-0.672]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.512), ('rda', 0.475), ('vt', 0.243), ('gt', 0.223), ('tg', 0.208), ('lnt', 0.168), ('regret', 0.163), ('dom', 0.142), ('iao', 0.134), ('rt', 0.133), ('sgd', 0.127), ('fd', 0.123), ('nnzs', 0.122), ('ual', 0.116), ('egularized', 0.116), ('ut', 0.105), ('nesterov', 0.099), ('st', 0.093), ('subgradient', 0.092), ('averaging', 0.082), ('qt', 0.081), ('online', 0.08), ('ethods', 0.075), ('ipm', 0.071), ('ez', 0.069), ('convex', 0.067), ('obos', 0.064), ('stochastic', 0.062), ('batch', 0.06), ('strongly', 0.059), ('prob', 0.058), ('regularization', 0.052), ('accelerated', 0.05), ('minw', 0.049), ('dual', 0.046), ('convexity', 0.045), ('ft', 0.045), ('zt', 0.044), ('duchi', 0.04), ('subgradients', 0.04), ('lan', 0.04), ('enhanced', 0.038), ('tseng', 0.038), ('rockafellar', 0.038), ('rates', 0.038), ('stepsize', 0.035), ('arg', 0.034), ('ln', 0.034), ('nemirovski', 0.033), ('exp', 0.032), ('tradeoffs', 0.031), ('juditsky', 0.031), ('sgn', 0.031), ('convergence', 0.03), ('optimization', 0.03), ('regularizations', 0.029), ('rint', 0.029), ('trda', 0.029), ('inequality', 0.029), ('qd', 0.029), ('modulus', 0.029), ('truncation', 0.027), ('gradient', 0.026), ('singer', 0.026), ('regularized', 0.025), ('hazan', 0.025), ('nemirovsky', 0.023), ('capability', 0.023), ('bound', 0.023), ('subdifferentiable', 0.022), ('gq', 0.022), ('interior', 0.022), ('sparsity', 0.022), ('lemma', 0.022), ('siam', 0.022), ('method', 0.021), ('nondecreasing', 0.021), ('rounding', 0.021), ('teboulle', 0.021), ('primal', 0.02), ('sequences', 0.02), ('sn', 0.019), ('theorem', 0.019), ('sparse', 0.019), ('auxiliary', 0.018), ('beck', 0.018), ('lipschitz', 0.018), ('sequence', 0.018), ('minimizer', 0.018), ('hollow', 0.017), ('proximal', 0.017), ('solutions', 0.017), ('gradients', 0.017), ('iteration', 0.017), ('solving', 0.016), ('koh', 0.016), ('tewari', 0.016), ('nonnegative', 0.016), ('incremental', 0.016), ('threshold', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

2 0.33656666 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization

3 0.22129446 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph

Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN

4 0.11719915 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

Author: Jean-Yves Audibert, Sébastien Bubeck

Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

5 0.1025421 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

6 0.08406204 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

7 0.079838082 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

8 0.079686105 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning

9 0.069348417 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

10 0.063862167 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

11 0.063229054 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

12 0.056782257 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes

13 0.05622068 84 jmlr-2010-On Spectral Learning

14 0.05610729 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

15 0.05603952 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

16 0.055727702 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

17 0.053803597 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

18 0.053512678 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

19 0.052237943 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

20 0.050333943 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.208), (1, -0.223), (2, 0.274), (3, -0.078), (4, 0.511), (5, -0.001), (6, 0.007), (7, -0.003), (8, -0.016), (9, 0.051), (10, -0.069), (11, 0.041), (12, -0.095), (13, 0.035), (14, 0.017), (15, 0.041), (16, -0.079), (17, 0.105), (18, 0.027), (19, 0.053), (20, -0.006), (21, 0.032), (22, 0.001), (23, -0.034), (24, -0.002), (25, 0.026), (26, -0.025), (27, 0.051), (28, -0.012), (29, -0.071), (30, -0.066), (31, -0.018), (32, -0.051), (33, 0.072), (34, -0.007), (35, 0.071), (36, -0.039), (37, -0.003), (38, 0.006), (39, 0.014), (40, 0.027), (41, 0.004), (42, 0.015), (43, -0.041), (44, -0.065), (45, -0.009), (46, -0.068), (47, 0.015), (48, -0.004), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96842575 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

2 0.86289507 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization

3 0.83609462 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph

Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN

4 0.63867331 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

5 0.36740795 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

Author: Jean-Yves Audibert, Sébastien Bubeck

Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

6 0.26999626 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

7 0.26238036 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

8 0.25475246 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning

9 0.24531209 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

10 0.22123383 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

11 0.21696755 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

12 0.20818411 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

13 0.20568144 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

14 0.1873634 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

15 0.18031102 80 jmlr-2010-On-Line Sequential Bin Packing

16 0.17595682 72 jmlr-2010-Matrix Completion from Noisy Entries

17 0.17017764 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

18 0.16606805 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

19 0.16520266 84 jmlr-2010-On Spectral Learning

20 0.16460031 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.021), (8, 0.017), (15, 0.41), (21, 0.027), (32, 0.055), (36, 0.027), (37, 0.058), (75, 0.128), (85, 0.116)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8644343 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

Author: Daniil Ryabko

Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction

same-paper 2 0.76134712 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

3 0.46125999 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

4 0.45146459 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

Author: Jean-Yves Audibert, Sébastien Bubeck

Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

5 0.45135364 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Author: Nguyen Xuan Vinh, Julien Epps, James Bailey

Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance

6 0.44931757 63 jmlr-2010-Learning Instance-Specific Predictive Models

7 0.44133496 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

8 0.44037327 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

9 0.43637192 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

10 0.43149486 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

11 0.430639 109 jmlr-2010-Stochastic Composite Likelihood

12 0.42935333 102 jmlr-2010-Semi-Supervised Novelty Detection

13 0.42835099 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

14 0.42808577 25 jmlr-2010-Composite Binary Losses

15 0.42767665 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

16 0.42662168 27 jmlr-2010-Consistent Nonparametric Tests of Independence

17 0.42596844 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

18 0.42258611 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

19 0.42114437 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

20 0.42000371 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning