jmlr jmlr2011 jmlr2011-8 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
Reference: text
sentIndex sentText sentNum sentScore
1 Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. [sent-9, score-0.266]
2 We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. [sent-10, score-0.384]
3 We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. [sent-12, score-0.296]
4 Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization 1. [sent-13, score-0.233]
5 We denote a sequence of vectors by subscripts, that is, $x_t, x_{t+1}, \ldots$ [sent-28, score-0.464]
6 The subdifferential set of a function f evaluated at x is denoted ∂ f (x), and a particular vector in the subdifferential set is denoted by f ′ (x) ∈ ∂ f (x) or gt ∈ ∂ ft (xt ). [sent-32, score-0.742]
7 Let g1:t = [g1 · · · gt ] denote the matrix obtained by concatenating the subgradient sequence. [sent-37, score-0.692]
8 In online learning, the learner repeatedly predicts a point xt ∈ X ⊆ Rd , which often represents a weight vector assigning importance values to various features. [sent-43, score-0.543]
9 The learner's goal is to achieve low regret with respect to a static predictor $x^*$ in the (closed) convex set $\mathcal{X} \subseteq \mathbb{R}^d$ (possibly $\mathcal{X} = \mathbb{R}^d$) on a sequence of functions $f_t(x)$, measured as $R(T) = \sum_{t=1}^T f_t(x_t) - \inf_{x\in\mathcal{X}} \sum_{t=1}^T f_t(x)$. [sent-44, score-0.737]
10 At every timestep $t$, the learner receives the (sub)gradient information $g_t \in \partial f_t(x_t)$. [sent-45, score-0.759]
11 Standard subgradient algorithms then move the predictor $x_t$ in the opposite direction of $g_t$ while maintaining $x_{t+1} \in \mathcal{X}$ via the projected gradient update (e.g., Zinkevich, 2003). [sent-46, score-1.311]
12 Using this notation, our generalization of standard gradient descent employs the update $x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}}\big(x_t - \eta\, G_t^{-1/2} g_t\big)$. [sent-50, score-1.177]
13 Thus we specialize the update to $x_{t+1} = \Pi_{\mathcal{X}}^{\mathrm{diag}(G_t)^{1/2}}\big(x_t - \eta\, \mathrm{diag}(G_t)^{-1/2} g_t\big)$. [sent-52, score-1.081]
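As a rough illustration of the diagonal update above, a few lines of NumPy suffice. This is only a sketch, not the authors' code: it assumes an unconstrained domain by default, substitutes a plain Euclidean ball projection for the Mahalanobis projection $\Pi_{\mathcal{X}}^{\mathrm{diag}(G_t)^{1/2}}$, and uses a small constant eps where the paper uses the $\delta I$ regularizer.

```python
import numpy as np

def adagrad_diag_step(x, g, G_diag, eta, radius=None, eps=1e-12):
    """One diagonal ADAGRAD step: accumulate squared gradients per
    coordinate, then rescale the gradient step by diag(G_t)^{-1/2}."""
    G_diag = G_diag + g * g                    # diagonal of G_t = sum_t g_t g_t^T
    x = x - eta * g / (np.sqrt(G_diag) + eps)  # x_t - eta diag(G_t)^{-1/2} g_t
    if radius is not None:                     # crude Euclidean projection; the
        nrm = np.linalg.norm(x)                # paper projects in the
        if nrm > radius:                       # diag(G_t)^{1/2} norm instead
            x = x * (radius / nrm)
    return x, G_diag
```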
14 Each function is of the form φt (x) = ft (x) + ϕ(x) where ft and ϕ are (closed) convex functions. [sent-58, score-0.373]
15 At each round the algorithm makes a prediction xt ∈ X and then receives the function ft . [sent-61, score-0.64]
16 We define the regret with respect to the fixed (optimal) predictor $x^*$ as $R_\varphi(T) = \sum_{t=1}^T \big[\phi_t(x_t) - \phi_t(x^*)\big] = \sum_{t=1}^T \big[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\big]$. [sent-62, score-0.521]
17 In the primal-dual subgradient method the algorithm makes a prediction $x_t$ on round $t$ using the average gradient $\bar g_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$. [sent-67, score-1.212]
18 The update amounts to solving $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle \bar g_t, x\rangle + \eta\varphi(x) + \frac{1}{t}\psi_t(x)\big\}$, (3) where $\eta$ is a fixed step-size and $x_1 = \operatorname{argmin}_{x\in\mathcal{X}} \varphi(x)$. [sent-70, score-0.639]
19 The second method similarly has numerous names, including proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi et al., 2010). [sent-71, score-0.266]
20 The composite mirror descent method employs a more immediate trade-off between the current gradient $g_t$, $\varphi$, and staying close to $x_t$ using the proximal function $\psi$: $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle g_t, x\rangle + \eta\varphi(x) + B_{\psi_t}(x, x_t)\big\}$. (4) [sent-74, score-2.404]
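For a concrete instance of update (4), take $\varphi(x) = \lambda\|x\|_1$, $\mathcal{X} = \mathbb{R}^d$, and a diagonal Bregman term $B_{\psi_t}(x, x_t) = \frac{1}{2}\langle x - x_t, \mathrm{diag}(h)(x - x_t)\rangle$; the argmin then separates over coordinates into a soft-thresholding step. The sketch below makes those assumptions explicit; it is not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_md_l1_step(x, g, h, eta, lam):
    """Composite mirror-descent step (4) with phi = lam*||.||_1 and a
    diagonal Bregman term 0.5 (x - x_t)^T diag(h) (x - x_t): minimizes
    eta*<g,x> + eta*lam*||x||_1 + 0.5 (x - x_t)^T diag(h) (x - x_t)."""
    v = x - eta * g / h          # per-coordinate minimizer of the smooth part
    return soft_threshold(v, eta * lam / h)
```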
21 Informally, we obtain algorithms which are similar to second-order gradient descent by constructing approximations to the Hessian of the functions $f_t$, though we use roots of the matrices. [sent-82, score-0.272]
22 The ADAGRAD algorithm with full matrix divergences entertains bounds of the form $R_\varphi(T) = O\big(\|x^*\|_2\,\mathrm{tr}(G_T^{1/2})\big)$ and $R_\varphi(T) = O\big(\max_{t\le T}\|x_t - x^*\|_2\,\mathrm{tr}(G_T^{1/2})\big)$. [sent-86, score-0.517]
23 We further show that $\mathrm{tr}(G_T^{1/2}) = d^{1/2}\sqrt{\inf_S\left\{\sum_{t=1}^T \langle g_t, S^{-1} g_t\rangle : S \succeq 0,\ \mathrm{tr}(S) \le d\right\}}$. [sent-87, score-1.342]
24 When our proximal function is $\psi_t(x) = \langle x, \mathrm{diag}(G_t)^{1/2} x\rangle$, we have bounds attainable in time at most linear in the dimension $d$ of our problems, of the form $R_\varphi(T) = O\big(\|x^*\|_\infty \sum_{i=1}^d \|g_{1:T,i}\|_2\big)$ and $R_\varphi(T) = O\big(\max_{t\le T}\|x_t - x^*\|_\infty \sum_{i=1}^d \|g_{1:T,i}\|_2\big)$. [sent-89, score-0.604]
25 Similar to the above, we will show that $\sum_{i=1}^d \|g_{1:T,i}\|_2 = d^{1/2}\sqrt{\inf_s\left\{\sum_{t=1}^T \langle g_t, \mathrm{diag}(s)^{-1} g_t\rangle : s \succeq 0,\ \langle\mathbf{1}, s\rangle \le d\right\}}$. [sent-90, score-1.151]
26 Standard online gradient descent (Zinkevich, 2003) achieves a regret bound of $\sum_{t=1}^T f_t(x_t) - \inf_{x\in\mathcal{X}}\sum_{t=1}^T f_t(x) \le \sqrt{2}\,D_2\sqrt{\sum_{t=1}^T \|g_t\|_2^2}$. (6) [sent-99, score-0.507]
27 When $\mathcal{X}$ is bounded via $\sup_{x,y\in\mathcal{X}}\|x - y\|_\infty \le D_\infty$, the following corollary is a simple consequence of our Theorem 5. Corollary 1: Let the sequence $\{x_t\} \subset \mathbb{R}^d$ be generated by the update (4) and assume that $\max_t \|x^* - x_t\|_\infty \le D_\infty$. [sent-101, score-0.568]
28 $R_\varphi(T) \le \sqrt{2d}\,D_\infty\sqrt{\inf_s\left\{\sum_{t=1}^T \|g_t\|_{\mathrm{diag}(s)^{-1}}^2 : s \succeq 0,\ \langle\mathbf{1},s\rangle \le d\right\}} = \sqrt{2}\,D_\infty\sum_{i=1}^d \|g_{1:T,i}\|_2$. [sent-103, score-0.272]
29 Here we give a few abstract examples that show that for sparse data (input sequences where gt has many zeros) the adaptive methods herein have better performance than non-adaptive methods. [sent-115, score-0.61]
30 In our examples we use the hinge loss, that is, $f_t(x) = [1 - y_t\langle z_t, x\rangle]_+$, where $y_t$ is the label of example $t$ and $z_t \in \mathbb{R}^d$ is the data vector. [sent-116, score-0.388]
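For reference, a subgradient of the hinge loss is straightforward to compute; the short sketch below (with hypothetical argument names) is the only gradient oracle the examples need.

```python
import numpy as np

def hinge_subgradient(x, z, y):
    """Subgradient of f_t(x) = [1 - y <z, x>]_+ : equal to -y*z when the
    margin y<z,x> is strictly below 1, and 0 when the loss is inactive
    (at the kink, any element of the subdifferential is valid)."""
    if y * z.dot(x) < 1.0:
        return -y * z
    return np.zeros_like(x)
```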
31 For contrast, the standard regret bound (6) for online gradient descent has $D_2 = 2\sqrt{d}$ and $\|g_t\|_2 \ge 1$, yielding best-case regret $O(\sqrt{dT})$. [sent-122, score-0.996]
32 So we see that in this sparse yet heavy-tailed feature setting, ADAGRAD's regret guarantee can be exponentially smaller in the dimension $d$ than the non-adaptive regret bound. [sent-123, score-0.272]
33 Our remaining examples construct a sparse sequence for which there is a perfect predictor that the adaptive methods learn after $d$ iterations, while standard online gradient descent (Zinkevich, 2003) suffers significantly higher loss. [sent-124, score-0.235]
34 We assume the domain $\mathcal{X}$ is compact, so that for online gradient descent we set $\eta_t = \eta/\sqrt{t}$, which gives the optimal $O(\sqrt{T})$ regret (the setting of $\eta$ does not matter to the adversary we construct). [sent-125, score-0.294]
35 Evidently, we can take $D_\infty = 2$, and this choice simply results in the update $x_{t+1} = x_t - 2\,\mathrm{diag}(G_t)^{-1/2} g_t$ followed by projection (1) onto $\mathcal{X}$ for ADAGRAD (we use a pseudo-inverse if the inverse does not exist). [sent-129, score-1.081]
36 It is clear that the update to parameter $x_i$ at these iterations is different, and amounts to $x_{t+1} = x_t + e_i$ (ADAGRAD) versus $x_{t+1} = x_t + \frac{\eta}{\sqrt{t}}\, e_i$ (Gradient Descent). [sent-141, score-0.979]
37 In short, ADAGRAD achieves constant regret per dimension while online gradient descent can suffer arbitrary loss (for unbounded $t$). [sent-146, score-0.294]
38 We use a similar construction to the diagonal case to show a situation in which the full matrix update from (5) gives substantially lower regret than stochastic gradient descent. [sent-149, score-0.319]
39 Instead of having zt cycle through the unit vectors, we make zt cycle through the vi so that zt = ±vi . [sent-155, score-0.325]
40 We let the label $y_t = \mathrm{sign}(\langle\mathbf{1}, V^\top z_t\rangle) = \mathrm{sign}\big(\sum_{i=1}^d \langle v_i, z_t\rangle\big)$. [sent-156, score-0.284]
41 We change the proximal function to achieve performance guarantees which are competitive with the best proximal term found in hindsight. [sent-171, score-0.248]
42 The latter can be very useful when our gradient vectors gt are sparse, for example, in a classification setting where examples may have only a small number of non-zero features. [sent-173, score-0.622]
43 Whenever the predictor $\mu_t$ attains a margin value smaller than 1, AROW performs the update $\beta_t = \frac{1}{\langle z_t, \Sigma_t z_t\rangle + \lambda}$, $\alpha_t = [1 - y_t\langle z_t, \mu_t\rangle]_+$, $\mu_{t+1} = \mu_t + \alpha_t \Sigma_t y_t z_t$, $\Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t z_t z_t^\top \Sigma_t$. [sent-212, score-0.926]
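A minimal sketch of this AROW step, written directly from the displayed equations, is given below. Note that some presentations of AROW additionally scale the mean update by $\beta_t$; the version here follows the equations as stated above.

```python
import numpy as np

def arow_step(mu, Sigma, z, y, lam):
    """One AROW update: fires only when the margin y<z,mu> is below 1.
    Sigma is a dense covariance here; diagonalized variants keep only
    its diagonal for efficiency."""
    if y * z.dot(mu) >= 1.0:
        return mu, Sigma                     # margin attained: no update
    Sz = Sigma @ z
    beta = 1.0 / (z.dot(Sz) + lam)           # beta_t = 1/(<z, Sigma z> + lam)
    alpha = 1.0 - y * z.dot(mu)              # alpha_t = [1 - y<z,mu>]_+ (> 0 here)
    mu = mu + alpha * y * Sz                 # mu_t + alpha_t Sigma_t y_t z_t
    Sigma = Sigma - beta * np.outer(Sz, Sz)  # Sigma_t - beta_t Sigma z z^T Sigma
    return mu, Sigma
```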
44 McMahan and Streeter focus on what they term the competitive ratio, which is the ratio of the worst case regret of the adaptive algorithm to the worst case regret of a non-adaptive algorithm with the best proximal term ψ chosen in hindsight. [sent-224, score-0.44]
45 Tighter regret bounds using the variation of the cost functions $f_t$ were proposed by Cesa-Bianchi et al. (2007). [sent-228, score-0.328]
46 We begin by providing two corollaries based on previous work that give the regret of our base algorithms when the proximal function ψt is allowed to change. [sent-239, score-0.277]
47 For any $x^* \in \mathcal{X}$, $\sum_{t=1}^T \big[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\big] \le \frac{1}{\eta}\psi_T(x^*) + \frac{\eta}{2}\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_{t-1}^*}^2$. [sent-250, score-0.352]
48 For any $x^* \in \mathcal{X}$, $\sum_{t=1}^T \big[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\big] \le \frac{1}{\eta}B_{\psi_1}(x^*, x_1) + \frac{1}{\eta}\sum_{t=1}^{T-1}\big[B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1})\big] + \frac{\eta}{2}\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_t^*}^2$. (11) [sent-258, score-0.352]
49 The above corollaries allow us to prove regret bounds for a family of algorithms that iteratively modify the proximal functions $\psi_t$ in an attempt to lower the regret bounds. [sent-259, score-0.429]
50 Primal-Dual Subgradient Update (3): $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle\bar g_t, x\rangle + \eta\varphi(x) + \frac{1}{t}\psi_t(x)\big\}$. Composite Mirror Descent Update (4): $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle g_t, x\rangle + \eta\varphi(x) + B_{\psi_t}(x, x_t)\big\}$. [sent-264, score-1.052]
51 Whereas earlier analysis requires a learning rate to slow changes between predictors xt and xt+1 , we will instead automatically grow the proximal function we use to achieve asymptotically low regret. [sent-270, score-0.588]
52 It is natural to suspect that for $s$ achieving the infimum in Equation (12), if we use a proximal function similar to $\psi(x) = \langle x, \mathrm{diag}(s)x\rangle$ with associated squared dual norm $\|x\|_{\psi^*}^2 = \langle x, \mathrm{diag}(s)^{-1}x\rangle$, we should do well lowering the gradient terms in the regret bounds (10) and (11). [sent-281, score-0.358]
53 Lemma 4 Let gt = ft′ (xt ) and g1:t and st be defined as in Algorithm 1. [sent-285, score-0.642]
54 Then $\sum_{t=1}^T \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle \le 2\sum_{i=1}^d \|g_{1:T,i}\|_2$. [sent-286, score-1.132]
55 To obtain a regret bound, we need to consider the terms consisting of the dual norm of the subgradient in the regret bounds (10) and (11), which is $\|f_t'(x_t)\|_{\psi_t^*}^2$. [sent-287, score-0.414]
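Lemma 4 is easy to check numerically; the sketch below draws a sparse subgradient sequence and compares both sides (coordinates with $\|g_{1:t,i}\|_2 = 0$ contribute nothing and are skipped, matching the convention $0/0 = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 10
G = rng.normal(size=(T, d)) * (rng.random((T, d)) < 0.3)  # sparse subgradients

lhs = 0.0
for t in range(T):
    s_t = np.sqrt((G[: t + 1] ** 2).sum(axis=0))  # s_{t,i} = ||g_{1:t,i}||_2
    nz = s_t > 0
    lhs += (G[t, nz] ** 2 / s_t[nz]).sum()        # <g_t, diag(s_t)^{-1} g_t>
rhs = 2.0 * np.sqrt((G ** 2).sum(axis=0)).sum()   # 2 * sum_i ||g_{1:T,i}||_2
assert lhs <= rhs
print(f"lhs = {lhs:.3f} <= rhs = {rhs:.3f}")
```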
56 From the definition of $s_t$ in Algorithm 1, we clearly have $\|f_t'(x_t)\|_{\psi_t^*}^2 \le \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle$, which combined with Lemma 4 gives inequality (13). [sent-289, score-1.208]
57 To obtain a bound for a primal-dual subgradient method, we set $\delta \ge \max_t \|g_t\|_\infty$, in which case $\|g_t\|_{\psi_{t-1}^*}^2 \le \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle$, and we follow the same lines of reasoning to achieve inequality (13). [sent-292, score-2.409]
58 For $x_t$ generated using the primal-dual subgradient update (3) with $\delta \ge \max_t \|g_t\|_\infty$, for any $x^* \in \mathcal{X}$, $R_\varphi(T) \le \frac{\delta}{\eta}\|x^*\|_2^2 + \frac{1}{\eta}\|x^*\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 + \eta\sum_{i=1}^d \|g_{1:T,i}\|_2$. [sent-300, score-1.226]
59 For $x_t$ generated using the composite mirror-descent update (4), for any $x^* \in \mathcal{X}$, $R_\varphi(T) \le \frac{1}{2\eta}\max_{t\le T}\|x^* - x_t\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 + \eta\sum_{i=1}^d \|g_{1:T,i}\|_2$. [sent-301, score-1.032]
60 Furthermore, define $\gamma_T = \sum_{i=1}^d \|g_{1:T,i}\|_2 = \inf_s\left\{\sum_{t=1}^T \langle g_t, \mathrm{diag}(s)^{-1} g_t\rangle : \langle\mathbf{1}, s\rangle \le \sum_{i=1}^d \|g_{1:T,i}\|_2,\ s \succeq 0\right\}$. [sent-305, score-1.151]
61 More precisely, McMahan and Streeter (2010) show that if $\mathcal{X}$ is contained within an $\ell_\infty$ ball of radius $R$ and contains an $\ell_\infty$ ball of radius $r$, then the bound in the above corollary is within a factor of $\sqrt{2}\,R/r$ of the regret of the best diagonal proximal matrix, chosen in hindsight. [sent-323, score-0.329]
62 As in the diagonal case, we build on intuition garnered from an optimization problem, and in particular, we seek a matrix $S$ which is the solution to the following minimization problem: $\min_S \sum_{t=1}^T \langle g_t, S^{-1} g_t\rangle$ subject to $S \succeq 0$, $\mathrm{tr}(S) \le c$. [sent-328, score-1.167]
63 If we iteratively use divergences of the form $\psi_t(x) = \langle x, G_t^{1/2} x\rangle$, we might expect as in the diagonal case to attain low regret by collecting gradient information. [sent-333, score-0.247]
64 For $x_t$ generated using the primal-dual subgradient update of (3) and $\delta \ge \max_t \|g_t\|_2$, for any $x^* \in \mathcal{X}$, $R_\varphi(T) \le \frac{\delta}{\eta}\|x^*\|_2^2 + \frac{1}{\eta}\|x^*\|_2^2\,\mathrm{tr}(G_T^{1/2}) + \eta\,\mathrm{tr}(G_T^{1/2})$. [sent-337, score-1.226]
65 For $x_t$ generated with the composite mirror-descent update of (4), if $x^* \in \mathcal{X}$ and $\delta \ge 0$, $R_\varphi(T) \le \frac{\delta}{\eta}\|x^*\|_2^2 + \frac{1}{2\eta}\max_{t\le T}\|x^* - x_t\|_2^2\,\mathrm{tr}(G_T^{1/2}) + \eta\,\mathrm{tr}(G_T^{1/2})$. [sent-338, score-1.032]
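A minimal sketch of the full-matrix variant follows; it assumes $\mathcal{X} = \mathbb{R}^d$ (no projection) and takes the matrix square root by eigendecomposition, which makes the $O(d^3)$ per-step cost explicit.

```python
import numpy as np

def adagrad_full_step(x, g, G, eta, delta=1e-8):
    """One full-matrix step: G accumulates outer products g_t g_t^T and
    the iterate moves along -(delta*I + G_t^{1/2})^{-1} g_t."""
    G = G + np.outer(g, g)
    w, V = np.linalg.eigh(G)                        # G = V diag(w) V^T
    root = (V * np.sqrt(np.maximum(w, 0.0))) @ V.T  # G^{1/2}
    d = g.shape[0]
    x = x - eta * np.linalg.solve(root + delta * np.eye(d), g)
    return x, G
```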
66 Primal-Dual Subgradient Update (3): $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle\bar g_t, x\rangle + \eta\varphi(x) + \frac{1}{t}\psi_t(x)\big\}$. Composite Mirror Descent Update (4): $x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\big\{\eta\langle g_t, x\rangle + \eta\varphi(x) + B_{\psi_t}(x, x_t)\big\}$. [sent-340, score-1.052]
67 Now we use the fact that $G_1$ is a rank-1 PSD matrix with non-negative trace to see that $\sum_{t=1}^{T-1} \|x^* - x_{t+1}\|_2^2\,\big[\mathrm{tr}(G_{t+1}^{1/2}) - \mathrm{tr}(G_t^{1/2})\big] \le \max_{t\le T}\|x^* - x_t\|_2^2\,\mathrm{tr}(G_T^{1/2}) - \|x^* - x_1\|_2^2\,\mathrm{tr}(G_1^{1/2})$. [sent-346, score-0.48]
68 Then $\sum_{t=1}^T \langle g_t, S_t^\dagger g_t\rangle \le 2\sum_{t=1}^T \langle g_t, S_T^\dagger g_t\rangle = 2\,\mathrm{tr}(G_T^{1/2})$. [sent-355, score-2.264]
69 Now, assume the lemma is true for $T-1$, so from the inductive assumption we get $\sum_{t=1}^T \langle g_t, S_t^\dagger g_t\rangle \le 2\sum_{t=1}^{T-1} \langle g_t, S_{T-1}^\dagger g_t\rangle + \langle g_T, S_T^\dagger g_T\rangle$. [sent-358, score-2.289]
70 Since $S_{T-1}^\dagger$ does not depend on $t$, we can rewrite $\sum_{t=1}^{T-1} \langle g_t, S_{T-1}^\dagger g_t\rangle$ as $\mathrm{tr}\big(S_{T-1}^\dagger \sum_{t=1}^{T-1} g_t g_t^\top\big) = \mathrm{tr}\big((G_{T-1}^\dagger)^{1/2} G_{T-1}\big)$, where the right-most equality follows from the definitions of $S_t$ and $G_t$. [sent-359, score-1.889]
71 Therefore, we get $\sum_{t=1}^T \langle g_t, S_t^\dagger g_t\rangle \le 2\,\mathrm{tr}\big((G_{T-1}^\dagger)^{1/2} G_{T-1}\big) + \langle g_T, (G_T^\dagger)^{1/2} g_T\rangle = 2\,\mathrm{tr}(G_{T-1}^{1/2}) + \langle g_T, (G_T^\dagger)^{1/2} g_T\rangle$. [sent-360, score-1.132]
72 Using Lemma 8 with the substitution $B = G_T$, $\nu = 1$, and $g = g_T$ lets us exploit the concavity of the function $\mathrm{tr}(A^{1/2})$ to bound the above sum by $2\,\mathrm{tr}(G_T^{1/2})$. [sent-361, score-0.566]
73 As in the diagonal case, we have that the squared dual norm (a seminorm when $\delta = 0$) associated with $\psi_t$ is $\|x\|_{\psi_t^*}^2 = \langle x, (\delta I + S_t)^{-1} x\rangle$. Thus it is clear that $\|g_t\|_{\psi_t^*}^2 \le \langle g_t, S_t^\dagger g_t\rangle$. [sent-363, score-1.193]
74 For the dual-averaging algorithms, we use Lemma 9 above to show that $\|g_t\|_{\psi_{t-1}^*}^2 \le \langle g_t, S_t^\dagger g_t\rangle$ so long as $\delta \ge \|g_t\|_2$, which bounds $\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_t^*}^2$. [sent-365, score-1.698]
75 The corollary underscores that for learning problems in which there is a rotation $U$ of the space for which the gradient vectors $g_t$ have small inner products $\langle g_t, U g_t\rangle$ (essentially a sparse basis for the $g_t$), then using full-matrix proximal functions can attain significantly lower regret. [sent-372, score-1.912]
76 Then the regret of the sequence $\{x_t\}$ generated by Algorithm 2 when using the primal-dual subgradient update with $\eta = \|x^*\|_2$ is $R_\varphi(T) \le 2\|x^*\|_2\,\mathrm{tr}(G_T^{1/2}) + \delta\|x^*\|_2$. Let $\mathcal{X}$ be a compact set so that $\sup_{x\in\mathcal{X}}\|x - x^*\|_2 \le D$. [sent-374, score-0.453]
77 Taking $\eta = D/\sqrt{2}$ and using the composite mirror descent update with $\delta = 0$, we have $R_\varphi(T) \le \sqrt{2}\,D\,\mathrm{tr}(G_T^{1/2}) = \sqrt{2d}\,D\sqrt{\inf_S\left\{\sum_{t=1}^T g_t^\top S^{-1} g_t : S \succeq 0,\ \mathrm{tr}(S) \le d\right\}}$. [sent-376, score-0.638]
78 In particular, at time $t$ for the RDA update, we have $u = \eta t\,\bar g_t$. [sent-387, score-0.566]
79 For the composite gradient update (4), $\eta\langle g_t, x\rangle + \frac{1}{2}\langle x - x_t, H_t(x - x_t)\rangle = \langle \eta g_t - H_t x_t, x\rangle + \frac{1}{2}\langle x, H_t x\rangle + \frac{1}{2}\langle x_t, H_t x_t\rangle$, so that $u = \eta g_t - H_t x_t$. [sent-388, score-3.51]
80 Here we need to keep an unnormalized version of the average $\bar g_t$. [sent-405, score-0.566]
81 Concretely, we keep track of $u_t = t\,\bar g_t = \sum_{\tau=1}^t g_\tau = u_{t-1} + g_t$, then use the update (19): $x_{t,i} = \mathrm{sign}(-u_{t,i})\,\frac{\eta t}{H_{t,ii}}\left[\frac{|u_{t,i}|}{t} - \lambda\right]_+$, where $H_t$ can clearly be updated lazily in a similar fashion. [sent-406, score-1.183]
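The lazy sparse implementation then only touches the non-zero coordinates of $g_t$. A hedged per-coordinate sketch of update (19), with hypothetical argument names:

```python
import numpy as np

def rda_l1_coordinate(u_i, H_ii, eta, lam, t):
    """Closed-form coordinate value for update (19):
    x_{t,i} = sign(-u_{t,i}) * (eta*t/H_{t,ii}) * [|u_{t,i}|/t - lam]_+.
    Coordinates whose average gradient magnitude |u_{t,i}|/t stays below
    lam are exactly zero, which is what makes lazy evaluation worthwhile."""
    shrunk = max(abs(u_i) / t - lam, 0.0)
    return np.sign(-u_i) * (eta * t / H_ii) * shrunk
```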
82 SORT the indices so that $v_{i_j}/a_{i_j} \ge v_{i_{j+1}}/a_{i_{j+1}}$. SET $\rho := \max\big\{\rho : \sum_{j=1}^\rho a_{i_j} v_{i_j} - \frac{v_{i_\rho}}{a_{i_\rho}}\sum_{j=1}^\rho a_{i_j}^2 < c\big\}$. SET $\theta = \frac{\sum_{j=1}^\rho a_{i_j} v_{i_j} - c}{\sum_{j=1}^\rho a_{i_j}^2}$. RETURN $z^*$ where $z_i^* = [v_i - \theta a_i]_+$. [sent-412, score-0.458]
83 Now, by appropriate choice of $v = -H^{-1/2} u = -\eta t\,H_t^{-1/2}\bar g_t$ for the primal-dual update (3) and $v = H_t^{1/2} x_t - \eta H_t^{-1/2} g_t$ for the mirror-descent update (4), we arrive at the problem $\min_z \frac{1}{2}\|z - v\|_2^2$ subject to $\sum_{i=1}^d a_i |z_i| \le c$. (20) [sent-420, score-1.698]
84 By the symmetry of the objective (20), we can assume without loss of generality that $v \succeq 0$ and constrain $z \succeq 0$, and a bit of manipulation with the Lagrangian (see Appendix G) for the problem shows that the solution $z^*$ has the form $z_i^* = v_i - \theta^* a_i$ if $v_i \ge \theta^* a_i$ and $z_i^* = 0$ otherwise, for some $\theta^* \ge 0$. [sent-424, score-0.23]
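Putting the sort-and-threshold pseudocode together with this solution form gives an $O(d\log d)$ projection routine. The sketch below assumes $v \succeq 0$, $a \succ 0$, and $c > 0$ (the general case reduces to this by the symmetry argument above):

```python
import numpy as np

def project_weighted_l1(v, a, c):
    """Solve min_z 0.5*||z - v||_2^2  s.t.  sum_i a_i z_i <= c, z >= 0,
    for v >= 0, a > 0, c > 0, following the SORT/SET/RETURN steps above."""
    if a.dot(v) <= c:
        return v.copy()                    # already feasible
    order = np.argsort(v / a)[::-1]        # v_{i_j}/a_{i_j} decreasing
    av = np.cumsum(a[order] * v[order])    # running sum of a_{i_j} v_{i_j}
    aa = np.cumsum(a[order] ** 2)          # running sum of a_{i_j}^2
    ratios = v[order] / a[order]
    rho = np.nonzero(av - ratios * aa < c)[0].max()
    theta = (av[rho] - c) / aa[rho]
    return np.maximum(v - theta * a, 0.0)
```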
85 To remind the reader, PA is an online learning procedure with the update $x_{t+1} = \operatorname{argmin}_x\,[1 - y_t\langle z_t, x\rangle]_+ + \frac{\lambda}{2}\|x - x_t\|_2^2$, where $\lambda$ is a regularization parameter. [sent-521, score-0.731]
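This PA objective has a one-line closed form: minimizing over $x = x_t + \beta y_t z_t$ shows the optimal step is $\beta = \min\{\ell_t/\|z_t\|_2^2,\ 1/\lambda\}$, where $\ell_t$ is the hinge loss at $x_t$. A sketch under that derivation:

```python
import numpy as np

def pa_step(x, z, y, lam):
    """Passive-Aggressive update for
    min_x [1 - y<z,x>]_+ + (lam/2)*||x - x_t||_2^2."""
    loss = max(0.0, 1.0 - y * z.dot(x))
    if loss == 0.0:
        return x                               # passive: margin attained
    beta = min(loss / z.dot(z), 1.0 / lam)     # aggressive but capped by 1/lam
    return x + beta * y * z
```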
86 By using a representer theorem it is also possible to derive efficient updates for PA and AROW when the loss is the logistic loss, $\log(1 + \exp(-y_t\langle z_t, x_t\rangle))$. [sent-523, score-0.57]
87 Our online regret bounds can be naturally converted into rate of convergence and generalization bounds (Cesa-Bianchi et al., 2004). [sent-762, score-0.23]
88 Such an algorithm can interpolate between the computational simplicity of the diagonal proximal functions and the ability of full matrices to capture correlation in the gradient vectors. [sent-787, score-0.232]
89 Thus, for Zinkevich’s projected gradient, we have xt = αt,1 v1 for some multiplier αt,1 > 0 when t ≤ η2 /ε2 . [sent-807, score-0.507]
90 After the first $\eta^2/\varepsilon^2$ rounds, we perform the updates $x_{t+1} = \Pi_{\{\|x\|_2 \le \sqrt{d}\}}\big(x_t + \frac{\eta}{\sqrt{t}}\, v_i\big)$ for some index $i$, but as in the diagonal case $\eta/\sqrt{t} \le \varepsilon$, and by orthogonality of the $v_i, v_j$ we have $x_t = V\alpha_t$ for some $\alpha_t \succeq 0$, and the projection step can only shrink the multiplier $\alpha_{t,i}$ for index $i$. [sent-808, score-1.166]
91 However, an identical argument shows that $G_t$ is simply updated to $v_1 v_1^\top + v_i v_i^\top$, in which case $x_t = v_1 + v_i$. [sent-813, score-0.616]
92 Indeed, an inductive argument shows that until all the vectors $v_i$ are seen, we have $\|x_t\|_2 < \sqrt{d}$ by orthogonality, and eventually we have $x_t = \sum_{i=1}^d v_i$ with $\|x_t\|_2 = \sqrt{d}$, so that $x_t \in \mathcal{X} = \{x : \|x\|_2 \le \sqrt{d}\}$ and we suffer no loss. [sent-814, score-2.008]
93 Thus, $2\|a_{1:T-1}\|_2 + \frac{a_T^2}{\|a_{1:T}\|_2} = 2\sqrt{b_T - a_T^2} + \frac{a_T^2}{\sqrt{b_T}} \le 2\sqrt{b_T} = 2\|a_{1:T}\|_2$. Having proved the bound (24), we note that by construction $s_{t,i} = \|g_{1:t,i}\|_2$, so $\sum_{t=1}^T \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle = \sum_{t=1}^T\sum_{i=1}^d \frac{g_{t,i}^2}{\|g_{1:t,i}\|_2} \le 2\sum_{i=1}^d \|g_{1:T,i}\|_2$. [sent-845, score-1.132]
94 Indeed, letting $g_t \in \partial f_t(x_t)$ and defining $z_t = \sum_{\tau=1}^t g_\tau$, we have $\sum_{t=1}^T \big[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\big] \le \sum_{t=1}^T \big[\langle g_t, x_t - x^*\rangle + \varphi(x_t) - \varphi(x^*)\big] \le \sum_{t=1}^T \big[\langle g_t, x_t\rangle + \varphi(x_t)\big] + \sup_{x\in\mathcal{X}}\big[-\sum_{t=1}^T\langle g_t, x\rangle - T\varphi(x) - \frac{1}{\eta}\psi_T(x)\big] + \frac{1}{\eta}\psi_T(x^*) = \frac{1}{\eta}\psi_T(x^*) + \sum_{t=1}^T \big[\langle g_t, x_t\rangle + \varphi(x_t)\big] + \psi_T^*(-z_T)$. [sent-918, score-4.833]
95 T η t=1 Since ψt+1 ≥ ψt , it is clear that T 1 ψ∗ (−zT ) = − ∑ gt , xT +1 − T ϕ(xT +1 ) − ψT (xT +1 ) T η t=1 T 1 ≤ − ∑ gt , xT +1 − (T − 1)ϕ(xT +1 ) − ϕ(xT +1 ) − ψT −1 (xT +1 ) η t=1 1 ≤ sup − zT , x − (T − 1)ϕ(x) − ψT −1 (x) − ϕ(xT +1 ) = ψ∗ −1 (−zT ) − ϕ(xT +1 ). [sent-919, score-1.132]
96 We can repeat the same sequence of steps that gave the last equality to see that $\sum_{t=1}^T \big[f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*)\big] \le \frac{1}{\eta}\psi_T(x^*) + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{\psi_{t-1}^*}^2 + \psi_0^*(-z_0)$. [sent-921, score-0.918]
97 Then for any $x^*$, $\eta\big(f_t(x_t) - f_t(x^*)\big) + \eta\big(\varphi(x_{t+1}) - \varphi(x^*)\big) \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$. Proof: The optimality of $x_{t+1}$ for (4) implies that for all $x \in \mathcal{X}$ and $\varphi'(x_{t+1}) \in \partial\varphi(x_{t+1})$, $\langle x - x_{t+1},\ \eta f_t'(x_t) + \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t) + \eta\varphi'(x_{t+1})\rangle \ge 0$. [sent-929, score-0.816]
98 From the subgradient inequality for convex functions, we have $f_t(x^*) \ge f_t(x_t) + \langle f_t'(x_t), x^* - x_t\rangle$, or $f_t(x_t) - f_t(x^*) \le \langle f_t'(x_t), x_t - x^*\rangle$, and likewise for $\varphi(x_{t+1})$. [sent-931, score-1.779]
99 We need to solve update (3), which amounts to $\min_x\ \eta\langle \bar g_t, x\rangle + \frac{\delta}{2t}\|x\|_2^2 + \frac{1}{2t}\langle x, \mathrm{diag}(s_t)x\rangle + \eta\lambda\|x\|_1$. [sent-942, score-0.617]
100 Thus, had we known the last non-zero index $\rho$, we would have obtained $\sum_{i=1}^\rho a_i v_i - \frac{v_\rho}{a_\rho}\sum_{i=1}^\rho a_i^2 = \sum_{i=1}^\rho a_i^2\Big(\frac{v_i}{a_i} - \frac{v_\rho}{a_\rho}\Big) < c \le \sum_{i=1}^{\rho+1} a_i v_i - \frac{v_{\rho+1}}{a_{\rho+1}}\sum_{i=1}^{\rho+1} a_i^2 = \sum_{i=1}^{\rho+1} a_i^2\Big(\frac{v_i}{a_i} - \frac{v_{\rho+1}}{a_{\rho+1}}\Big)$. [sent-968, score-0.423]
wordName wordTfidf (topN-words)
[('gt', 0.566), ('xt', 0.464), ('rda', 0.227), ('arow', 0.217), ('rad', 0.195), ('tr', 0.191), ('ft', 0.176), ('regret', 0.136), ('dag', 0.135), ('subgradient', 0.126), ('proximal', 0.124), ('inger', 0.118), ('ubgradient', 0.118), ('uchi', 0.118), ('azan', 0.1), ('daptive', 0.09), ('zt', 0.083), ('st', 0.076), ('vi', 0.076), ('ht', 0.067), ('online', 0.062), ('ethods', 0.062), ('gg', 0.061), ('diag', 0.057), ('gradient', 0.056), ('composite', 0.053), ('update', 0.051), ('obos', 0.05), ('streeter', 0.05), ('duchi', 0.05), ('mirror', 0.049), ('mcmahan', 0.047), ('adaptive', 0.044), ('zinkevich', 0.044), ('xiao', 0.041), ('descent', 0.04), ('ai', 0.039), ('rd', 0.037), ('pa', 0.035), ('diagonal', 0.035), ('hazan', 0.035), ('corollary', 0.034), ('proportion', 0.034), ('stepsize', 0.033), ('predictor', 0.033), ('crammer', 0.033), ('hx', 0.032), ('outer', 0.031), ('imagenet', 0.031), ('census', 0.028), ('multiplier', 0.028), ('mum', 0.027), ('ftrl', 0.026), ('income', 0.026), ('regularization', 0.026), ('dual', 0.026), ('lemma', 0.025), ('multiclass', 0.025), ('adagrad', 0.025), ('ccat', 0.025), ('ecat', 0.025), ('gcat', 0.025), ('prequel', 0.025), ('stochastic', 0.024), ('updates', 0.023), ('yt', 0.023), ('argmin', 0.022), ('grangier', 0.021), ('nput', 0.021), ('convex', 0.021), ('divergences', 0.02), ('nemirovski', 0.019), ('mal', 0.019), ('evidently', 0.019), ('maxt', 0.019), ('inf', 0.019), ('rounds', 0.019), ('bt', 0.019), ('argminx', 0.019), ('eturn', 0.019), ('mcat', 0.019), ('rankers', 0.019), ('sign', 0.019), ('gradients', 0.018), ('corollaries', 0.017), ('full', 0.017), ('learner', 0.017), ('root', 0.016), ('kalai', 0.016), ('ada', 0.016), ('infrequent', 0.016), ('juditsky', 0.016), ('pardalos', 0.016), ('rakhlin', 0.016), ('bounds', 0.016), ('rank', 0.016), ('projected', 0.015), ('image', 0.015), ('pass', 0.015), ('sparsity', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000019 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
2 0.47005874 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
3 0.1390972 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
4 0.12339973 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametric model, estimating function
5 0.11390073 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
6 0.086432345 104 jmlr-2011-X-Armed Bandits
7 0.085783087 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
8 0.083125129 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
9 0.071414687 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
10 0.069728509 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
11 0.065389842 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
12 0.062470168 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.059160285 101 jmlr-2011-Variable Sparsity Kernel Learning
14 0.058326326 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
15 0.05143236 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
16 0.049461201 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
17 0.048013378 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
18 0.046589345 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
19 0.042612907 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
20 0.04252933 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
topicId topicWeight
[(0, 0.258), (1, 0.437), (2, -0.108), (3, -0.015), (4, 0.197), (5, 0.105), (6, 0.441), (7, 0.057), (8, 0.053), (9, -0.181), (10, 0.048), (11, 0.091), (12, -0.045), (13, -0.057), (14, -0.133), (15, 0.038), (16, 0.075), (17, -0.111), (18, -0.033), (19, 0.096), (20, -0.052), (21, 0.124), (22, -0.015), (23, 0.023), (24, -0.009), (25, -0.097), (26, 0.059), (27, 0.0), (28, 0.001), (29, 0.001), (30, 0.081), (31, -0.046), (32, 0.056), (33, 0.031), (34, 0.005), (35, 0.005), (36, 0.012), (37, 0.03), (38, -0.019), (39, 0.001), (40, -0.023), (41, 0.027), (42, -0.024), (43, -0.018), (44, 0.007), (45, -0.016), (46, 0.002), (47, 0.016), (48, 0.011), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.98671651 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
2 0.95589364 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
3 0.49384731 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
4 0.41217858 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre
Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning
5 0.35260341 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.35214472 36 jmlr-2011-Generalized TD Learning
7 0.29867929 104 jmlr-2011-X-Armed Bandits
8 0.2718254 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
9 0.21813075 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
10 0.21726003 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
11 0.21245542 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
12 0.20239782 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
13 0.20068893 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
14 0.19393563 101 jmlr-2011-Variable Sparsity Kernel Learning
15 0.17002834 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
16 0.15989196 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.15262447 35 jmlr-2011-Forest Density Estimation
18 0.15203284 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
19 0.15173787 55 jmlr-2011-Learning Multi-modal Similarity
20 0.14214362 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
topicId topicWeight
[(4, 0.036), (9, 0.055), (10, 0.02), (24, 0.066), (31, 0.073), (32, 0.031), (41, 0.022), (60, 0.01), (67, 0.024), (70, 0.373), (73, 0.028), (78, 0.12), (90, 0.016)]
simIndex simValue paperId paperTitle
1 0.80335265 82 jmlr-2011-Robust Gaussian Process Regression with a Student-t Likelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
same-paper 2 0.77750784 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
3 0.76950645 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
4 0.43973151 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.42173257 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
Author: Mauricio A. Álvarez, Neil D. Lawrence
Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes
6 0.4216564 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.4207744 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
8 0.4073216 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
9 0.40579897 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.40291008 12 jmlr-2011-Bayesian Co-Training
11 0.40059116 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
12 0.40040961 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.40020213 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
14 0.39732546 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.39526764 87 jmlr-2011-Stochastic Methods for l1-regularized Loss Minimization
16 0.39322737 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
17 0.39192089 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
18 0.39178374 91 jmlr-2011-The Sample Complexity of Dictionary Learning
19 0.39092118 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
20 0.39009532 7 jmlr-2011-Adaptive Exact Inference in Graphical Models