nips nips2013 nips2013-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rie Johnson, Tong Zhang
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Baidu Inc., Beijing, China; Rutgers University, New Jersey, USA. Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. [sent-2, score-0.222]
2 To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). [sent-3, score-0.579]
3 For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). [sent-4, score-0.423]
4 Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. [sent-6, score-0.094]
5 Given training examples (x1 , y1 ), . . . , (xn , yn ), where xi ∈ Rd and yi ∈ R, if we use the squared loss $\psi_i(w) = (w^\top x_i - y_i)^2$, then we obtain least squares regression. [sent-16, score-0.126]
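To make the running example concrete, here is a minimal sketch of the least-squares loss and its per-example gradient (the function names and NumPy usage are ours, not the paper's):

```python
import numpy as np

def psi(w, x_i, y_i):
    """Squared loss for one example: psi_i(w) = (w^T x_i - y_i)^2."""
    return (w @ x_i - y_i) ** 2

def grad_psi(w, x_i, y_i):
    """Per-example gradient: 2 (w^T x_i - y_i) x_i."""
    return 2.0 * (w @ x_i - y_i) * x_i
```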
6 A standard method is gradient descent, which can be described by the following update rule for t = 1, 2, . . . : [sent-21, score-0.163]
7 $w^{(t)} = w^{(t-1)} - \frac{\eta_t}{n}\sum_{i=1}^{n}\nabla\psi_i(w^{(t-1)})$. (2) However, at each step, gradient descent requires evaluation of n derivatives, which is expensive. [sent-25, score-0.134]
8 A popular modification is stochastic gradient descent (SGD): at each iteration t = 1, 2, . . . , draw $i_t$ randomly from {1, . . . , n} and update $w^{(t)} = w^{(t-1)} - \eta_t \nabla\psi_{i_t}(w^{(t-1)})$ (3); more generally, $w^{(t)} = w^{(t-1)} - \eta_t\, g_t(w^{(t-1)}, \xi_t)$ with $\mathbb{E}[g_t(w^{(t-1)}, \xi_t) \mid w^{(t-1)}] = \nabla P(w^{(t-1)})$ (4). [sent-26, score-0.173]
9 The advantage of stochastic gradient is that each step only relies on a single derivative $\nabla\psi_i(\cdot)$, and thus the computational cost is 1/n that of the standard gradient descent. [sent-35, score-0.215]
10 However, a disadvantage of the method is that the randomness introduces variance — this is caused by the fact that $g_t(w^{(t-1)}, \xi_t)$ equals the gradient $\nabla P(w^{(t-1)})$ in expectation, but each $g_t(w^{(t-1)}, \xi_t)$ is different. [sent-36, score-0.28]
11 In particular, if gt (w(t−1) , ξt ) is large, then we have a relatively large variance which slows down the convergence. [sent-37, score-0.178]
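As a reference point for the rest of the section, a minimal sketch of SGD with the O(1/t) step-size decay discussed below (illustrative code, not from the paper; grad_psi is the per-example gradient from the least-squares sketch above):

```python
import numpy as np

def sgd(X, y, grad_psi, eta0=0.1, T=10000, seed=0):
    """Plain SGD: each step uses one example, so the variance of the
    gradient estimate forces the decaying step size eta0 / t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = int(rng.integers(n))      # draw i_t uniformly from {0, ..., n-1}
        w -= (eta0 / t) * grad_psi(w, X[i], y[i])
    return w
```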
12 Assume that each $\psi_i(w)$ is L-smooth, i.e. $\psi_i(w) - \psi_i(w') - 0.5L\|w - w'\|_2^2 \le \nabla\psi_i(w')^\top (w - w')$ (5), and convex; and P(w) is strongly convex: $P(w) - P(w') - 0.5\gamma\|w - w'\|_2^2 \ge \nabla P(w')^\top (w - w')$ (6). [sent-39, score-0.124]
13 However, for SGD, due to the variance of random sampling, we generally need to choose ηt = O(1/t) and obtain a slower sub-linear convergence rate of O(1/t). [sent-42, score-0.191]
14 This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent. [sent-43, score-0.24]
15 Although the fast computation means it can reach an approximate solution relatively quickly, and thus has been proposed by various researchers for large scale problems Zhang [2004], Shalev-Shwartz et al. [sent-44, score-0.074]
16 [2007] (see also Leon Bottou's SGD code at http://leon.bottou.org/projects/sgd), the convergence slows down when we need a more accurate solution. [sent-47, score-0.068]
17 Recently, Le Roux et al. [2012], Shalev-Shwartz and Zhang [2012] proposed methods that achieve such a variance reduction effect for SGD, which leads to a linear convergence rate when ψi (w) is smooth and strongly convex. [sent-50, score-0.313]
18 These methods are suitable for training convex linear prediction problems such as logistic regression or least squares regression, and in fact, SDCA is the method implemented in the popular LIBLINEAR package Hsieh et al. [2008]. [sent-53, score-0.244]
19 However, both proposals require storage of all gradients (or dual variables). [sent-55, score-0.103]
20 Although this issue may not be a problem for training simple regularized linear prediction problems such as least squares regression, the requirement makes it unsuitable for more complex applications where storing all gradients would be impractical. [sent-56, score-0.147]
21 One example is training certain structured learning problems with convex loss, and another example is training nonconvex neural networks. [sent-57, score-0.375]
22 In order to remedy the problem, we propose a different method in this paper that employs explicit variance reduction without the need to store the intermediate gradients. [sent-58, score-0.174]
23 We show that if ψi (w) is strongly convex and smooth, then the same convergence rate as those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012] can be obtained. [sent-59, score-0.244]
24 Even if ψi (w) is nonconvex (such as neural networks), under mild assumptions, it can be shown that asymptotically the variance of SVRG goes to zero, and thus faster convergence can be achieved. [sent-61, score-0.362]
25 • We provide a much simpler proof of the linear convergence results for smooth and strongly convex loss, and our view provides a significantly more intuitive explanation of the fast convergence by explicitly connecting the idea to variance reduction in SGD. [sent-64, score-0.45]
26 • The relatively intuitive variance reduction explanation also applies to nonconvex optimization problems, and thus this idea can be used for complex problems such as training deep neural networks. [sent-66, score-0.44]
27 2 Stochastic Variance Reduced Gradient One practical issue for SGD is that in order to ensure convergence the learning rate ηt has to decay to zero. [sent-67, score-0.132]
28 The need for a small learning rate is due to the variance of SGD (that is, SGD approximates the full gradient using a small batch of samples or even a single example, and this introduces variance). [sent-69, score-0.255]
29 Moreover, we maintain the average gradient $\tilde{\mu} = \nabla P(\tilde{w}) = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\tilde{w})$, and its computation requires one pass over the data using $\tilde{w}$. [sent-73, score-0.088]
30 Note that the expectation of $\nabla\psi_i(\tilde{w}) - \tilde{\mu}$ over i is zero, and thus the following update rule is a generalized SGD: randomly draw $i_t$ from {1, . . . , n} and set $w^{(t)} = w^{(t-1)} - \eta_t(\nabla\psi_{i_t}(w^{(t-1)}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu})$. (7) [sent-74, score-0.098]
31 The update rule in (7) can also be obtained by defining the auxiliary function $\tilde{\psi}_i(w) = \psi_i(w) - (\nabla\psi_i(\tilde{w}) - \tilde{\mu})^\top w$. [sent-80, score-0.075]
32 Since $\frac{1}{n}\sum_{i=1}^{n}(\nabla\psi_i(\tilde{w}) - \tilde{\mu}) = 0$, we have $P(w) = \frac{1}{n}\sum_{i=1}^{n}\tilde{\psi}_i(w)$; now we may apply the standard SGD to this new representation of P(w) and obtain the update rule (7). [sent-82, score-0.075]
33 To see that the variance of the update rule (7) is reduced, we note that when both $\tilde{w}$ and $w^{(t)}$ converge to the same parameter $w^*$, then $\tilde{\mu} \to 0$. [sent-83, score-0.162]
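This can be made explicit. Writing the SVRG search direction as $v_t$ (the notation used in the analysis below), a short sketch:

```latex
v_t = \nabla\psi_{i_t}(w^{(t-1)}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu},
\qquad
\mathbb{E}_{i_t}\bigl[v_t \mid w^{(t-1)}\bigr] = \nabla P(w^{(t-1)}).
```

If $w^{(t-1)} \to w^*$ and $\tilde{w} \to w^*$, then $\nabla\psi_{i_t}(w^{(t-1)}) - \nabla\psi_{i_t}(\tilde{w}) \to 0$ and $\tilde{\mu} = \nabla P(\tilde{w}) \to \nabla P(w^*) = 0$, so the update direction itself vanishes even with a constant learning rate.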
34 We call this method stochastic variance reduced gradient (SVRG) because it explicitly reduces the variance of SGD. [sent-86, score-0.319]
35 Unlike SGD, the learning rate ηt for SVRG does not have to decay, which leads to faster convergence as one can use a relatively large learning rate. [sent-87, score-0.147]
36 In practical implementations, it is natural to choose option I, or take $\tilde{w}_s$ to be the average of the past t iterates. [sent-89, score-0.17]
37 Note that each stage s requires 2m + n gradient computations (for some convex problems, one may save the intermediate gradients $\nabla\psi_i(\tilde{w})$, and thus only m + n gradient computations are needed). [sent-91, score-0.341]
38 Therefore it is natural to choose m to be of the same order as n but slightly larger (for example m = 2n for convex problems and m = 5n for nonconvex problems in our experiments). [sent-92, score-0.271]
39 In comparison, standard SGD requires only m gradient computations. [sent-93, score-0.088]
40 Since computing a gradient may be the most computationally intensive operation, for a fair comparison, we compare SGD to SVRG based on the number of gradient computations. [sent-94, score-0.176]
41 3 Analysis For simplicity we will only consider the case that each ψi (w) is convex and smooth, and P (w) is strongly convex. [sent-95, score-0.124]
42 Assume that all ψi are convex and both (5) and (6) hold with γ > 0. [sent-98, score-0.084]
43 Procedure SVRG (parameters: update frequency m, learning rate η; initialize $\tilde{w}_0$): for s = 1, 2, . . .
    $\tilde{w} = \tilde{w}_{s-1}$,  $\tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\tilde{w})$,  $w_0 = \tilde{w}$
    for t = 1, 2, . . . , m: randomly pick $i_t$ from {1, . . . , n} and update the weight
        $w_t = w_{t-1} - \eta\,(\nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu})$
    end
    option I: set $\tilde{w}_s = w_m$
    option II: set $\tilde{w}_s = w_t$ for randomly chosen $t \in \{0, \ldots, m-1\}$ [sent-109, score-0.71]
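A compact sketch of this procedure in code, using option I and the m = 2n default suggested above for convex problems (an illustrative implementation under our own naming, not the authors' code; grad_psi is as in the earlier sketches):

```python
import numpy as np

def svrg(X, y, grad_psi, eta=0.01, m=None, stages=20, seed=0):
    """SVRG with option I: w_tilde_s = w_m after each inner loop."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m if m is not None else 2 * n   # inner-loop length, ~2n for convex
    w_tilde = np.zeros(d)
    for _ in range(stages):
        # Snapshot full gradient: one pass over the data (n gradient evals).
        mu = np.mean([grad_psi(w_tilde, X[i], y[i]) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for _ in range(m):              # 2m more gradient evals per stage
            i = int(rng.integers(n))
            v = grad_psi(w, X[i], y[i]) - grad_psi(w_tilde, X[i], y[i]) + mu
            w -= eta * v                # constant learning rate, no decay
        w_tilde = w                     # option I
    return w_tilde
```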
44 Given any i, consider $g_i(w) = \psi_i(w) - \psi_i(w^*) - \nabla\psi_i(w^*)^\top (w - w^*)$. We know that $g_i(w^*) = \min_w g_i(w)$ since $\nabla g_i(w^*) = 0$. [sent-113, score-0.274]
45 Therefore $0 = g_i(w^*) \le \min_\eta \big[ g_i(w - \eta\nabla g_i(w)) \big] \le \min_\eta \big[ g_i(w) - \eta \|\nabla g_i(w)\|_2^2 + 0.5 L \eta^2 \|\nabla g_i(w)\|_2^2 \big] = g_i(w) - \tfrac{1}{2L}\|\nabla g_i(w)\|_2^2$. [sent-115, score-0.252]
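Rearranging gives $\|\nabla\psi_i(w) - \nabla\psi_i(w^*)\|_2^2 \le 2L\,g_i(w)$; summing over i and using $\frac{1}{n}\sum_i g_i(w) = P(w) - P(w^*)$ (the linear terms average to $\nabla P(w^*)^\top(w - w^*) = 0$) yields the variance bound the analysis builds on, sketched here:

```latex
\frac{1}{n}\sum_{i=1}^{n}\bigl\|\nabla\psi_i(w) - \nabla\psi_i(w^*)\bigr\|_2^2
\;\le\; 2L\,\bigl[P(w) - P(w^*)\bigr].
```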
46 The first inequality uses the previously obtained inequality for $\mathbb{E}\|v_t\|_2^2$, and the second inequality uses the convexity of P(w), which implies that $-(w_{t-1} - w^*)^\top \nabla P(w_{t-1}) \le P(w^*) - P(w_{t-1})$. [sent-129, score-0.149]
47 We consider a fixed stage s, so that $\tilde{w} = \tilde{w}_{s-1}$ and $\tilde{w}_s$ is selected after all of the updates have completed. [sent-130, score-0.135]
48 Due to the poor condition number, the standard batch gradient descent requires n ln(1/ε) iterations over the data to achieve an accuracy of ε, which means we have to process n² ln(1/ε) examples. [sent-142, score-0.134]
49 In comparison, with SVRG we may choose η = 0.1/L and m = O(n) to obtain a convergence rate of α = 1/2. [sent-144, score-0.104]
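For reference, the geometric rate behind this arithmetic has the following shape (our paraphrase of Theorem 1; the exact constants should be checked against the paper):

```latex
\mathbb{E}\,\bigl[P(\tilde{w}_s) - P(w^*)\bigr]
\;\le\; \alpha^{s}\,\bigl[P(\tilde{w}_0) - P(w^*)\bigr],
\qquad
\alpha \;=\; \frac{1}{\gamma\,\eta\,(1 - 2L\eta)\,m} \;+\; \frac{2L\eta}{1 - 2L\eta}.
```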
50 This rate is comparable to those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012], and the explicit variance reduction argument provides better intuition on why this method works. [sent-149, score-0.153]
51 The SVRG algorithm can also be applied to smooth but not strongly convex problems. [sent-151, score-0.16]
52 A convergence rate of O(1/T) may be obtained, which improves the standard SGD convergence rate of O(1/√T). [sent-152, score-0.208]
53 If the system is locally (strongly) convex, then Theorem 1 can be directly applied, which implies a local geometric convergence rate with a constant learning rate. [sent-154, score-0.104]
54 4 SDCA as Variance Reduction It can be shown that both SDCA and SAG are connected to SVRG in the sense they are also a variance reduction methods for SGD, although using different techniques. [sent-155, score-0.133]
55 In the following we present the variance reduction view of SDCA, which provides additional insights into these recently proposed fast convergence methods for stochastic optimization. [sent-156, score-0.25]
56 In SDCA, we consider the following problem with convex $\phi_i(w)$: $w^* = \arg\min_w P(w)$, where $P(w) = \frac{1}{n}\sum_{i=1}^{n}\phi_i(w) + 0.5\lambda\|w\|_2^2$. [sent-157, score-0.084]
57 The optimal dual variables satisfy $\alpha_i^* = -\nabla\phi_i(w^*)/(\lambda n)$. (10) Therefore in the SGD update (3), if we maintain a representation $w^{(t)} = \sum_{i=1}^{n}\alpha_i^{(t)}$, (11) then the update of α becomes $\alpha_i^{(t)} = (1-\eta_t\lambda)\,\alpha_i^{(t-1)} - \eta_t\nabla\phi_i(w^{(t-1)})$ for $i = i_t$, and $\alpha_i^{(t)} = (1-\eta_t\lambda)\,\alpha_i^{(t-1)}$ for $i \ne i_t$. (12) [sent-165, score-0.07]
58 =i (12) This update rule requires ηt → 0 when t → ∞. [sent-166, score-0.075]
59 If $i_t$ is drawn uniformly at random from {1, . . . , n}, then the SGD rule in (12) and the dual coordinate ascent rule (13) both yield the gradient descent rule $\mathbb{E}[w^{(t)} \mid w^{(t-1)}] = w^{(t-1)} - \eta_t \nabla P(w^{(t-1)})$. [sent-174, score-0.339]
60 Therefore both can be regarded as different realizations of the more general stochastic gradient rule in (4). [sent-175, score-0.167]
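A sketch of one step of the update (12) maintained through the representation (11) (illustrative code under our own naming; grad_phi is assumed to be the gradient of $\phi_{i_t}$ evaluated at the current w):

```python
import numpy as np

def sgd_dual_representation_step(alpha, i_t, grad_phi, lam, eta_t):
    """One step of rule (12): shrink every alpha_i by (1 - eta_t * lam),
    subtract eta_t * grad_phi for the drawn index i_t, and recover
    w^{(t)} = sum_i alpha_i from the representation (11).

    alpha: (n, d) float array with current w = alpha.sum(axis=0)."""
    alpha *= (1.0 - eta_t * lam)        # applied to every alpha_i
    alpha[i_t] -= eta_t * grad_phi      # extra gradient term for i = i_t
    return alpha.sum(axis=0)            # the new primal iterate w^{(t)}
```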
61 This is because according to (10), when the primal-dual parameters (w, α) converge to the optimal parameters $(w^*, \alpha^*)$, we have $(\nabla\phi_i(w) + \lambda n \alpha_i) \to 0$, which means that even if the learning rate $\eta_t$ stays bounded away from zero, the procedure can converge. [sent-177, score-0.086]
62 This is the same effect as SVRG, in the sense that the variance goes to zero asymptotically: as $w \to w^*$ and $\alpha \to \alpha^*$, we have $\frac{1}{n}\sum_{i=1}^{n}\|\nabla\phi_i(w) + \lambda n \alpha_i\|_2^2 \to 0$. [sent-178, score-0.113]
63 That is, SDCA is also a variance reduction method for SGD, which is similar to SVRG. [sent-179, score-0.133]
64 From this discussion, we can view SVRG as an explicit variance reduction technique for SGD which is similar to SDCA. [sent-180, score-0.153]
65 In all the figures, the x-axis is computational cost measured by the number of gradient computations divided by n. [sent-185, score-0.102]
66 For SGD, it is the number of passes to go through the training data, and for SVRG in the nonconvex case (neural nets), it includes the additional computation of $\nabla\psi_i(\tilde{w})$ both in each iteration and for computing the gradient average $\tilde{\mu}$. [sent-186, score-0.293]
67 For SVRG in our convex case, however, $\nabla\psi_i(\tilde{w})$ does not have to be re-computed in each iteration. [sent-187, score-0.084]
68 Since in this case the gradient is always a multiple of $x_i$, [sent-188, score-0.104]
69 i.e., $\nabla\psi_i(w) = \phi_i'(w^\top x_i)\,x_i$ where $\psi_i(w) = \phi_i(w^\top x_i)$, $\nabla\psi_i(\tilde{w})$ can be compactly saved in memory by only saving the scalars $\phi_i'(\tilde{w}^\top x_i)$, with the same memory consumption as SDCA and SAG. [sent-190, score-0.074]
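A sketch of this memory trick for linear models (illustrative; dphi(z, y_i) is assumed to be the scalar derivative $\phi_i'$, e.g. for the logistic loss $\phi(z) = \log(1 + e^{-yz})$ it is $-y/(1 + e^{yz})$):

```python
import numpy as np

def snapshot_scalars(X, y, w_tilde, dphi):
    """Store n scalars phi_i'(w_tilde^T x_i) instead of n full gradients."""
    return np.array([dphi(X[i] @ w_tilde, y[i]) for i in range(len(y))])

def snapshot_grad(i, X, scalars):
    """Reconstruct grad psi_i(w_tilde) = phi_i'(w_tilde^T x_i) * x_i."""
    return scalars[i] * X[i]
```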
70 25 0 50 #grad / n 100 (c) 1E+02 MNIST convex: training loss residual P(w)-P(w*) 1E-01 1E-04 1E-07 1E-10 1E-13 50 100 0 #grad / n SVRG SDCA SGD-best SGD:0. [sent-196, score-0.147]
71 001 Variance (b) (b Training loss - optimum Training loss (a) 0. [sent-197, score-0.135]
72 28 MNIST convex: update variance 1E+02 1E-01 1E-04 1E-07 1E-10 1E-13 1E-16 0 50 #grad / n SVRG SGD-best SGD-best/η(t) 100 SDCA SGD:0. [sent-205, score-0.122]
73 (b) Training loss residual P (w)−P (w∗ ); comparison with best-tuned SGD with learning rate scheduling and SDCA. [sent-209, score-0.22]
74 That is, when a relatively large learning rate η is used with SGD, training loss drops fast at first, but it oscillates above the minimum and never goes down to the minimum. [sent-217, score-0.245]
75 Therefore, to accelerate SGD, one has to start with relatively large η and gradually decrease it (learning rate scheduling), as commonly practiced. [sent-219, score-0.102]
76 By contrast, using a single relatively large value of η, SVRG smoothly goes down faster than SGD. [sent-220, score-0.087]
77 Figure 2 (b) and (c) compare SVRG with the best-tuned SGD with learning rate scheduling and with SDCA. [sent-223, score-0.125]
78 (Not surprisingly, the best-tuned SGD with learning rate scheduling outperformed the best-tuned SGD with a fixed learning rate throughout our experiments.) [sent-225, score-0.186]
79 Figure 2 (b) shows the training loss residual, which is the training loss minus the optimum (estimated by running gradient descent for a very long time): P(w) − P(w*). [sent-227, score-0.373]
80 We observe that SVRG’s loss residual goes down exponentially, which is in line with Theorem 1, and that SVRG is competitive with SDCA (the two lines are almost overlapping) and decreases faster than SGD-best. [sent-228, score-0.155]
81 In Figure 2 (c), we show the variance of the SVRG update $-\eta(\nabla\psi_i(w) - \nabla\psi_i(\tilde{w}) + \tilde{\mu})$ in comparison with the variance of the SGD update $-\eta(t)\nabla\psi_i(w)$ and SDCA. [sent-230, score-0.244]
82 As expected, the variance of both SVRG and SDCA decreases as optimization proceeds, while the variance of SGD with a fixed learning rate ('SGD:0.001') stays high. [sent-231, score-0.263]
83 The variance of the best-tuned SGD decreases, but this is due to the forced exponential decay of the learning rate; the variance of the gradients $\nabla\psi_i(w)$ (the dotted line labeled 'SGD-best/η(t)') stays high. [sent-233, score-0.324]
84 Figure 3 shows more convex-case results (L2-regularized logistic regression) in terms of training loss residual (top) and test error rate (bottom) on rcv1, protein, and covtype. [sent-235, score-0.252]
85 As protein and covtype do not come with labeled test data, we randomly split the training data into halves to make the training/test split. [sent-238, score-0.108]
[Figures 3–4: convex-case results on rcv1, covtype, and CIFAR10 (training loss residual and test error rate versus #grad / n), and nonconvex neural-net results on MNIST and CIFAR10 (training loss and update variance versus #grad / n); curves compare SVRG, SDCA, SGD-best, and SGD-best/η(t).]
91 In Figure 4 we confirm that the results are similar to the convex case; [sent-301, score-0.084]
92 i.e., SVRG reduces the variance and smoothly converges faster than the best-tuned SGD with learning rate scheduling, which is a de facto standard method for neural net training. [sent-303, score-0.238]
93 As said earlier, methods such as SDCA and SAG are not practical for neural nets due to their memory requirement. [sent-304, score-0.077]
94 However, further investigation, in particular with larger/deeper neural nets for which training cost is a critical issue, is still needed. [sent-306, score-0.116]
95 6 Conclusion This paper introduces an explicit variance reduction method for stochastic gradient descent methods. [sent-307, score-0.345]
96 For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of SDCA and SAG. [sent-308, score-0.299]
97 Moreover, unlike SDCA or SAG, this method does not require the storage of gradients, and thus is more easily applicable to complex problems such as structured prediction or neural network learning. [sent-310, score-0.077]
98 C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008. [sent-322, score-0.106]
99 S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv preprint, 2012. [sent-343, score-0.137]
100 T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004. [sent-348, score-0.205]
wordName wordTfidf (topN-words)
[('svrg', 0.697), ('sgd', 0.405), ('sdca', 0.325), ('grad', 0.22), ('wt', 0.156), ('nonconvex', 0.153), ('ws', 0.118), ('sag', 0.089), ('gradient', 0.088), ('variance', 0.087), ('convex', 0.084), ('gi', 0.084), ('roux', 0.077), ('scheduling', 0.064), ('rate', 0.061), ('training', 0.052), ('loss', 0.052), ('option', 0.052), ('mnist', 0.048), ('nets', 0.047), ('descent', 0.046), ('reduction', 0.046), ('le', 0.045), ('convergence', 0.043), ('gt', 0.043), ('residual', 0.043), ('vt', 0.041), ('strongly', 0.04), ('protein', 0.04), ('rule', 0.04), ('stochastic', 0.039), ('dual', 0.039), ('zhang', 0.038), ('gradients', 0.036), ('inequality', 0.036), ('smooth', 0.036), ('update', 0.035), ('cifar', 0.035), ('optimum', 0.031), ('decay', 0.028), ('logistic', 0.028), ('storage', 0.028), ('goes', 0.026), ('slows', 0.025), ('ascent', 0.025), ('stays', 0.025), ('schmidt', 0.024), ('indicative', 0.024), ('relatively', 0.023), ('wm', 0.023), ('expectation', 0.023), ('leon', 0.022), ('minw', 0.022), ('simpler', 0.022), ('remedy', 0.021), ('coordinate', 0.021), ('tong', 0.021), ('hsieh', 0.02), ('net', 0.02), ('explicit', 0.02), ('faster', 0.02), ('ln', 0.019), ('introduces', 0.019), ('francis', 0.019), ('nicolas', 0.019), ('smoothly', 0.018), ('regression', 0.018), ('accelerate', 0.018), ('fast', 0.018), ('reduced', 0.018), ('intuitive', 0.018), ('stage', 0.017), ('multiclass', 0.017), ('problems', 0.017), ('neural', 0.017), ('enjoys', 0.017), ('insights', 0.017), ('asymptotically', 0.016), ('test', 0.016), ('arxiv', 0.016), ('xi', 0.016), ('et', 0.016), ('summing', 0.015), ('prediction', 0.015), ('baidu', 0.015), ('facto', 0.015), ('legends', 0.015), ('preferring', 0.015), ('spotting', 0.015), ('webpage', 0.015), ('slow', 0.015), ('optimization', 0.014), ('computations', 0.014), ('yi', 0.014), ('decreases', 0.014), ('squares', 0.014), ('oscillates', 0.013), ('unsuitable', 0.013), ('memory', 0.013), ('explanation', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Author: Rie Johnson, Tong Zhang
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
2 0.25602305 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Author: Shai Shalev-Shwartz, Tong Zhang
Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1
3 0.22940288 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. 1
4 0.17012689 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). 1
5 0.15415654 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin
Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and trying to bound the other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.
6 0.10974785 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
7 0.10393975 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
8 0.08537364 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
9 0.082955599 268 nips-2013-Reflection methods for user-friendly submodular optimization
10 0.061305877 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
11 0.057374097 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
12 0.053191483 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
13 0.052672446 89 nips-2013-Dimension-Free Exponentiated Gradient
14 0.052162725 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
15 0.050660282 75 nips-2013-Convex Two-Layer Modeling
16 0.050396331 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
17 0.049912997 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
18 0.049659867 99 nips-2013-Dropout Training as Adaptive Regularization
19 0.048253819 318 nips-2013-Structured Learning via Logistic Regression
20 0.044393167 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.118), (1, -0.001), (2, 0.09), (3, -0.07), (4, 0.032), (5, 0.026), (6, -0.185), (7, 0.103), (8, 0.183), (9, 0.029), (10, 0.011), (11, 0.096), (12, 0.056), (13, -0.169), (14, -0.01), (15, 0.078), (16, -0.062), (17, 0.077), (18, -0.019), (19, 0.069), (20, 0.047), (21, 0.003), (22, -0.032), (23, 0.001), (24, -0.028), (25, 0.056), (26, -0.009), (27, 0.014), (28, -0.067), (29, -0.021), (30, -0.011), (31, -0.002), (32, -0.044), (33, 0.01), (34, 0.091), (35, 0.086), (36, -0.068), (37, 0.015), (38, 0.004), (39, -0.057), (40, -0.002), (41, -0.056), (42, 0.01), (43, 0.035), (44, -0.053), (45, 0.016), (46, 0.018), (47, -0.032), (48, -0.014), (49, -0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.94295728 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Author: Rie Johnson, Tong Zhang
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
2 0.83387262 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. 1
3 0.82249457 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). 1
4 0.8120985 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Author: Shai Shalev-Shwartz, Tong Zhang
Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1
5 0.80674046 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
Author: Tianbao Yang
Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods in optimizing regularized loss minimization problems, but they have received little study in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1
6 0.76827508 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
7 0.56839514 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
8 0.56662774 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
9 0.45589957 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
10 0.42849287 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
11 0.42169929 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
12 0.41941682 318 nips-2013-Structured Learning via Logistic Regression
13 0.41752729 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
14 0.4152959 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
15 0.37915868 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
16 0.37679642 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
17 0.37458405 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
18 0.37205341 268 nips-2013-Reflection methods for user-friendly submodular optimization
19 0.36279514 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
20 0.34283403 89 nips-2013-Dimension-Free Exponentiated Gradient
topicId topicWeight
[(2, 0.014), (16, 0.034), (33, 0.176), (34, 0.079), (41, 0.119), (49, 0.019), (56, 0.086), (57, 0.203), (70, 0.028), (85, 0.019), (89, 0.023), (93, 0.04), (95, 0.035), (99, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.80816454 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Author: Rie Johnson, Tong Zhang
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
2 0.77753937 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
Author: Srini Turaga, Lars Buesing, Adam M. Packer, Henry Dalgleish, Noah Pettit, Michael Hausser, Jakob Macke
Abstract: Simultaneous recordings of the activity of large neural populations are extremely valuable as they can be used to infer the dynamics and interactions of neurons in a local circuit, shedding light on the computations performed. It is now possible to measure the activity of hundreds of neurons using 2-photon calcium imaging. However, many computations are thought to involve circuits consisting of thousands of neurons, such as cortical barrels in rodent somatosensory cortex. Here we contribute a statistical method for “stitching” together sequentially imaged sets of neurons into one model by phrasing the problem as fitting a latent dynamical system with missing observations. This method allows us to substantially expand the population-sizes for which population dynamics can be characterized—beyond the number of simultaneously imaged neurons. In particular, we demonstrate using recordings in mouse somatosensory cortex that this method makes it possible to predict noise correlations between non-simultaneously recorded neuron pairs. 1
3 0.76508272 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making tree-based algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
4 0.76178038 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. 1
5 0.75755835 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). 1
6 0.75186652 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
7 0.74777776 11 nips-2013-A New Convex Relaxation for Tensor Completion
8 0.73958498 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
9 0.73248434 257 nips-2013-Projected Natural Actor-Critic
10 0.72436935 189 nips-2013-Message Passing Inference with Chemical Reaction Networks
11 0.72327423 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
12 0.7226693 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
13 0.7173655 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
14 0.71688753 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
15 0.71111161 54 nips-2013-Bayesian optimization explains human active search
16 0.70949638 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
17 0.70926231 301 nips-2013-Sparse Additive Text Models with Low Rank Background
18 0.70684826 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
19 0.70572931 201 nips-2013-Multi-Task Bayesian Optimization
20 0.70506847 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result