
19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent


Source: pdf

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].
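
The page only describes the method at a high level. As a rough illustration of the mini-batch dual coordinate ascent idea (not the paper's accelerated algorithm), here is a minimal sketch for ridge regression. The function name, the choice of squared loss, and the conservative 1/b averaging of the per-example steps are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def minibatch_sdca_ridge(X, y, lam, batch_size=8, epochs=20, seed=0):
    """Plain (non-accelerated) mini-batch SDCA sketch for ridge regression:
        min_w  (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2  +  (lam/2) * ||w||^2
    Maintains dual variables alpha and the primal iterate w = X^T alpha / (lam * n)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = (X * X).sum(axis=1)          # ||x_i||^2, precomputed once
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Closed-form single-coordinate ascent step for the squared loss,
            # evaluated for every example in the batch at the current w.
            residual = y[batch] - X[batch] @ w - alpha[batch]
            delta = residual / (1.0 + sq_norms[batch] / (lam * n))
            # Conservative aggregation (an assumption of this sketch): average
            # the individual steps. The dual objective is concave and each
            # single-coordinate step is an ascent step, so the averaged
            # update is also an ascent step.
            delta /= len(batch)
            alpha[batch] += delta
            w += X[batch].T @ delta / (lam * n)
    return w

# Tiny usage example on synthetic data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 10))
    y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)
    w = minibatch_sdca_ridge(X, y, lam=0.1, batch_size=16)
    print("train MSE:", np.mean((X @ w - y) ** 2))
```

The accelerated method of the paper additionally carries a Nesterov-style momentum sequence on top of such batch updates; that part is omitted here.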


reference text

Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5451–5452. IEEE, 2012.

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.

Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.

Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.

Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574, 2011.

Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Protocols for learning classifiers on distributed data. arXiv preprint arXiv:1202.6078, 2012.

Ofer Dekel. Distribution-calibrated hierarchical classification. In NIPS, 2010.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012.

John Duchi, Alekh Agarwal, and Martin J. Wainwright. Distributed dual averaging in networks. Advances in Neural Information Processing Systems, 23, 2010.

Olivier Fercoq and Peter Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint arXiv:1309.5885, 2013.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258, 2012.

Phil Long and Rocco Servedio. Algorithms and hardness results for parallel large margin learning. In NIPS, 2011.

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.

Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.

Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730, 2011.

Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012a.

Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873, 2012b.

Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, Feb 2013a.

Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. arXiv, 2013b.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In ICML, pages 807–814, 2007.

Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. arXiv, 2013.

Footnotes from the paper:

2. It should be noted that one can use our results for Lipschitz functions as well by smoothing the loss function (see Nesterov [2005]). By doing so, we can interpolate between the 1/ε² rate of the non-accelerated method and the 1/ε rate of accelerated gradient. (A sketch of one such smoothed loss follows below.)

3. There are a few exceptions in the context of stochastic coordinate descent in the primal. See for example Bradley et al. [2011], Richtárik and Takáč [2012b].
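
Footnote 2 refers to handling non-smooth (Lipschitz) losses by first smoothing them in the sense of Nesterov [2005]. As an illustration, the sketch below shows one standard smoothed hinge loss with smoothing parameter gamma; this particular parameterization is a common choice from the SDCA literature, not a formula quoted from this page. Choosing gamma on the order of the target accuracy is the usual way to obtain the interpolation between the 1/ε² and 1/ε rates that the footnote mentions.

```python
import numpy as np

def smoothed_hinge(z, gamma=0.1):
    """A gamma-smoothed hinge loss: identical to the hinge max(0, 1 - z)
    when z >= 1 or z <= 1 - gamma, and quadratic in between, so it is
    (1/gamma)-smooth and differs from the hinge by at most gamma/2."""
    z = np.asarray(z, dtype=float)
    return np.where(
        z >= 1.0,
        0.0,
        np.where(z <= 1.0 - gamma,
                 1.0 - z - gamma / 2.0,
                 (1.0 - z) ** 2 / (2.0 * gamma)),
    )

# e.g. smoothed_hinge(np.array([-1.0, 0.5, 0.95, 2.0]), gamma=0.1)
```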