NIPS 2012, Paper 263 (nips2012-263)
Author: Xi Chen, Qihang Lin, Javier Peña
Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems in which both the loss function and the regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss it achieves the optimal rate of O(1/N + 1/N^2) for N iterations, which improves on the O(log N / N) rate of previous regularized dual averaging algorithms. In addition, our method constructs the final solution directly from a proximal mapping instead of by averaging all previous iterates. For widely used sparsity-inducing regularizers (e.g., the ℓ1-norm), this has the advantage of encouraging sparser solutions. We further develop a multi-stage extension that uses the proposed algorithm as a subroutine and achieves the uniformly optimal rate O(1/N + exp{−N}) for strongly convex loss.
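The abstract describes dual-averaging updates built around the proximal mapping of the regularizer. For orientation only, here is a minimal sketch of the basic (non-accelerated) RDA update of Xiao [21] with an ℓ1 regularizer, whose proximal mapping is soft-thresholding. The step-size rule, function names, and least-squares oracle below are illustrative assumptions; this is not the optimal multi-stage algorithm proposed in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def rda_l1(grad_oracle, dim, lam=0.1, gamma=1.0, n_iters=1000):
    """Plain RDA with regularizer lam*||w||_1 and prox function (1/2)*||w||^2."""
    w = np.zeros(dim)
    g_bar = np.zeros(dim)              # running average of stochastic subgradients
    for t in range(1, n_iters + 1):
        g = grad_oracle(w)             # stochastic (sub)gradient of the loss at w
        g_bar = ((t - 1) * g_bar + g) / t
        alpha = gamma / np.sqrt(t)     # beta_t / t with beta_t = gamma * sqrt(t)
        # w_{t+1} = argmin_w  <g_bar, w> + lam*||w||_1 + (alpha/2)*||w||^2
        w = soft_threshold(-g_bar / alpha, lam / alpha)
    return w

# Illustrative usage: a noisy least-squares gradient oracle on synthetic data.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 50)), rng.normal(size=200)

def oracle(w):
    i = rng.integers(200)              # sample one data point per query
    return A[i] * (A[i] @ w - b[i])    # gradient of (1/2)(a_i^T w - b_i)^2

w_hat = rda_l1(oracle, dim=50)
```

The sketch only shows where the proximal step enters; the paper's contribution lies in modifying this kind of scheme so that, for strongly convex losses, the last prox-mapped iterate (rather than an average of iterates) attains the optimal rate.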
[1] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems (NIPS), 2011.
[2] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Technical report, Microsoft Research, 2011.
[3] J. Duchi, P. L. Bartlett, and M. Wainwright. Randomized smoothing for stochastic optimization. arXiv:1103.4296v1, 2011.
[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.
[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Conference on Learning Theory (COLT), 2010.
[6] J. Duchi and Y. Singer. Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.
[7] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Conference on Learning Theory (COLT), 2011.
[8] C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (NIPS), 2009.
[9] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. August 2010.
[10] G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, Part I: a generic algorithmic framework. Technical report, University of Florida, 2010.
[11] G. Lan and S. Ghadimi. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, Part II: shrinking procedures and optimal algorithms. Technical report, University of Florida, 2010.
[12] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
[13] S. Lee and S. J. Wright. Manifold identification of dual averaging methods for regularized stochastic online learning. In International Conference on Machine Learning (ICML), 2011.
[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[15] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons, New York, 1983.
[16] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2003.
[17] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120:221–259, 2009.
[18] A. Rakhlin, O. Shamir, and K. Sridharan. To average or not to average? making stochastic gradient descent optimal for strongly convex problems. In International Conference on Machine Learning (ICML), 2012.
[19] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1):267–288, 1996.
[20] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.
[21] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[22] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67(2):301–320, 2005.