
73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization


Source: pdf

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1-norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization.
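To make the per-iteration update described in the abstract concrete, the sketch below is a minimal illustration, not the paper's reference implementation. It assumes the ℓ1 regularizer λ‖w‖₁ together with a squared-ℓ2 auxiliary term scaled by γ√t; the names `l1_rda_step`, `rda_online`, and `subgradient` are hypothetical. Under these assumptions the RDA subproblem has a closed-form solution: each coordinate of the averaged subgradient is truncated at λ, which is what produces exactly sparse iterates.

```python
import numpy as np

def l1_rda_step(g_bar, t, lam, gamma):
    """One closed-form l1-RDA step (sketch; assumes a squared-l2 auxiliary term).

    Minimizes  <g_bar, w> + lam*||w||_1 + (gamma / (2*sqrt(t))) * ||w||_2^2
    over w, where g_bar is the average of the first t loss subgradients.
    """
    # Coordinates whose averaged subgradient lies within [-lam, lam] are set
    # exactly to zero; this entry-wise truncation is the source of sparsity.
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk


def rda_online(samples, dim, subgradient, lam=0.01, gamma=1.0):
    """Illustrative online loop; `subgradient(w, sample)` is a placeholder oracle."""
    w = np.zeros(dim)
    g_bar = np.zeros(dim)
    for t, sample in enumerate(samples, start=1):
        g = subgradient(w, sample)
        g_bar += (g - g_bar) / t   # running average of all past subgradients
        w = l1_rda_step(g_bar, t, lam, gamma)
    return w
```

Note that, unlike stochastic subgradient steps, the whole regularization term enters each minimization unchanged, so coefficients can be set exactly to zero at any iteration rather than merely shrunk toward it.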


reference text

[1] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. MIT Press, Cambridge, MA, 2008.

[2] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[3] L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[4] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, Alberta, Canada, 2004.

[5] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

[6] Yu. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.

[7] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.

[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. Technical report, Technion, 2008. To appear in SIAM Journal on Imaging Sciences.

[9] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, Washington DC, 2003.

[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[11] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009. Appeared earlier as CORE Discussion Paper 2005/67, Catholic University of Louvain, Center for Operations Research and Econometrics.

[12] Yu. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

[13] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. Technical Report MSR-TR-2009-100, Microsoft Research, 2009.

[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 272–279, 2008.

[15] P. Carbonetto, M. Schmidt, and N. de Freitas. An interior-point stochastic approximation method and an ℓ1-regularized delta rule. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 233–240. MIT Press, 2009.

[16] S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research, 9:313–337, 2008.

[17] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In Proceedings of Neural Information Processing Systems (NIPS), December 2009.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Dataset available at http://yann.lecun.com/exdb/mnist.

[19] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.

[20] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.