nips nips2009 nips2009-22 nips2009-22-reference knowledge-graph by maker-knowledge-mining

22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning


Source: pdf

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization that preserves the computational simplicity and scalability of gradient methods. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods, including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended to online learning, resulting in a simple algorithm with the best regret bounds currently known for these problems.
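As a rough illustration of the accelerated composite-gradient template that SAGE builds on, the sketch below combines an unbiased stochastic gradient of the smooth loss with a soft-thresholding proximal step (for an ℓ1 regularizer) and Nesterov-style extrapolation. The function names, step-size rule, and momentum sequence are generic FISTA-style choices made for illustration; they are not the exact SAGE update from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_stochastic_proximal_gradient(grad_sample, x0, lam, L, n_iters=1000):
    """
    Generic accelerated proximal gradient loop with stochastic gradients,
    for objectives of the form E[loss(x)] + lam * ||x||_1.

    grad_sample(x) -- returns an unbiased stochastic gradient of the smooth loss at x
    lam            -- l1 regularization weight
    L              -- estimate of the smoothness constant (step size 1/L)
    """
    x_prev = x0.copy()
    y = x0.copy()
    t_prev = 1.0
    for _ in range(n_iters):
        g = grad_sample(y)                            # stochastic gradient at the extrapolated point
        x = soft_threshold(y - g / L, lam / L)        # proximal (composite) gradient step
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)   # Nesterov extrapolation
        x_prev, t_prev = x, t
    return x_prev
```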


reference text

[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814, Corvallis, Oregon, USA, 2007.

[2] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, 2009.

[3] J. Duchi and Y. Singer. Online and batch learning using forward looking subgradients. Technical report, 2009.

[4] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, Montreal, Quebec, Canada, 2009.

[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20. 2008.

[6] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935, Helsinki, Finland, 2008.

[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547, 1983.

[8] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, September 2007.

[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.

[10] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267–288, 1996.

[11] S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems 21. 2009.

[12] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Quebec, Canada, 2009.

[13] G. Lan. An optimal method for stochastic composite optimization. Technical report, School of Industrial and Systems Engineering, Georgia Institute of Technology, 2009.

[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2003.

[15] S.M. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems 21. 2009.

[16] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[17] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, USA, March 2008.

[18] V. Sindhwani and S.S. Keerthi. Large scale semi-supervised linear SVMs. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval, pages 477–484, Seattle, WA, USA, 2006.

[19] Y. Song, W.Y. Chen, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering. In Proceedings of the European Conference on Machine Learning, pages 374–389, Antwerp, Belgium, 2008.