
313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization


Source: pdf

Author: Julien Mairal

Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems.
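To make the abstract's description concrete, the sketch below illustrates in Python the kind of update a stochastic proximal gradient instance performs for ℓ1-regularized logistic regression: each iteration majorizes the logistic loss of one randomly drawn sample by a quadratic surrogate and minimizes the surrogate plus the ℓ1 penalty in closed form via soft-thresholding. This is a simplified illustration under stated assumptions, not the paper's exact surrogate-averaging scheme; the function names, step-size schedule, and toy data are hypothetical.

```python
# A minimal sketch (assumptions, not the paper's exact algorithm): a plain
# stochastic proximal gradient loop for l1-regularized logistic regression.
# Each iteration majorizes the logistic loss of one random sample with a
# quadratic surrogate and minimizes surrogate + l1 penalty in closed form
# (soft-thresholding). The paper's scheme additionally averages surrogates
# across iterations; names, step sizes, and data here are illustrative.
import numpy as np

def soft_threshold(v, tau):
    # Closed-form minimizer of (1/2)||theta - v||^2 + tau * ||theta||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def stochastic_prox_grad_l1_logreg(X, y, lam=0.05, n_iters=20000, seed=0):
    # X: (n, d) features, y in {-1, +1}; returns a sparse estimate of theta.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = 0.25 * np.max(np.sum(X ** 2, axis=1))  # majorizes per-sample curvature
    theta = np.zeros(d)
    for t in range(n_iters):
        i = rng.integers(n)                    # draw one sample at random
        margin = y[i] * X[i].dot(theta)
        # Numerically stable sigmoid(-margin): derivative of log(1 + exp(-margin)).
        if margin >= 0:
            p = np.exp(-margin) / (1.0 + np.exp(-margin))
        else:
            p = 1.0 / (1.0 + np.exp(margin))
        grad = -p * y[i] * X[i]                # gradient of the sampled loss
        step = 1.0 / (L * np.sqrt(t + 1.0))    # decreasing step, O(1/sqrt(t))
        # Minimize the quadratic surrogate plus l1 penalty: one prox step.
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Toy usage on synthetic data with a 3-sparse ground truth.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = np.sign(X.dot(w_true) + 0.1 * rng.standard_normal(500))
print(np.round(stochastic_prox_grad_l1_logreg(X, y), 2))
```

The soft-thresholding step is exactly the minimizer of the quadratic majorizer plus the ℓ1 term, which is what keeps each iteration cheap even when the data set is very large or streamed one sample at a time.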


reference text

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

[2] J.M. Borwein and A.S. Lewis. Convex analysis and nonlinear optimization. Springer, 2006.

[3] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks, 1998.

[4] L. Bottou and O. Bousquet. The trade-offs of large scale learning. In Adv. NIPS, 2008.

[5] O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. J. Roy. Stat. Soc. B, 71(3):593–613, 2009.

[6] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.

[7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.

[8] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE T. Signal Process., 57(12):4686–4698, 2009.

[9] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. Technical report, 2013.

[10] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, 2007.

[11] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proc. COLT, 2011.

[12] C. Hu, J. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Adv. NIPS, 2009.

[13] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In Proc. AISTATS, 2010.

[14] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133:365–397, 2012.

[15] K. Lange, D.R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. J. Comput. Graph. Stat., 9(1):1–20, 2000.

[16] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777–801, 2009.

[17] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Adv. NIPS, 2012.

[18] J. Mairal. Optimization with first-order surrogate functions. In Proc. ICML, 2013. arXiv:1305.3120.

[19] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010.

[20] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Adv. NIPS, 2010.

[21] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models, 89, 1998.

[22] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optimiz., 19(4):1574–1609, 2009.

[23] Y. Nesterov. Gradient methods for minimizing composite objective functions. Technical report, CORE Discussion Paper, 2007.

[24] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717v1, 2012.

[25] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proc. COLT, 2009.

[26] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In Proc. ICML, 2009.

[27] A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

[28] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.

[29] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE T. Signal Process., 57(7):2479–2493, 2009.

[30] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596, 2010.