jmlr jmlr2011 jmlr2011-87 jmlr2011-87-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
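To make the first method concrete, the following is a minimal NumPy sketch of stochastic coordinate descent for the Lasso with squared loss: each iteration draws one feature uniformly at random and applies a soft-thresholding update to that coordinate only. The step size 1/L_j (with L_j the per-coordinate curvature), the incremental residual bookkeeping, and all function names are illustrative assumptions, not the paper's exact pseudocode.

import numpy as np

def soft_threshold(a, tau):
    # Shrink a toward zero by tau (the proximal operator of tau * |.|).
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def scd_lasso(X, y, lam, n_iters=10000, seed=0):
    # Sketch: minimize (1/(2m)) * ||X w - y||^2 + lam * ||w||_1 by updating
    # one uniformly random coordinate per iteration.
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    residual = -y.astype(float)            # residual = X w - y, kept up to date
    col_sq = (X ** 2).sum(axis=0) / m      # per-coordinate curvature for squared loss
    for _ in range(n_iters):
        j = rng.integers(d)                # feature chosen uniformly at random
        if col_sq[j] == 0.0:
            continue                       # all-zero feature: nothing to update
        g_j = X[:, j] @ residual / m       # partial derivative of the smooth part
        w_j_new = soft_threshold(w[j] - g_j / col_sq[j], lam / col_sq[j])
        residual += (w_j_new - w[j]) * X[:, j]
        w[j] = w_j_new
    return w

# Example usage on synthetic data: recover a 5-sparse weight vector.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(200)
w_hat = scd_lasso(X, y, lam=0.1)
print("nonzero coordinates:", int(np.sum(np.abs(w_hat) > 1e-6)))

The second method described in the abstract (stochastic mirror descent) instead updates the entire weight vector but touches only a single randomly drawn training example per iteration; it is not sketched here.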
References:
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
L. Bottou. Stochastic gradient descent examples. Web page, http://leon.bottou.org/projects/sgd.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168, 2008.
L. Bottou and Y. LeCun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137–151, 2005.
K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 922–931, 2008.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279, 2008.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 14–26, 2010.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
A. Genkin, D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.
C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, 2011. Available at http://www.ise.ufl.edu/glan/papers/strongSCOSubmit.pdf.
A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210, 2001.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.
K. Koh, S. J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, pages 1–33, 2010.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems 21, pages 905–912, 2009.
N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72:7–35, 1992.
A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Nauka Publishers, Moscow, 1978.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical Report CORE Discussion Paper 2010/02, Center for Operations Research and Econometrics, UCL, Belgium, 2010.
S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In International Conference on Machine Learning, pages 928–935, 2008.
S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In International Conference on Machine Learning, pages 929–936, 2009.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In International Conference on Machine Learning, pages 807–814, 2007.
S. Shalev-Shwartz, T. Zhang, and N. Srebro. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
P. Tseng and S. Yun. A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140:513–535, 2009.
M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences USA, 98(20):11462–11467, 2001.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49:682–691, 2003.
T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4:5–31, 2001.