jmlr jmlr2009 jmlr2009-87 jmlr2009-87-reference knowledge-graph by maker-knowledge-mining

87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Source: pdf

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsiﬁcation from no sparsiﬁcation to total sparsiﬁcation. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsiﬁcation result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and ﬁnd for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

reference text

Arthur Asuncion and David J. Newman. UCI machine learning repository, 2007. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/∼mlearn/MLRepository.html. 799 L ANGFORD , L I AND Z HANG Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classiﬁers in the massive data setting. Journal of Machine Learning Research, 9:313–337, 2008. Bob Carpenter. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, April 2008. Nicolò Cesa-Bianchi, Philip M. Long, and Manfred Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996. Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 20 (NIPS-07), 2008. Ofer Dekel, Shai Shalev-Schwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a ﬁxed budget. In Advances in Neural Information Processing Systems 18 (NIPS-05), pages 259–266, 2006. John Duchi and Yoram Singer. Online and batch learning using forward looking subgradients. Unpublished manuscript, September 2008. John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efﬁcient projections onto the ℓ1 -ball for learning in high dimensions. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML-08), pages 272–279, 2008. Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997. John Langford, Lihong Li, and Alexander L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/∼vw/. Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efﬁcient sparse coding algorithms. In Advances in Neural Information Processing Systems 19 (NIPS-06), pages 801–808, 2007. David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004. Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithms. Machine Learning, 2(4):285–318, 1988. Nick Littlestone, Philip M. Long, and Manfred K. Warmuth. On-line learning of linear functions. Computational Complexity, 5(2):1–23, 1995. Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML-07), 2007. Karl Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA, June 2005. Version 2.0, http://www2.imm.dtu.dk/pubdb/p.php?3897. 800 S PARSE O NLINE L EARNING VIA T RUNCATED G RADIENT Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B., 58(1):267–288, 1996. Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 919–926, 2004. Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 928–936, 2003. 801