jmlr jmlr2007 jmlr2007-10 jmlr2007-10-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, which uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.
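The abstract summarizes the problem and the warm-start path idea only at a high level. The short Python sketch below is an illustration under assumptions, not the authors' interior-point code: it sets up an ℓ1-regularized logistic regression problem on synthetic data and traces an approximate regularization path with warm starts, using scikit-learn's generic "saga" solver in place of the paper's custom method. The data dimensions, noise level, and regularization grid are arbitrary choices made purely for illustration.

    # Illustrative sketch only (not the paper's interior-point solver).
    # Problem being solved, in the standard form used for l1-regularized
    # logistic regression:
    #   minimize (1/m) * sum_i log(1 + exp(-b_i * (a_i^T w + v))) + lambda * ||w||_1
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    m, n = 200, 50                                   # examples, features (assumed sizes)
    A = rng.standard_normal((m, n))
    w_true = np.zeros(n)
    w_true[:5] = rng.standard_normal(5)              # only 5 features are truly relevant
    b = np.where(A @ w_true + 0.1 * rng.standard_normal(m) > 0, 1, -1)

    # scikit-learn parameterizes the penalty through C (roughly 1/(m*lambda)):
    # small C = strong l1 penalty (sparse solution), large C = weak penalty.
    # warm_start=True reuses the previous coefficients as the starting point,
    # mirroring the warm-start idea the abstract describes for computing the
    # regularization path efficiently.
    clf = LogisticRegression(penalty="l1", solver="saga",
                             warm_start=True, max_iter=5000)
    for C in np.logspace(-2, 2, 20):                 # strong penalty -> weak penalty
        clf.set_params(C=C)
        clf.fit(A, b)
        nnz = int(np.count_nonzero(clf.coef_))
        print(f"C = {C:9.4f}   nonzero coefficients = {nnz}")

Run on the synthetic data above, the printed count of nonzero coefficients grows as the penalty weakens, which is the qualitative behavior of the regularization path the abstract refers to.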
E. Allgower and K. Georg. Continuation and path following. Acta Numerica, 2:1–64, 1993.
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96:6745–6750, 1999.
S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting, 2006. Manuscript. Available from www.stat.rutgers.edu/~madigan/papers/.
O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
D. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
A. Bhusnurmath and C. Taylor. Solving the graph cut problem via ℓ1 norm minimization, 2007. University of Pennsylvania CIS Technical Report MS-CIS-07-10.
J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming, 2006. To appear in Optimization and Engineering.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. Boyd, L. Vandenberghe, A. El Gamal, and S. Yun. Design of robust global power and ground networks. In Proceedings of the ACM/SIGDA International Symposium on Physical Design (ISPD), pages 60–65, 2001.
E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2005.
E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
E. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
K. Chaloner and K. Larntz. Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21:191–208, 1989.
S. Chen and D. Donoho. Basis pursuit. In Proceedings of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44, 1994.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
J. Claerbout and F. Muir. Robust modeling of erratic data. Geophysics, 38(5):826–844, 1973.
A. Conn, N. Gould, and Ph. Toint. LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization (Release A), volume 17 of Springer Series in Computational Mathematics. Springer-Verlag, 1992.
D. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34(2):187–220, 1972.
J. Dahl, V. Roychowdhury, and L. Vandenberghe. Maximum likelihood estimation of Gaussian graphical models: Numerical implementation and topology selection, 2005. Submitted. Available from www.ee.ucla.edu/~vandenbe/covsel.html.
A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 41–48. MIT Press, 2005.
R. Dembo and T. Steihaug. Truncated-Newton algorithms for large-scale unconstrained optimization. Mathematical Programming, 26:190–212, 1983.
J. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
D. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, March 2003.
D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B, 57(2):301–337, 1995.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
J. Friedman, T. Hastie, and R. Tibshirani. Pathwise coordinate optimization, 2007. Manuscript. Available from www-stat.stanford.edu/~hastie/pub.htm.
H. Fu, M. Ng, M. Nikolova, and J. Barlow. Efficient minimization methods of mixed ℓ2-ℓ1 and ℓ1-ℓ1 norms for image restoration. SIAM Journal on Scientific Computing, 27(6):1881–1902, 2006.
A. Genkin, D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization, 2006. To appear in Technometrics. Available from www.stat.rutgers.edu/~madigan/papers/.
A. George and J. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.
P. Gill, W. Murray, M. Saunders, and M. Wright. User's guide for NPSOL (Version 4.0): A FORTRAN package for nonlinear programming. Technical Report SOL 86-2, Operations Research Department, Stanford University, Stanford, California 94305, January 1986.
G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third edition, 1996.
T. Golub, D. Slonim, P. Tamayo, C. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.
A. Hassibi, J. How, and S. Boyd. Low-authority controller design via convex optimization. In Proceedings of the IEEE Conference on Decision and Control, pages 140–145, 1999.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1:1–29, 2007.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York, 2001. ISBN 0-387-95284-5.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16:375–390, 2006.
P. Komarek. Logistic Regression for Data Mining and High-Dimensional Classification. PhD thesis, Carnegie Mellon University, 2004.
B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005. ISSN 0162-8828.
K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning (ICML), pages 331–339, 1995.
S. Lee, H. Lee, P. Abbeel, and A. Ng. Efficient ℓ1-regularized logistic regression. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), 2006.
S. Levy and P. Fullagar. Reconstruction of a sparse spike train from a portion of its spectrum and application to high-resolution deconvolution. Geophysics, 46(9):1235–1243, 1981.
C.-J. Lin, R. Weng, and S. Keerthi. Trust region Newton methods for large-scale logistic regression, 2007. To appear in Proceedings of the 24th International Conference on Machine Learning (ICML).
M. Lobo, M. Fazel, and S. Boyd. Portfolio optimization with linear and fixed transaction costs. Annals of Operations Research, 2005.
J. Lokhorst. The LASSO and generalised linear models, 1999. Honors Project, Department of Statistics, The University of Adelaide, South Australia, Australia.
D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, second edition, 1984.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996. Available from www.cs.cmu.edu/~mccallum/bow.
P. McCullagh and J. Nelder. Generalized Linear Models. Chapman & Hall/CRC, second edition, 1989.
T. Minka. A comparison of numerical optimizers for logistic regression, 2003. Technical report. Available from research.microsoft.com/~minka/papers/logreg/.
MOSEK ApS. The MOSEK Optimization Tools Version 2.5. User's Manual and Reference, 2002. Available from www.mosek.com.
S. Nash. A survey of truncated-Newton methods. Journal of Computational and Applied Mathematics, 124:45–59, 2000.
Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Methods in Convex Programming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, 1994.
D. Newman, S. Hettich, C. Blake, and C. Merz. UCI repository of machine learning databases, 1998. Available from www.ics.uci.edu/~mlearn/MLRepository.html.
A. Ng. Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 78–85, New York, NY, USA, 2004. ACM Press. ISBN 1-58113-828-5.
J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
D. Oldenburg, T. Scheuer, and S. Levy. Recovery of the acoustic impedance from reflection seismograms. Geophysics, 48(10):1318–1337, 1983.
M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
M.-Y. Park and T. Hastie. An ℓ1 regularization-path algorithm for generalized linear models, 2006a. To appear in Journal of the Royal Statistical Society, Series B. Available from www-stat.stanford.edu/~hastie/pub.htm.
M.-Y. Park and T. Hastie. Regularization path algorithms for detecting gene interactions, 2006b. Manuscript. Available from www-stat.stanford.edu/~hastie/Papers/glasso.pdf.
S. Perkins and J. Theiler. Online feature selection using grafting. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 592–599. ACM Press, 2003.
B. Polyak. Introduction to Optimization. Optimization Software, 1987. Translated from Russian.
L. Portugal, M. Resende, G. Veiga, and J. Júdice. A truncated primal-infeasible dual-feasible network interior point method. Networks, 35:91–108, 2000.
S. Rosset. Tracking curved regularized optimization solution paths. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths, 2007. To appear in Annals of Statistics.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
A. Ruszczynski. Nonlinear Optimization. Princeton University Press, 2006.
T. Ryan. Modern Regression Methods. Wiley, 1997.
S. Shevade and S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
N. Z. Shor. Minimization Methods for Non-differentiable Functions. Springer Series in Computational Mathematics. Springer, 1985.
H. Taylor, S. Banks, and J. McCoy. Deconvolution with the ℓ1 norm. Geophysics, 44(1):39–52, 1979.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
R. Tibshirani. The Lasso for variable selection in the Cox model. Statistics in Medicine, 16:385–395, 1997.
R. Tibshirani, M. Saunders, S. Rosset, and J. Zhu. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.
J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
L. Vandenberghe and S. Boyd. A primal-dual potential reduction method for problems involving matrix inequalities. Mathematical Programming, 69:205–236, 1995.
R. Vanderbei. LOQO User's Manual — Version 3.10, 1997. Available from www.orfe.princeton.edu/loqo.
M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression, 2007. To appear in Advances in Neural Information Processing Systems (NIPS) 19. Available from http://www.eecs.berkeley.edu/~wainwrig/Pubs/publist.html#High-dimensional.
S. Wright. Primal-Dual Interior-Point Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997. ISBN 0-89871-382-X.
Y. Ye. Interior Point Algorithms: Theory and Analysis. John Wiley & Sons, 1997.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
Z. Zhang, J. Kwok, and D. Yeung. Surrogate maximization/minimization algorithms for AdaBoost and the logistic regression model. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 927–934, New York, NY, USA, 2004. ACM Press.
P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties, 2007. Technical Report 703, Department of Statistics, UC Berkeley. Available from www.stat.berkeley.edu/~binyu/ps/703.pdf.
P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 49–56, Cambridge, MA, 2004. MIT Press.
H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):262–286, 2006.
H. Zou, T. Hastie, and R. Tibshirani. On the degrees of freedom of the Lasso, 2007. To appear in Annals of Statistics.