
77 jmlr-2007-Stagewise Lasso


Source: pdf

Author: Peng Zhao, Bin Yu

Abstract: Many statistical machine learning algorithms minimize either an empirical loss function, as in AdaBoost, or a penalized empirical loss, as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. As this tuning parameter changes, a regularization "path" of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter that optimizes prediction or interpretation performance. Algorithms such as homotopy-Lasso (or LARS-Lasso) and Forward Stagewise Fitting (FSF, also known as e-Boosting) are of great interest because the sparse models they produce aid interpretation in addition to prediction. In this paper, we propose the BLasso algorithm, which ties the FSF (e-Boosting) algorithm to the Lasso method that minimizes the L1-penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (an L1-penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. When the number of base learners is finite and the Hessian of the loss function is bounded, the BLasso path is shown to converge to the Lasso path as the stepsize goes to zero. For cases with more base learners than the sample size and a sparse true model, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences of the loss and penalty functions, and not on their derivatives, it is simple to implement and applies broadly to such penalized loss minimization problems.
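To make the forward/backward iteration described above concrete, here is a minimal sketch (not the authors' reference implementation) of a BLasso-style path for the squared-error loss. The function name blasso_path, the stepsize eps, the tolerance xi, and the stopping rule are illustrative choices based on the abstract's description: the forward step greedily reduces the empirical loss as in FSF/e-Boosting, and the backward step shrinks an active coefficient toward zero only when doing so lowers the L1-penalized loss.

import numpy as np


def blasso_path(X, y, eps=0.01, xi=1e-10, max_steps=1000):
    """Illustrative sketch of a BLasso-style coordinate path for the
    squared-error loss L(beta) = ||y - X @ beta||^2 / (2 n).
    eps is the fixed stepsize, xi the backward-step tolerance."""
    n, p = X.shape

    def loss(b):
        r = y - X @ b
        return 0.5 * (r @ r) / n

    def move(b, j, s):
        b2 = b.copy()
        b2[j] += s
        return b2

    def best_forward(b):
        # Forward step (as in FSF/e-Boosting): the +/- eps coordinate
        # move that reduces the empirical loss the most.
        best = None
        for j in range(p):
            for s in (eps, -eps):
                val = loss(move(b, j, s))
                if best is None or val < best[0]:
                    best = (val, j, s)
        return best  # (new loss, coordinate, signed step)

    beta = np.zeros(p)

    # Initial forward step also fixes the starting regularization parameter.
    new_loss, j, s = best_forward(beta)
    lam = (loss(beta) - new_loss) / eps
    beta = move(beta, j, s)
    path = [(lam, beta.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break

        # Candidate backward step: shrink one active coefficient by eps
        # toward zero, choosing the one with the smallest resulting loss.
        back = None
        for j in np.flatnonzero(beta):
            s = -np.sign(beta[j]) * eps
            val = loss(move(beta, j, s))
            if back is None or val < back[0]:
                back = (val, j, s)

        penalized = loss(beta) + lam * np.abs(beta).sum()
        if back is not None and (
            back[0] + lam * np.abs(move(beta, back[1], back[2])).sum()
            < penalized - xi
        ):
            # Backward step lowers the L1-penalized loss: take it, keep lambda.
            beta = move(beta, back[1], back[2])
        else:
            # Otherwise take a forward step and relax lambda if necessary.
            new_loss, j, s = best_forward(beta)
            lam = min(lam, (loss(beta) - new_loss) / eps)
            beta = move(beta, j, s)

        path.append((lam, beta.copy()))

    return path

The returned path pairs each value of the regularization parameter with its coefficient vector. Per the convergence statement in the abstract, for a finite dictionary and a bounded Hessian, a smaller stepsize eps should track the Lasso path more closely.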


Reference text

E.L. Allgower and K. Georg. Homotopy methods for approximating several solutions to nonlinear systems of equations. In W. Forster, editor, Numerical Solution of Highly Nonlinear Problems, pages 253-270. North-Holland, 1980.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Arcing classifiers. The Annals of Statistics, 26:801-824, 1998.

P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 2003.

E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics (to appear), 2007.

S. Chen and D. Donoho. Basis pursuit. Technical report, Department of Statistics, Stanford University, 1994.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2002.

D. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm near-solution approximates the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797-829, 2006.

D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

I. Frank and J. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35:109-148, 1993.

Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256-285, 1995.

Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156. Morgan Kaufmann, San Francisco, 1996.

J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.

J.H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337-407, 2000.

W.J. Fu. Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397-416, 1998.

J. Gao, H. Suzuki, and B. Yu. Approximate lasso methods for language modeling. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, pages 225-232, 2006.

T. Gedeon, A.E. Parker, and A.G. Dimitrov. Information distortion and neural coding. Canadian Applied Mathematics Quarterly, 2002.

T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2001.

T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone lasso. Technical report, Department of Statistics, Stanford University, 2006.

K. Knight and W.J. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28:1356-1378, 2000.

L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers, 1999.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436-1462, 2005.

N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics (to appear), 2006.

M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389-403, 2000a.

M.R. Osborne, B. Presnell, and B.A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319-337, 2000b.

S. Rosset. Tracking curved regularized optimization solution paths. In Advances in Neural Information Processing Systems, 2004.

S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941-973, 2004.

R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control and Computing, 1999.

J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.

V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

M.J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. Technical Report 709, Department of Statistics, UC Berkeley, 2006.

C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics (to appear), 2006.

T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682-691, 2003.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541-2563, 2006.

J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems 16, 2003.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, 2006.