
27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning


Source: pdf

Author: Francis R. Bach

Abstract: We consider the least-squares regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimension larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm, where all spaces have dimension one and where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of the group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for nonlinear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite-dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non-adaptive scheme is not satisfied.

Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
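For concreteness, the block ℓ1-norm penalty described in the abstract corresponds to the following objective (a standard formulation of the group Lasso, reconstructed here in common notation; the group weights d_j and the exact normalization are assumptions, not taken verbatim from the paper):

```latex
% Group Lasso objective: squared loss plus a sum of Euclidean norms over blocks.
% w is partitioned into m blocks w_1, ..., w_m; d_j > 0 are group weights and
% lambda_n is the regularization parameter. When every block has dimension one,
% the penalty reduces to the l1-norm and the problem is the usual Lasso.
\min_{w \in \mathbb{R}^p} \;
  \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - w^\top x_i \bigr)^2
  \;+\; \lambda_n \sum_{j=1}^{m} d_j \, \lVert w_j \rVert_2
```

The groupwise sparsity this penalty induces is easiest to see through its proximal operator. Below is a minimal sketch of proximal-gradient minimization of the objective above, assuming unit group weights; `group_lasso`, `group_soft_threshold`, and the step-size rule are illustrative choices, not the paper's own algorithm:

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    """Block soft-thresholding: proximal operator of tau * sum_j ||w_j||_2.

    Each block is shrunk toward zero, and set exactly to zero when its
    Euclidean norm is at most tau -- this is what produces groupwise sparsity.
    `groups` is a list of index arrays partitioning the coordinates of w.
    """
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        out[idx] = 0.0 if norm <= tau else (1.0 - tau / norm) * w[idx]
    return out

def group_lasso(X, y, groups, lam, n_iter=1000):
    """Minimize (1/(2n)) ||y - X w||^2 + lam * sum_j ||w_j||_2 by proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # inverse Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n          # gradient of the smooth least-squares term
        w = group_soft_threshold(w - step * grad, groups, step * lam)
    return w
```

For example, with `groups = [np.arange(0, 3), np.arange(3, 6)]` and a response generated from the first block only, a moderate `lam` typically drives the second block exactly to zero while merely shrinking the first; this group selection behavior is what the paper's consistency analysis characterizes.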


reference text

F. R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the International Conference on Machine Learning (ICML), 2008a.
F. R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008b.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning (ICML), 2004a.
F. R. Bach, R. Thibaux, and M. I. Jordan. Computing regularization paths for learning multiple kernels. In Advances in Neural Information Processing Systems 17, 2004b.
C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2003.
O. Bousquet and D. J. L. Herrmann. On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems 15, 2003.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004.
P. Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999.
H. Brezis. Analyse Fonctionnelle. Masson, 1980.
A. Caponnetto and E. de Vito. Fast rates for regularized least-squares algorithm. Technical Report 248/AI Memo 2005-013, CBCL, Massachusetts Institute of Technology, 2005.
F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
R. Durrett. Probability: Theory and Examples. Duxbury Press, third edition, 2004.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, F. R. Bach, and A. Gretton. Statistical convergence of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(8), 2007.
A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
Z. Harchaoui and F. R. Bach. Image classification with segmentation graph kernels. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
A. Juditsky and A. Nemirovski. Functional aggregation for nonparametric regression. Annals of Statistics, 28(3):681–712, 2000.
K. Knight and W. Fu. Asymptotics for Lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004a.
G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004b.
M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193–228, 1998.
J. McAuley, J. Ming, D. Stewart, and P. Hanna. Subband correlation and robust speech recognition. IEEE Transactions on Speech and Audio Processing, 13(5):956–964, 2005.
L. Meier, S. van de Geer, and P. Bühlmann. The group Lasso for logistic regression. Technical Report 131, Eidgenössische Technische Hochschule (ETH), Zürich, Switzerland, 2006.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Technical Report 720, Department of Statistics, UC Berkeley, 2006.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.
A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems 22, 2008.
A. Rényi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10:441–451, 1959.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. V. H. Winston and Sons, 1977.
A. W. van der Vaart. Asymptotic Statistics. Cambridge Univ. Press, 1998.
M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
G. Wahba. Spline Models for Observational Data. SIAM, 1990.
M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. Technical Report 709, Department of Statistics, UC Berkeley, 2006.
Q. Wu, Y. Ying, and D.-X. Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23(1):108–134, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161, 2007.
P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
D. Zhou and C. J. C. Burges. Spectral clustering and transductive learning with multiple views. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
H. Zhu, C. K. I. Williams, R. Rohwer, and M. Morciniec. Gaussian regression and optimal finite dimensional linear models. In Neural Networks and Machine Learning. Springer-Verlag, 1998.
H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.