
88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms


Source: pdf

Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach

Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.

Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
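The norm described in the abstract admits a compact definition: for a weight vector w and a collection G of groups of variable indices, Omega(w) is the sum over groups g in G of the Euclidean norm of the sub-vector w restricted to g, where the groups may overlap. As a minimal illustration of this definition only (not of the paper's pattern or active set algorithms), the Python sketch below computes Omega for a hand-picked set of overlapping groups; the function name and the example groups are illustrative assumptions, not taken from the paper.

import numpy as np

def structured_norm(w, groups):
    # Omega(w) = sum over groups g of the Euclidean norm of the sub-vector w[g].
    # Groups are iterables of variable indices and are allowed to overlap.
    w = np.asarray(w, dtype=float)
    return sum(np.linalg.norm(w[list(g)]) for g in groups)

# Example with p = 4 variables and two overlapping groups {0, 1, 2} and {2, 3}
# (variable 2 belongs to both). Singleton groups would recover the plain
# l1-norm; a partition of the variables would recover the usual group l1-norm.
w = np.array([0.0, 1.5, -2.0, 0.5])
groups = [(0, 1, 2), (2, 3)]
print(structured_norm(w, groups))  # ||(0, 1.5, -2)||_2 + ||(-2, 0.5)||_2 ~= 4.56

In the regularized empirical risk minimization setting of the abstract, a term of the form lambda * Omega(w) would be added to a data-fitting loss such as least squares; since each group enters through a Euclidean norm, entire sub-vectors are encouraged to vanish jointly, and the overlaps between groups determine which nonzero patterns the solutions are allowed to take.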


Reference text

R. J. Adler. An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. IMS, 1990.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008a.
F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008b.
F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the International Conference on Machine Learning (ICML), 2008c.
F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2009.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011. To appear.
R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982–2001, 2010.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006.
S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
P. J. Cameron. Combinatorics: Topics, Techniques, Algorithms. Cambridge University Press, 1994.
V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.
F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2010.
N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), 2006.
J. P. Doignon and J. C. Falmagne. Knowledge Spaces. Springer-Verlag, 1998.
C. Dossal. A necessary and sufficient condition for exact recovery by ℓ1 minimization. Technical report, HAL-00164738:1, 2007.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
W. Fu and K. Knight. Asymptotics for Lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.
J. J. Fuchs. Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608, 2005.
A. Gramfort and M. Kowalski. Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging, 2009.
H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.
L. He and L. Carin. Exploiting structure in wavelet-based Bayesian compressive sensing. IEEE Transactions on Signal Processing, 57:3488–3497, 2009.
J. Huang and T. Zhang. The benefit of group sparsity. Annals of Statistics, 38(4):1978–2004, 2010.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010a.
R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010b.
R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. In International Workshop on Pattern Recognition in Neuroimaging (PRNI), 2011a.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011b.
K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
S. Kim and E. P. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, 2007.
E. Levina, A. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested Lasso penalty. Annals of Applied Statistics, 2(1):245–263, 2008.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1):19–60, 2010a.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Technical report, preprint arXiv:1104.1872, 2011. To appear in Journal of Machine Learning Research.
A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo. Structured sparsity in structured prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011.
P. Massart. Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour 23. Springer, 2003.
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72(4):417–473, 2010.
C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6(2):1099, 2006.
Y. Nardi and A. Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electronic Journal of Statistics, 2:605–633, 2008.
S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, 2009.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1–22, 2009.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9:319–337, 2000.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
F. Rapaport, E. Barillot, and J.-P. Vert. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375–i382, 2008.
R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
V. Roth and B. Fischer. The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
P. Soille. Morphological Image Analysis: Principles and Applications. Springer, 2003.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pages 267–288, 1996.
K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3 - a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1):545–581, 1999.
R. H. Tütüncü, K. C. Toh, and M. J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, 95(2):189–217, 2003.
G. Varoquaux, R. Jenatton, A. Gramfort, G. Obozinski, B. Thirion, and F. Bach. Sparse structured dictionary learning for brain resting-state activity modeling. In NIPS Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions, 2010.
M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202, 2009.
Z. J. Xiang, Y. T. Xi, U. Hasson, and P. J. Ramadge. Boosting with spatial regularization. In Advances in Neural Information Processing Systems, 2009.
G. X. Yuan, K. W. Chang, C. J. Hsieh, and C. J. Lin. Comparison of optimization methods and software for large-scale ℓ1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
T. Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization. Annals of Statistics, 37(5A):2109–2144, 2009.
P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.