jmlr jmlr2013 jmlr2013-76 jmlr2013-76-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri
Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to obtain a sparse nonparametric model, thereby avoiding the restrictions of linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to state-of-the-art methods.
Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS
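The abstract's central quantity is a derivative-based measure of variable relevance: each variable a contributes the empirical norm of the partial derivative ∂f/∂x_a evaluated at the training points (the penalty written as Ω̂_1^D in the notation table below). The following is a minimal sketch of that penalty for a kernel expansion f(·) = Σ_i α_i k(x_i, ·) with a Gaussian kernel; the data X, coefficients alpha, and bandwidth sigma are illustrative placeholders rather than quantities from the paper, and this snippet is not the paper's full regularization algorithm.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(s, t) = exp(-||s - t||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def empirical_gradients(X, alpha, sigma):
    """Partial derivatives of f(t) = sum_i alpha_i k(x_i, t) at the training points.

    Returns G with G[j, a] = (df/dt_a)(x_j); for the Gaussian kernel,
    d k(x_i, t) / d t_a = k(x_i, t) * (x_{i,a} - t_a) / sigma^2.
    """
    K = gaussian_kernel(X, X, sigma)                 # K[i, j] = k(x_i, x_j)
    diff = X[:, None, :] - X[None, :, :]             # diff[i, j, a] = x_{i,a} - x_{j,a}
    dK = K[:, :, None] * diff / sigma ** 2           # dK[i, j, a] = d k(x_i, t)/dt_a at t = x_j
    return np.einsum('i,ija->ja', alpha, dK)         # G[j, a] = sum_i alpha_i dK[i, j, a]

def omega_hat(X, alpha, sigma):
    """Empirical derivative-based penalty: sum_a sqrt((1/n) sum_j (df/dx_a(x_j))^2)."""
    G = empirical_gradients(X, alpha, sigma)
    return np.sqrt((G ** 2).mean(axis=0)).sum()

# toy usage with placeholder data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
alpha = rng.standard_normal(50)
print(omega_hat(X, alpha, sigma=1.0))
```

A variable on which f does not depend produces a near-zero column of G, hence a near-zero contribution to the penalty, which is the intuition behind using this term for variable selection.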
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
H. Attouch and R. Wets. Quantitative stability of variational systems. I. The epigraphical distance. Transactions of the American Mathematical Society, 328(2):695–729, 1991.
H. Attouch and R. Wets. Quantitative stability of variational systems. II. A framework for nonlinear conditioning. SIAM Journal on Optimization, 3(2):359–381, 1993a.
H. Attouch and R. Wets. Quantitative stability of variational systems. III. ε-approximate solutions. Mathematical Programming, 61(2, Ser. A):197–214, 1993b.
F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical Report HAL 00413473, INRIA, 2009.
F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, 2004.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Table 3: List of symbols and notations

Spaces and distributions:
  X ⊆ R^d : input space
  Y ⊆ R : output space
  ρ : probability distribution on X × Y
  ρ_X : marginal distribution of ρ on X
  L^2(X, ρ_X) : {f : X → R measurable such that ∫_X f(x)^2 dρ_X(x) < +∞}
  H : RKHS, a subset of {f : X → Y}

Norms and scalar products:
  ‖·‖_n, ⟨·,·⟩_n : 1/√n times the Euclidean norm and scalar product on R^n
  ‖·‖_{ρ_X}, ⟨·,·⟩_{ρ_X} : norm and scalar product in L^2(X, ρ_X)
  ‖·‖_H, ⟨·,·⟩_H : norm and scalar product in H

Functionals and operators:
  Ω_1^D : H → [0, +∞), Ω_1^D(f) = Σ_{a=1}^d ( ∫_X (∂f(x)/∂x_a)^2 dρ_X(x) )^{1/2}
  Ω̂_1^D : H → [0, +∞), Ω̂_1^D(f) = Σ_{a=1}^d ( (1/n) Σ_{i=1}^n (∂f(x_i)/∂x_a)^2 )^{1/2}
  E : H → [0, +∞), E(f) = ∫ (f(x) − y)^2 dρ(x, y)
  E^τ : H → [0, +∞), E^τ(f) = ∫ (f(x) − y)^2 dρ(x, y) + τ(2Ω_1^D(f) + ν‖f‖_H^2)
  Ê : H → [0, +∞), Ê(f) = (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2
  Ê^τ : H → [0, +∞), Ê^τ(f) = (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2 + τ(2Ω̂_1^D(f) + ν‖f‖_H^2)
  I_k : H → L^2(X, ρ_X), (I_k f)(x) = ⟨f, k_x⟩_H
  Ŝ : H → R^n, Ŝ f = (f(x_1), ..., f(x_n))
  D_a : H → L^2(X, ρ_X), (D_a f)(x) = ⟨f, (∂_a k)_x⟩_H
  D̂_a : H → R^n, D̂_a f = (∂f/∂x_a(x_1), ..., ∂f/∂x_a(x_n))
  ∇ : H → (L^2(X, ρ_X))^d, ∇f = (D_a f)_{a=1}^d
  ∇̂ : H → (R^n)^d, ∇̂f = (D̂_a f)_{a=1}^d

Functions:
  k_x : X → R, t ↦ k(x, t)
  (∂_a k)_x : X → R, t ↦ ∂k(s, t)/∂s_a evaluated at s = x
  f_ρ^† : argmin of {Ω_1^D(f) + ν‖f‖_H^2} over f ∈ argmin E
  f^τ : the minimizer in H of E^τ
  f̂^τ : the minimizer in H of Ê^τ

Sets:
  R_ρ : {a ∈ {1, ..., d} : ∂f_ρ^†/∂x_a ≠ 0}
  R̂^τ : {a ∈ {1, ..., d} : ∂f̂^τ/∂x_a ≠ 0}
  B_n : {v ∈ R^n : ‖v‖_n ≤ 1}
  B_n^d : {v ∈ R^n : ‖v‖_n ≤ 1}^d

(A minimal numerical check of the kernel partial derivatives (∂_a k)_x is sketched after the reference list.)

S. Becker, J. Bobin, and E. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1–39, 2011.
J. Bect, L. Blanc-Féraud, G. Aubert, and A. Chambolle. A ℓ1-unified variational framework for image restoration. In Proceedings of the 8th European Conference on Computer Vision, Part IV, pages 1–13, Prague, Czech Republic, 2004.
M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
K. Bertin and G. Lecué. Selection of variables and dimension reduction in high-dimensional nonparametric regression. Electronic Journal of Statistics, 2:1224–1241, 2008.
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, Berlin, 2011.
E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
T. Chan, G. Golub, and P. Mulet. A nonlinear primal-dual method for total variation-based image restoration. SIAM Journal on Scientific Computing, 20, 1999.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.
P. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
P. L. Combettes, D. Dũng, and B. C. Vũ. Dualization of signal recovery problems. Set-Valued and Variational Analysis, 18(3-4):373–404, 2010.
L. Comminges and A. Dalalyan. Tight conditions for consistency of variable selection in the context of high dimensionality. Annals of Statistics, 40(5):2667–2696, 2012.
I. Daubechies, G. Teschke, and L. Vese. Iteratively solving linear inverse problems under general convex constraints. Inverse Problems and Imaging, 1(1):29–46, 2007.
E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6:883–904, 2005.
R. DeVore, G. Petrova, and P. Wojtaszczyk. Approximation of functions of few variables in high dimensions. Constructive Approximation, 33(1):125–143, 2011.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
A. L. Dontchev and T. Zolezzi. Well-posed Optimization Problems. Springer-Verlag, Berlin, 1993.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, December 2009.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
I. Ekeland and R. Temam. Convex Analysis and Variational Problems. North-Holland Publishing Co., Amsterdam, 1976.
M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.
C. Gu. Smoothing Spline ANOVA Models. Springer, New York, 2002.
E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.
T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Part I: Fundamentals. Springer, Berlin, 1993.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. Annals of Statistics, 38(6):3660–3695, 2010.
J. Lafferty and L. Wasserman. Rodeo: sparse, greedy nonparametric regression. Annals of Statistics, 36(1):28–63, 2008.
Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34:2272, 2006.
I. Loris. On the performance of algorithms for the minimization of ℓ1-penalized functionals. Inverse Problems, 25(3):035008, 2009.
I. Loris, M. Bertero, C. De Mol, R. Zanella, and L. Zanni. Accelerating gradient projection methods for ℓ1-constrained signal recovery by steplength selection rules. Applied and Computational Harmonic Analysis, 27(2):247–254, 2009.
A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.
H. Miller and P. Hall. Local polynomial regression and variable selection. In J. O. Berger, T. T. Cai, and I. M. Johnstone, editors, Borrowing Strength: Theory Powering Applications – a Festschrift for Lawrence D. Brown, volume 6, pages 216–233. Institute of Mathematical Statistics, 2010.
J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa. Solving structured sparsity regularization with proximal methods. In J. L. Balcázar, F. Bonchi, A. Gionis, and M. Sebag, editors, Machine Learning and Knowledge Discovery in Databases, volume 6322 of LNCS, pages 418–433. Springer, Berlin Heidelberg, 2010.
A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, New York, 1983.
Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269(3):543–547, 1983.
Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, September 2007.
I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory of Probability and its Applications, 30(1):143–148, 1985.
P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems. NIPS Foundation, 2008.
L. Rosasco, S. Mosci, M. Santoro, A. Verri, and S. Villa. A regularization approach to nonlinear variable selection. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4), 2012.
M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 286–297, 2007.
I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53:4655–4666, 2007.
P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2, Ser. B):263–295, 2010.
S. Villa, S. Salzo, L. Baldassarre, and A. Verri. SIOPT.
S. Villa, L. Rosasco, S. Mosci, and A. Verri. Consistency of learning algorithms using Attouch-Wets convergence. Optimization, 61(3):287–305, 2012.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), 1990.
G. Wahba, Y. Wang, C. Gu, R. Klein, and B. Klein. Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. Annals of Statistics, 23:1865–1895, 1995.
D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463, 2008.
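As a small complement to the symbol table above (and to the derivative reproducing property studied by Zhou, 2008), the sketch below checks numerically that the analytic partial derivative of a Gaussian kernel section, the object written (∂_a k)_x in Table 3, agrees with a central finite difference. The kernel bandwidth, the points s and t, and the coordinate index a are arbitrary placeholders chosen for illustration.

```python
import numpy as np

def k(s, t, sigma=1.0):
    """Gaussian kernel k(s, t) = exp(-||s - t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((s - t) ** 2) / (2 * sigma ** 2))

def dk_ds(s, t, a, sigma=1.0):
    """Analytic partial derivative d k(s, t) / d s_a (cf. (d_a k)_x in Table 3)."""
    return k(s, t, sigma) * (t[a] - s[a]) / sigma ** 2

rng = np.random.default_rng(1)
s, t = rng.standard_normal(4), rng.standard_normal(4)
a, eps = 2, 1e-6

analytic = dk_ds(s, t, a)
e = np.zeros(4)
e[a] = eps
finite_diff = (k(s + e, t) - k(s - e, t)) / (2 * eps)
print(analytic, finite_diff)  # the two values should agree closely
```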