nips nips2010 nips2010-7 nips2010-7-reference knowledge-graph by maker-knowledge-mining

7 nips-2010-A Family of Penalty Functions for Structured Sparsity


Source: pdf

Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
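The claim that the family subsumes the ℓ1 norm can be made concrete with a short worked computation. The form below is a sketch based on the variational structure suggested by the stationarity conditions (A.5) in the appendix; the paper's exact constants and notation for Γ and Ω are not restated here, so treat the 1/2 factor as an assumption.

```latex
% Sketch (assumed constants): with this variational form, the unconstrained
% choice \Lambda = \mathbb{R}^n_{++} recovers the \ell_1 norm, since by the
% arithmetic-geometric mean inequality
% \tfrac{1}{2}(\beta_j^2/\lambda_j + \lambda_j) \ge |\beta_j|,
% with equality at \lambda_j = |\beta_j|.
\Omega(\beta\,|\,\Lambda)
  \;=\; \inf_{\lambda \in \Lambda} \frac{1}{2} \sum_{j=1}^{n}
        \left( \frac{\beta_j^{2}}{\lambda_j} + \lambda_j \right),
\qquad
\Omega(\beta\,|\,\mathbb{R}^n_{++}) \;=\; \sum_{j=1}^{n} |\beta_j| \;=\; \|\beta\|_{1}.
```

Constraining λ further, for example to the wedge set W = {λ : λ_1 ≥ … ≥ λ_n > 0} studied in the appendix, couples the coordinates and produces structured sparsity patterns.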


reference text

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[2] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.

[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[4] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705–1732, 2009.

[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] J.M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.

[7] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 417–424. ACM, 2009.

[8] L. Jacob. Structured priors for supervised learning in computational biology. Ph.D. thesis, 2009.

[9] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML 26), 2009.

[10] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. arXiv:0904.3523v2, 2009.

[11] S. Kim and E.P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. Technical report, 2009. arXiv:0909.1373.

[12] K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

[13] K. Lounici, M. Pontil, A.B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In Proc. of the 22nd Annual Conference on Learning Theory (COLT), 2009.

[14] A.B. Owen. A robust hybrid of lasso and ridge regression. In Prediction and Discovery: AMS-IMS-SIAM Joint Summer Research Conference, Machine and Statistical Learning: Prediction and Discovery, volume 443, page 59, 2007.

[15] S.A. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36(2):614, 2008.

[16] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

[17] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

A Appendix

In this appendix we provide the proofs of Theorems 3.2 and 4.1.

A.1 Proof of Theorem 3.2

Before proving the theorem we require some additional notation. Given any two disjoint subsets J, K ⊆ N_n we define the region

Q_{J,K} = { β ∈ R^n : ‖β_J‖_2^2 / |J| > ‖β_K‖_2^2 / |K| }.

Note that the boundary of this region is determined by the zero set of a homogeneous polynomial of degree two. We also need the following construction.

Definition A.1. For every subset S ⊆ N_{n−1} we set k = |S| + 1 and label the elements of S in increasing order as S = {j_ℓ : ℓ ∈ N_{k−1}}. We associate with the subset S a contiguous partition of N_n, given by J(S) = {J_ℓ : ℓ ∈ N_k}, where we define J_ℓ := {j_{ℓ−1} + 1, …, j_ℓ} and set j_0 = 0 and j_k = n.

A subset S of N_{n−1} also induces two regions in R^n which play a central role in the identification of the wedge penalty. First, we describe the region which “crosses over” the induced partition J(S). This is defined to be the set

O_S := ∩ { Q_{J_ℓ, J_{ℓ+1}} : ℓ ∈ N_{k−1} }.    (A.1)

In other words, β ∈ O_S if the average of the squares of its components within each region J_ℓ strictly decreases with ℓ. The next region which is essential in our analysis is the “stays within” region, induced by the partition J(S). This region requires the notation J_{ℓ,q} := {j : j ∈ J_ℓ, j ≤ q} and is defined by the equation

I_S := ∩ { Q̄_{J_ℓ, J_{ℓ,q}} : q ∈ J_ℓ, ℓ ∈ N_k },    (A.2)

where Q̄ denotes the closure of the set Q. In other words, all vectors β within this region have the property that, for every set J_ℓ ∈ J(S), the average of the squares of a first segment of components of β within this set is not greater than the average over J_ℓ. We note that if S is the empty set the above notation should be interpreted as O_S = R^n and I_S = ∩ { Q̄_{N_n, N_q} : q ∈ N_n }. We also introduce, for every S ⊆ N_{n−1}, the sets U_S := O_S ∩ I_S ∩ (R\{0})^n.

We shall prove the following slightly more general version of Theorem 3.2.

Theorem A.1. The collection of sets U := {U_S : S ⊆ N_{n−1}} forms a partition of (R\{0})^n. For each β ∈ (R\{0})^n there is a unique S ⊆ N_{n−1} such that β ∈ U_S, and

Ω(β|W) = Σ_{ℓ ∈ N_k} √|J_ℓ| ‖β_{J_ℓ}‖_2,    (A.3)

where k = |S| + 1. Moreover, the components of the vector λ(β) := argmin{Γ(β, λ) : λ ∈ W} are given by the equations λ_j(β) = µ_ℓ, j ∈ J_ℓ, ℓ ∈ N_k, where

µ_ℓ = ‖β_{J_ℓ}‖_2 / √|J_ℓ|.    (A.4)
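Before turning to the proof, the closed form of Theorem A.1 can be illustrated with a short sketch. The function name wedge_lambda and the pooling strategy below are not taken from the paper; pooling adjacent blocks is simply one way to construct a partition whose block values µ_ℓ are strictly decreasing and which satisfies the optimality conditions (A.5)–(A.6) established in the proof.

```python
import numpy as np

def wedge_lambda(beta):
    """Sketch: lambda(beta) of (A.4) and Omega(beta|W) of (A.3) for the wedge set
    W = {lambda_1 >= ... >= lambda_n > 0}, assuming all beta_j are nonzero.

    The partition J(S) of Theorem A.1 is obtained by pooling adjacent blocks
    whose mean of beta_j^2 would otherwise fail to decrease, so that the block
    values mu_l^2 are strictly decreasing, as required by the definition of S.
    """
    beta = np.asarray(beta, dtype=float)
    blocks = []                      # each block stored as (sum of beta_j^2, block size)
    for s in beta ** 2:
        total, size = s, 1
        # pool while the previous block's mean does not strictly exceed this one's
        while blocks and blocks[-1][0] / blocks[-1][1] <= total / size:
            prev_total, prev_size = blocks.pop()
            total += prev_total
            size += prev_size
        blocks.append((total, size))
    # lambda_j(beta) = mu_l = ||beta_{J_l}||_2 / sqrt(|J_l|) for j in J_l   (equation (A.4))
    lam = np.concatenate([np.full(size, np.sqrt(total / size)) for total, size in blocks])
    # Omega(beta|W) = sum_l sqrt(|J_l|) * ||beta_{J_l}||_2                  (equation (A.3))
    omega = float(sum(np.sqrt(size * total) for total, size in blocks))
    return lam, omega
```

For instance, wedge_lambda([3.0, 1.0, 2.0]) pools the last two coordinates, giving λ(β) = (3, √2.5, √2.5) and Ω(β|W) = 3 + √10.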
Proof. First, let us observe that there are n − 1 inequality constraints defining W. It readily follows that all vectors in this constraint set are regular, in the sense of optimization theory, see [3, p. 279]. Hence, we can appeal to [3, Prop. 3.3.4, p. 316 and Prop. 3.3.6, p. 322], which state that λ ∈ R^n_{++} is a solution to the minimum problem determined by the wedge penalty if and only if there exists a vector α = (α_i : i ∈ N_{n−1}) with nonnegative components such that

−β_j^2 / λ_j^2 + 1 + α_{j−1} − α_j = 0,  j ∈ N_n,    (A.5)

where we set α_0 = α_n = 0. Furthermore, the following complementary slackness conditions hold true:

α_j (λ_{j+1} − λ_j) = 0,  j ∈ N_{n−1}.    (A.6)

To unravel these equations, we let S := {j : j ∈ N_{n−1}, λ_j > λ_{j+1}}, which is the subset of indexes corresponding to the constraints that are not tight. When k ≥ 2, we express this set in the form {j_ℓ : ℓ ∈ N_{k−1}}, where k = |S| + 1. As explained in Definition A.1, the set S induces the partition J(S) = {J_ℓ : ℓ ∈ N_k} of N_n. When k = 1 our notation should be interpreted to mean that S is empty and the partition J(S) consists only of N_n.

In this case it is easy to solve equations (A.5) and (A.6). In fact, all components of the vector λ have a common value, say µ > 0, and by summing both sides of equation (A.5) over j ∈ N_n we obtain that µ^2 = ‖β‖_2^2 / n. Moreover, summing both sides of the same equation over j ∈ N_q we obtain that

α_q = −Σ_{j ∈ N_q} β_j^2 / µ^2 + q

and, since α_q ≥ 0, we conclude that β ∈ I_S = U_S.

We now consider the case that k ≥ 2. Hence, the vector λ has equal components on each subset J_ℓ, which we denote by µ_ℓ, ℓ ∈ N_k. The definition of the set S implies that the µ_ℓ are strictly decreasing, and equation (A.6) implies that α_j = 0 for every j ∈ S. Summing both sides of equation (A.5) over j ∈ J_ℓ we obtain that

−(1/µ_ℓ^2) Σ_{j ∈ J_ℓ} β_j^2 + |J_ℓ| = 0,

from which equation (A.4) follows. Since the µ_ℓ are strictly decreasing, we conclude that β ∈ O_S. Moreover, choosing q ∈ J_ℓ and summing both sides of equation (A.5) over j ∈ J_{ℓ,q}, we obtain that

0 ≤ α_q = −‖β_{J_{ℓ,q}}‖_2^2 / µ_ℓ^2 + |J_{ℓ,q}|,

which implies that β ∈ Q̄_{J_ℓ, J_{ℓ,q}}. Since this holds for every q ∈ J_ℓ and ℓ ∈ N_k, we conclude that β ∈ I_S and, therefore, that β ∈ U_S.

In summary, we have shown that β ∈ U_S. In particular, this implies that the collection of sets U covers (R\{0})^n. Next, we show that the elements of U are disjoint. To this end, we observe that the computation described above can be reversed. That is to say, conversely, for any S ⊆ N_{n−1} and β ∈ U_S, the vectors α and λ defined above solve equations (A.5) and (A.6). Since the wedge penalty function is strictly convex, we know that equations (A.5) and (A.6) have a unique solution. Now, if β ∈ U_S ∩ U_{S′} then it must follow that λ = λ′. Consequently, since the vectors λ and λ′ are constant on the elements of their respective partitions J(S) and J(S′), and strictly decreasing from one element to the next in those partitions, it must be the case that S = S′.

We note that if some components of β are zero we may compute Ω(β|Λ) as a limiting process, since the function Ω(·|Λ) is continuous.
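The optimality conditions just used can also be checked numerically. The snippet below is a sketch that builds on the hypothetical wedge_lambda helper introduced after Theorem A.1: it recovers α by the prefix sums implied by equation (A.5) and verifies nonnegativity, α_n = 0, the complementary slackness conditions (A.6), and the ordering defining the wedge set.

```python
import numpy as np

# Sketch: numerical check of conditions (A.5)-(A.6) for the hypothetical
# wedge_lambda helper defined after Theorem A.1 (not code from the paper).
rng = np.random.default_rng(0)
beta = rng.standard_normal(8)                  # components are nonzero almost surely
lam, omega = wedge_lambda(beta)

# From (A.5), alpha_q = sum_{j <= q} (1 - beta_j^2 / lambda_j^2), with alpha_0 = 0.
alpha = np.cumsum(1.0 - beta**2 / lam**2)

assert np.all(np.diff(lam) <= 1e-12)               # lambda lies in the wedge W
assert np.all(alpha[:-1] >= -1e-9)                 # alpha_j >= 0 for j in N_{n-1}
assert abs(alpha[-1]) < 1e-9                       # alpha_n = 0
assert np.allclose(alpha[:-1] * np.diff(lam), 0.0) # complementary slackness (A.6)
```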
A.2 Proof of Theorem 4.1

We divide the proof into several steps. To this end, we define

E_ǫ(β, λ) := ‖y − Xβ‖_2^2 + 2ρ Γ(φ_ǫ(β), λ)

and let β(λ) := argmin{E_ǫ(α, λ) : α ∈ R^n}.

Step 1. We define two sequences, θ_k = E_ǫ(β^k, λ^{k−1}) and ν_k = E_ǫ(β^k, λ^k), and observe, for any k ≥ 2, that

θ_{k+1} ≤ ν_k ≤ θ_k ≤ ν_{k−1}.    (A.7)

These inequalities follow directly from the definition of the alternating algorithm, see equations (4.1) and (4.2).

Step 2. We define the compact set B = {β : β ∈ R^n, ‖β‖_1 ≤ θ_1}. From the first inequality in Proposition 2.1, namely ‖β‖_1 ≤ Ω(β|Λ), and inequality (A.7), we conclude, for every k ∈ N, that β^k ∈ B.

Step 3. We define a function g : R^n → R at β ∈ R^n as g(β) = min{E_ǫ(α, λ(φ_ǫ(β))) : α ∈ R^n}. We claim that g is continuous on B. In fact, there exists a constant κ > 0 such that, for every γ^1, γ^2 ∈ B, it holds that

|g(γ^1) − g(γ^2)| ≤ κ ‖λ(φ_ǫ(γ^1)) − λ(φ_ǫ(γ^2))‖_∞.    (A.8)

The essential ingredient in the proof of this inequality is the fact that, by our hypothesis on the set Λ, there exist constants a and b such that, for all β ∈ B, λ(φ_ǫ(β)) ∈ [a, b]^n. This fact follows by Danskin’s Theorem [6].

Step 4. By step 2, there exists a subsequence {β^{k_ℓ} : ℓ ∈ N} which converges to β̃ ∈ B and, for all β ∈ R^n and λ ∈ Λ, it holds that

E_ǫ(β̃, λ(φ_ǫ(β̃))) ≤ E_ǫ(β, λ(φ_ǫ(β̃))),   E_ǫ(β̃, λ(φ_ǫ(β̃))) ≤ E_ǫ(β̃, λ).    (A.9)

Indeed, from step 1 we conclude that there exists ψ ∈ R_{++} such that lim_{k→∞} θ_k = lim_{k→∞} ν_k = ψ. Since, under our hypothesis, the mapping β ↦ λ(β) is continuous for β ∈ (R\{0})^n, we conclude that lim_{ℓ→∞} λ^{k_ℓ} = λ(φ_ǫ(β̃)). By the definition of the alternating algorithm, we have, for all β ∈ R^n and λ ∈ Λ, that

θ_{k+1} = E_ǫ(β^{k+1}, λ^k) ≤ E_ǫ(β, λ^k),   ν_k = E_ǫ(β^k, λ^k) ≤ E_ǫ(β^k, λ).

From these inequalities we obtain, passing to the limit, inequalities (A.9).

Step 5. The vector (β̃, λ(φ_ǫ(β̃))) is a stationary point. Indeed, since Λ is admissible, by step 3, λ(φ_ǫ(β̃)) ∈ int(Λ). Therefore, since E_ǫ is continuously differentiable, this claim follows from step 4.

Step 6. The alternating algorithm converges. This claim follows from the fact that E_ǫ is strictly convex. Hence, E_ǫ has a unique global minimum in R^n × Λ, which, in virtue of inequalities (A.9), is attained at (β̃, λ(φ_ǫ(β̃))). The last claim in the theorem follows from the fact that the set {γ(ǫ) : ǫ > 0} is bounded and the function λ(β) is continuous.
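The alternating scheme analyzed above can be sketched as follows, purely as an illustration for the wedge penalty. The helper name alternating_wedge, the choice φ_ǫ(β)_j = √(β_j^2 + ǫ^2), and the constant in Γ are assumptions for this sketch (the exact definitions are in the main text and equations (4.1)–(4.2), which are not reproduced here); the λ-step reuses the hypothetical wedge_lambda helper from the sketch after Theorem A.1.

```python
import numpy as np

def alternating_wedge(X, y, rho, eps=1e-4, iters=100):
    """Sketch of an alternating scheme in the spirit of equations (4.1)-(4.2).

    Assumptions (not taken verbatim from the paper): phi_eps(beta)_j = sqrt(beta_j^2 + eps^2)
    and E_eps(beta, lam) = ||y - X beta||_2^2 + rho * sum_j (phi_eps(beta)_j^2 / lam_j + lam_j),
    so the beta-step reduces to the ridge-type solve (X^T X + rho * diag(1/lam)) beta = X^T y
    and the lambda-step is lam = lambda(phi_eps(beta)) from Theorem A.1.
    """
    n = X.shape[1]
    beta = np.zeros(n)
    lam = np.ones(n)
    for _ in range(iters):
        # beta-step: minimize E_eps(., lam), a strictly convex quadratic in beta
        beta = np.linalg.solve(X.T @ X + rho * np.diag(1.0 / lam), X.T @ y)
        # lambda-step: the wedge minimizer of Theorem A.1 at phi_eps(beta) (all entries > 0)
        lam, _ = wedge_lambda(np.sqrt(beta**2 + eps**2))
    return beta, lam
```

Tracking θ_k = E_ǫ(β^k, λ^{k−1}) and ν_k = E_ǫ(β^k, λ^k) inside the loop gives a direct numerical check of the monotone decrease (A.7) used in Step 1 of the proof.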