nips nips2010 nips2010-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. [sent-12, score-0.238]
2 We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. [sent-13, score-0.481]
3 This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. [sent-14, score-0.185]
4 Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. [sent-16, score-0.329]
5 Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods. [sent-17, score-0.242]
6 In its simplest form, this problem consists of estimating a regression vector β∗ ∈ Rn from a data vector y ∈ Rm, obtained from the model y = Xβ∗ + ξ, where X is an m × n matrix, which may be fixed or randomly chosen, and ξ ∈ Rm is a noise vector. [sent-19, score-0.192]
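As a concrete illustration of this data model (the sizes, sparsity level, and noise scale below are arbitrary choices, not taken from the paper), one could generate data as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 100                          # illustrative sizes with n > m (more features than samples)
X = rng.standard_normal((m, n))

beta_star = np.zeros(n)                 # sparse ground truth: only 10 nonzero coefficients
beta_star[:10] = rng.standard_normal(10)

xi = 0.1 * rng.standard_normal(m)       # noise vector
y = X @ beta_star + xi                  # observation model y = X beta* + xi
```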
7 An important rationale for sparse estimation comes from the observation that in many practical applications the number of parameters n is much larger than the data size m, but the vector β ∗ is known to be sparse, that is, most of its components are equal to zero. [sent-20, score-0.154]
8 Under these circumstances, it has been shown that regularization with the ℓ1 norm, commonly referred to as the Lasso method, provides an effective means to estimate the underlying regression vector as well as its sparsity pattern, see for example [4, 12, 15] and references therein. [sent-21, score-0.238]
9 In other words, not only do we expect that β ∗ is sparse but also that it is structured sparse, namely certain configurations of its nonzero components are to be preferred to others. [sent-23, score-0.259]
10 The prior knowledge that we consider in this paper is that the vector |β∗|, whose components are the absolute values of the corresponding components of β∗, should belong to some prescribed convex set Λ. [sent-33, score-0.356]
11 For certain choices of Λ this implies a constraint on the sparsity pattern as well. [sent-34, score-0.183]
12 To overcome this difficulty, we propose a novel family of penalty functions. [sent-37, score-0.308]
13 It is based on an extension of the ℓ1 norm used by the Lasso method and involves the solution of a smooth convex optimization problem, which incorporates the structured sparsity constraints. [sent-38, score-0.305]
14 As we shall see, a key property of our approach is that the penalty function equals the ℓ1 norm of a vector β when |β| ∈ Λ and it is strictly greater than the ℓ1 norm otherwise. [sent-39, score-0.481]
15 This observation suggests that the penalty function encourages the desired structured sparsity property. [sent-40, score-0.462]
16 Closest to our approach are penalty methods built around the idea of mixed ℓ1 − ℓ2 norms. [sent-42, score-0.275]
17 In particular, the group Lasso method [16] assumes that the components of the underlying regression vector β ∗ can be partitioned into prescribed groups, such that the restriction of β ∗ to a group is equal to zero for most of the groups. [sent-43, score-0.387]
18 Our point of view is different from theirs and provides a means of designing more general and flexible penalty functions which maintain convexity whilst modeling richer structures. [sent-46, score-0.308]
19 For example, we will demonstrate that our family of penalty functions can model sparsity patterns forming multiple connected regions of coefficients. [sent-47, score-0.42]
20 In particular, we describe the associated penalty function and establish some of its important properties. [sent-50, score-0.306]
21 In Section 3 we provide examples of penalty functions, deriving the explicit analytical form in some important cases, namely the cases in which the set Λ is a box or the wedge of vectors with nonincreasing coordinates. [sent-51, score-0.961]
22 2 Learning method In this section, we introduce the learning method and establish some important properties of the associated penalty function. [sent-54, score-0.306]
23 We prescribe a convex subset Λ of the positive orthant Rn++ and estimate β∗ by a solution of the convex optimization problem (2. [sent-56, score-0.156]
24 The penalty function takes the form Ω(β|Λ) = inf {Γ(β, λ) : λ ∈ Λ} (2. [sent-58, score-0.275]
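The function Γ is not written out in the extracted sentences; a common choice consistent with the properties stated here (and assumed in the sketch below) is the quadratic variational bound Γ(β, λ) = ½ Σi (βi²/λi + λi), whose unconstrained infimum over λ > 0 is exactly the ℓ1 norm. A minimal numerical check, with illustrative values:

```python
import numpy as np

def gamma(beta, lam):
    # Assumed form of Gamma: 1/2 * sum_i (beta_i^2 / lambda_i + lambda_i).
    # Each summand is minimized at lambda_i = |beta_i|, where it equals |beta_i|.
    beta, lam = np.asarray(beta, float), np.asarray(lam, float)
    return 0.5 * np.sum(beta ** 2 / lam + lam)

beta = np.array([1.5, -0.2, 0.4, 3.0])
lam_star = np.abs(beta)              # unconstrained minimizer over lambda > 0
print(gamma(beta, lam_star))         # 5.1
print(np.sum(np.abs(beta)))          # 5.1 -> with Lambda the whole positive orthant, Omega = l1 norm
```

Restricting λ to a smaller set Λ can only increase this infimum, matching the statement above that Ω(β|Λ) equals the ℓ1 norm when |β| ∈ Λ and exceeds it otherwise.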
25 Note that Γ is convex on its domain because each of its summands is likewise a convex function. [sent-60, score-0.156]
26 Hence, when the set Λ is convex it follows that Ω(·|Λ) is a convex function and (2. [sent-61, score-0.156]
27 The next proposition provides a justification of the penalty function as a means to incorporate structured sparsity and establishes circumstances under which the penalty function is a norm. [sent-73, score-0.824]
28 Moreover, if Λ is a nonempty convex cone then the function Ω(·|Λ) is a norm and we have that Ω(β|Λ) ≤ ω∥β∥1, where ω := max{Ω(ek|Λ) : k ∈ Nn} and {ek : k ∈ Nn} is the canonical basis of Rn. [sent-104, score-0.224]
29 To prove the second part, observe that if Λ is a nonempty convex cone, namely, for any λ ∈ Λ and t ≥ 0 it holds that tλ ∈ Λ, we have that Ω is positive homogeneous. [sent-112, score-0.159]
30 Indeed, any permutation of the coordinates of a vector β with the above property will incur the same or a larger value of the penalty term. [sent-119, score-0.324]
31 Moreover, for certain choices of the set Λ, some of which we describe below, the penalty function will encourage vectors which not only are sparse but also have sparsity patterns (1{|βi |>0} : i ∈ Nn ) ∈ Λ, where 1{·} denotes the indicator function. [sent-120, score-0.511]
32 We end this section by noting that a normalized version of the group Lasso penalty [16] is included in our setting as a special case. [sent-121, score-0.383]
33 If {Jℓ : ℓ ∈ Nk}, k ∈ Nn, form a partition of the index set Nn, the corresponding group Lasso penalty is defined as ΩGL(β) = Σℓ∈Nk √|Jℓ| ∥βJℓ∥2, where, for every J ⊆ Nn, we use the notation βJ = (βj : j ∈ J). [sent-122, score-0.525]
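A direct transcription of this normalized group Lasso penalty, with the √|Jℓ| weights as written above; the example vector and grouping are arbitrary:

```python
import numpy as np

def group_lasso_penalty(beta, groups):
    """Omega_GL(beta) = sum_l sqrt(|J_l|) * ||beta_{J_l}||_2 for a partition `groups` of the indices."""
    beta = np.asarray(beta, float)
    return sum(np.sqrt(len(J)) * np.linalg.norm(beta[list(J)]) for J in groups)

beta = np.array([2.0, -1.0, 0.0, 0.0, 3.0, 0.5])
print(group_lasso_penalty(beta, groups=[[0, 1], [2, 3], [4, 5]]))   # about 7.46
```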
34 3 Examples of the penalty function We proceed to discuss some examples of the set Λ ⊆ Rn++ which may be used in the design of the penalty function Ω(·|Λ). [sent-124, score-0.55]
35 The first example corresponds to the prior knowledge that the magnitude of the components of the regression vector should be in some prescribed intervals. [sent-128, score-0.227]
36 The theorem below establishes the form of the box penalty; see also [8, 14] for related penalty functions. [sent-132, score-0.439]
37 The box penalty takes the form Ω(β|B[a, b]) = Σi∈Nn ( |βi| + (ai − |βi|)₊²/(2ai) + (|βi| − bi)₊²/(2bi) ). Moreover, the components of the vector λ(β) := argmin{Γ(β, λ) : λ ∈ B[a, b]} are given by the equations λi(β) = |βi| + (ai − |βi|)₊ − (|βi| − bi)₊, i ∈ Nn. [sent-137, score-0.168]
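In code, the minimizer λ(β) over the box is simply a clipping of |β| to [a, b]; the sketch below also evaluates the (assumed) quadratic form of Γ at this point to obtain the box penalty. The values of a, b, and β are illustrative:

```python
import numpy as np

def lambda_box(beta, a, b):
    """Minimizer over the box B[a, b]: lambda_i = |beta_i| + (a_i - |beta_i|)_+ - (|beta_i| - b_i)_+,
    i.e. |beta_i| clipped to the interval [a_i, b_i]."""
    return np.clip(np.abs(beta), a, b)

beta = np.array([0.05, -0.8, 2.5])
a = np.full(3, 0.1)
b = np.full(3, 1.0)
lam = lambda_box(beta, a, b)                     # -> [0.1, 0.8, 1.0]
# Box penalty as Gamma(beta, lambda(beta)), using the assumed quadratic form of Gamma:
omega_box = 0.5 * np.sum(beta ** 2 / lam + lam)
print(lam, omega_box)
```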
38 Thus, the box penalty will favor sparsity only for a = 0, a case that is defined by a limiting argument. [sent-151, score-0.48]
39 We define the wedge as W = {λ : λ ∈ Rn++, λj ≥ λj+1, j ∈ Nn−1}. [sent-155, score-0.506]
40 We say that a partition J = {Jℓ : ℓ ∈ Nk} of Nn is contiguous if for all i ∈ Jℓ, j ∈ Jℓ+1, ℓ ∈ Nk−1, it holds that i < j. [sent-156, score-0.259]
41 For every β ∈ (R\{0})n there is a unique contiguous partition J = {Jℓ : ℓ ∈ Nk} of Nn, k ∈ Nn, such that Ω(β|W) = Σℓ∈Nk √|Jℓ| ∥βJℓ∥2. [sent-160, score-0.263]
42 The partition J appearing in the theorem is determined by the set of inequalities λj ≥ λj+1 which hold with equality at the minimum. [sent-165, score-0.203]
43 These conditions indicate a strategy to compute the partition associated with a vector β. [sent-171, score-0.179]
44 An interesting property of the wedge penalty is that it has the form of a group Lasso penalty (see the discussion at the end of Section 2) with groups not fixed a priori but depending on the location of the vector β. [sent-173, score-0.749]
45 The groups are the elements of the partition J and are identified by certain convex constraints on the vector β. [sent-174, score-0.361]
46 The next example is an extension of the wedge set which is inspired by previous work on the group Lasso estimator with hierarchically overlapping groups [17]. [sent-177, score-0.628]
47 Within this context, the wedge corresponds to the set associated with a line graph. [sent-179, score-0.506]
48 Next, we note that the wedge may equivalently be expressed by the constraint that the difference vector D1(λ) := (λj+1 − λj : j ∈ Nn−1) is componentwise less than or equal to zero. [sent-186, score-0.577]
49 The corresponding penalty Ω(·|W k) encourages vectors whose sparsity pattern is concentrated on at most k different contiguous regions. [sent-191, score-0.523]
50 The case k = 1 essentially corresponds to the wedge, while the case k = 2 includes vectors which have a convex “profile” and whose sparsity pattern is concentrated either on the first elements of the vector, on the last, or on both. [sent-192, score-0.268]
51 We end this section by discussing a useful construction which may be applied to generate new penalty functions from available ones. [sent-193, score-0.303]
52 It is obtained by composing a set Θ ⊆ Rk++ with a linear transformation modeling the sum of the components of a vector across the elements of a prescribed partition {Pℓ : ℓ ∈ Nk} of Nn. [sent-194, score-0.298]
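A possible reading of this construction, sketched below with hypothetical names: the feasible set is Λ = {λ ∈ Rn++ : Aλ ∈ Θ}, where A is the 0/1 matrix that sums λ over each Pℓ, and Θ could for instance be the wedge in Rk (the "composite wedge" used later). The partition and values here are illustrative:

```python
import numpy as np

def group_sum_matrix(partition, n):
    """0/1 matrix A with A[l, i] = 1 iff i is in P_l, so (A @ lam)_l = sum of lam over P_l."""
    A = np.zeros((len(partition), n))
    for l, P in enumerate(partition):
        A[l, list(P)] = 1.0
    return A

partition = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]     # hypothetical sets P_1, P_2, P_3
A = group_sum_matrix(partition, n=9)
lam = np.array([3.0, 2.0, 1.0, 1.5, 1.0, 0.5, 0.4, 0.3, 0.2])
group_sums = A @ lam
print(group_sums, bool(np.all(np.diff(group_sums) <= 0)))   # sums 6, 3, 0.9 are nonincreasing
```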
53 We use this construction in the composite wedge experiments in Section 5. [sent-196, score-0.546]
54 Since the penalty function Ω(·|Λ) is constructed as the infimum of a family of quadratic regularizers, the optimization problem (2. [sent-199, score-0.308]
55 For a fixed β, the minimization over λ ∈ Λ requires computing the penalty function (2. [sent-202, score-0.275]
56 Figure 2 (iterative algorithm to compute the wedge penalty, producing the sets J1, . . . , Jk): k ← 0; for t = 1 to n do Jk+1 ← {t}; k ← k + 1; while k > 1 and ∥βJk−1∥2/√|Jk−1| ≤ ∥βJk∥2/√|Jk| do Jk−1 ← Jk−1 ∪ Jk; k ← k − 1 end end. γ(ǫ) := argmin{∥y − Xβ∥² + 2ρΩ(φǫ(β)|Λ) : β ∈ Rn}. [sent-219, score-0.891]
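A Python transcription of the Figure 2 procedure as reconstructed above: it keeps a stack of contiguous groups and merges the last two whenever their normalized norms are out of order. Variable names and the test vector are mine, not the authors':

```python
import numpy as np

def wedge_partition(beta):
    """Stack-based merging from Figure 2: returns contiguous groups J_1, ..., J_k whose
    normalized norms ||beta_J||_2 / sqrt(|J|) end up strictly decreasing across groups."""
    beta = np.asarray(beta, float)
    groups = []                                        # stack of contiguous index lists
    for t in range(len(beta)):
        groups.append([t])                             # J_{k+1} <- {t}
        while len(groups) > 1 and (
            np.linalg.norm(beta[groups[-2]]) / np.sqrt(len(groups[-2]))
            <= np.linalg.norm(beta[groups[-1]]) / np.sqrt(len(groups[-1]))
        ):
            last = groups.pop()
            groups[-1].extend(last)                    # J_{k-1} <- J_{k-1} U J_k
    return groups

def wedge_penalty(beta):
    beta = np.asarray(beta, float)
    return sum(np.sqrt(len(J)) * np.linalg.norm(beta[J]) for J in wedge_partition(beta))

beta = np.array([0.5, 3.0, 2.0, 0.1, 0.1])
print(wedge_partition(beta))     # [[0, 1], [2], [3, 4]]
print(wedge_penalty(beta))       # sqrt(2)*||(0.5, 3)|| + ||(2,)|| + sqrt(2)*||(0.1, 0.1)||
```

On each group Jℓ of the resulting partition, the vector λ(β) takes the value ∥βJℓ∥2/√|Jℓ|, consistent with the closed form quoted later in the appendix sentences.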
57 The optimization problem defining the penalty function Ω(·|Λ) may be reformulated as a second-order cone program (SOCP). [sent-224, score-0.336]
58 However, in special cases the computation of the penalty function may be significantly facilitated by using the analytical formulas derived in Section 3. [sent-231, score-0.298]
59 Here, for simplicity we describe how to do this in the case of the wedge penalty. [sent-232, score-0.506]
60 For this purpose we say that a vector β ∈ Rn is admissible if, for every k ∈ Nn, it holds that ∥βNk∥2/√k ≤ ∥β∥2/√n. [sent-233, score-0.177]
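A small helper for the admissibility test just defined (a floating-point tolerance is added; the test vectors are illustrative):

```python
import numpy as np

def is_admissible(beta, tol=1e-12):
    """beta is admissible if ||beta_{N_k}||_2 / sqrt(k) <= ||beta||_2 / sqrt(n) for every k <= n."""
    beta = np.asarray(beta, float)
    n = len(beta)
    total = np.linalg.norm(beta) / np.sqrt(n)
    prefix = np.sqrt(np.cumsum(beta ** 2) / np.arange(1, n + 1))
    return bool(np.all(prefix <= total + tol))

print(is_admissible([1.0, 1.0, 1.0]))   # True: every prefix has the same normalized norm
print(is_admissible([3.0, 0.1, 0.1]))   # False: the first component is too large
```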
61 The iterative algorithm presented in Figure 2 can be used to find the partition J = {Jℓ : ℓ ∈ Nk} and, hence, the vector λ(β) described in Theorem 3. [sent-238, score-0.179]
62 The latter is the only case that requires further action, so the algorithm merges the last two sets and repeats until the sets in the partition are fully ordered. [sent-244, score-0.219]
63 This ensures that after each step t the current partition satisfies the conditions (3. [sent-246, score-0.155]
64 Moreover, the while loop ensures that after each step the current partition satisfies, for every ℓ ∈ Nk−1, the constraints ∥βJℓ∥2/√|Jℓ| > ∥βJℓ+1∥2/√|Jℓ+1|. [sent-248, score-0.222]
65 Since we consider the noiseless case, we solve the interpolation problem min{Ω(β) : y = Xβ}, for different choices of the penalty function Ω. [sent-261, score-0.3]
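A sketch of this interpolation experiment for the wedge penalty using cvxpy, written jointly in (β, λ) through the assumed variational form of Γ; the problem sizes, true vector, and small lower bound on λ are illustrative choices, not the authors' setup:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
m, n = 15, 40
X = rng.standard_normal((m, n))
beta_true = np.zeros(n)
beta_true[:5] = [5.0, 4.0, 3.0, 2.0, 1.0]       # sparse with nonincreasing magnitudes
y = X @ beta_true                                # noiseless observations

beta = cp.Variable(n)
lam = cp.Variable(n)
# Gamma(beta, lambda) = 1/2 * sum_i (beta_i^2 / lambda_i + lambda_i), assumed form
gamma = 0.5 * (cp.sum([cp.quad_over_lin(beta[i], lam[i]) for i in range(n)]) + cp.sum(lam))
constraints = [y == X @ beta,
               lam >= 1e-6,                      # keep lambda strictly positive
               lam[:-1] >= lam[1:]]              # wedge constraint: lambda_j >= lambda_{j+1}
cp.Problem(cp.Minimize(gamma), constraints).solve()
print(np.round(beta.value[:8], 2))               # recovered coefficients; compare with beta_true
```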
66 In the following, we discuss a series of experiments corresponding to different choices for the model vector β∗ and its sparsity pattern. [sent-270, score-0.235]
67 In the second experiment, we consider a regression vector whose components are nonincreasing in absolute value and of which only a few are nonzero. [sent-287, score-0.192]
68 We compare the Lasso, which makes no use of such ordering information, with the wedge penalty Ω(β|W ) (see Example 3. [sent-289, score-0.781]
69 We further tested the robustness of the methods by adding two additional nonzero components with value 10 to the vector β∗ at random positions between 20 and 100. [sent-298, score-0.165]
70 Next we consider a more complex experiment, where the regression vector is sparse within different contiguous regions P1 , . [sent-301, score-0.219]
71 This method constrains the sums over the sets to be nonincreasing and may be interpreted as the composition of the wedge set with an averaging operation across the sets Pi; see the discussion at the end of Section 3. [sent-310, score-0.672]
72 Figure caption: penalty Ω(·|W 2) (left) and Ω(·|W 3) (right); see text for more information. [sent-313, score-0.275]
73 A problem with these approaches is that the ℓ2 norm is applied at the level of the individual sets Pi, which does not promote sparsity within these sets. [sent-316, score-0.182]
74 To counter this effect we can enforce contiguous nonzero patterns within each of the Pi , as proposed by [10]. [sent-317, score-0.16]
75 That is, we consider as the groups the sets formed by all sequences of q ∈ N9 consecutive elements at the beginning or at the end of each of the sets Pi , for a total of 180 groups. [sent-318, score-0.19]
76 To show the flexibility of our framework, we consider two further examples of sparse regression vectors with additional structured properties. [sent-324, score-0.195]
77 In the first example, most of the components of this vector are zero, but the first and the last few elements follow a discrete convex trend. [sent-325, score-0.235]
78 In this case, we expect the penalty function Ω(β|W 2) to outperform the Lasso, because it favors vectors with a convex shape. [sent-330, score-0.396]
79 Results are shown in Figure 3-e, where this penalty is named “W-2”. [sent-331, score-0.275]
80 In the absence of other specific methods to impose this convex shape constraint, and motivated by the fact that the first few components decrease, we compare it with two methods that favor a learned vector that is decreasing: the wedge and the group Lasso with Jk = {k, . . [sent-332, score-0.28]
81 The resulting vector starts with 5 nonzero components and then has a bump of another 15 elements. [sent-343, score-0.165]
82 We use our method with the penalty Ω(β|W 3 ), which is referred to as “W-3” in the Figure. [sent-344, score-0.307]
83 Conclusion We proposed a family of penalty functions that can be used to model structured sparsity in linear regression. [sent-349, score-0.495]
84 We provided theoretical, algorithmic and computational information about this new class of penalty functions. [sent-350, score-0.275]
85 An important feature of our approach is that it can deal with richer model structures than current approaches while maintaining convexity of the penalty function. [sent-352, score-0.308]
86 Our practical experience indicates that these penalties perform well numerically, improving over state-of-the-art penalty methods for structured sparsity, suggesting that our framework is promising for applications. [sent-353, score-0.275]
87 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-426, score-0.509]
88 We associate with the subset S a contiguous partition of Nn, given by J(S) = {Jℓ : ℓ ∈ Nk}, where we define Jℓ := {jℓ−1 + 1, . . . , jℓ} and set j0 = 0 and jk = n. [sent-475, score-0.333]
89 A subset S of Nn−1 also induces two regions in Rn which play a central role in the identification of the wedge penalty. [sent-476, score-0.506]
90 First, we describe the region which “crosses over” the induced partition J (S). [sent-477, score-0.169]
91 In other words, β ∈ OS if the average of the squares of its components within each region Jℓ strictly decreases with ℓ. [sent-480, score-0.165]
92 The next region which is essential in our analysis is the “stays within” region, induced by the partition J (S). [sent-481, score-0.192]
93 In other words, all vectors β within this region have the property that, for every set Jℓ ∈ J(S), the average of the squares of the components of β over any initial segment of this set is not greater than the average over all of Jℓ. [sent-484, score-0.195]
94 The collection of sets U := {US : S ⊆ Nn−1 } forms a partition of (R\{0})n . [sent-490, score-0.16]
95 Moreover, the components of the vector λ(β) := argmin{Γ(β, λ) : λ ∈ W} are given by the equations λj(β) = µℓ, j ∈ Jℓ, ℓ ∈ Nk, where µℓ = ∥βJℓ∥2/√|Jℓ|. [sent-493, score-0.168]
96 The optimality conditions state that λ ∈ Rn++ is a solution to the minimum problem determined by the wedge penalty if and only if there exists a vector α = (αi : i ∈ Nn−1) with nonnegative components such that −βj²/λj² + 1 + αj−1 − αj = 0, j ∈ Nn (A. [sent-507, score-0.628]
97 In fact, all components of the vector λ have a common value, say µ > 0, and by summing both sides of equation (A. [sent-519, score-0.242]
98 Moreover, summing both sides of the same equation over j ∈ Nq we obtain that αq = −Σj∈Nq βj²/µ² + q and, since αq ≥ 0, we conclude that β ∈ IS = US. [sent-521, score-0.179]
99 That is to say, conversely, for any S ⊆ Nn−1 and β ∈ US we conclude that the vectors α and λ defined above solve the equations (A. [sent-537, score-0.17]
100 Since the wedge penalty function is strictly convex we know that equations (A. [sent-540, score-0.958]
wordName wordTfidf (topN-words)
[('wedge', 0.506), ('nn', 0.451), ('lasso', 0.309), ('penalty', 0.275), ('nk', 0.224), ('rn', 0.184), ('partition', 0.13), ('sparsity', 0.112), ('jk', 0.11), ('gl', 0.107), ('contiguous', 0.093), ('box', 0.093), ('group', 0.08), ('convex', 0.078), ('os', 0.075), ('structured', 0.075), ('components', 0.073), ('pi', 0.066), ('cone', 0.061), ('prescribed', 0.06), ('alternating', 0.06), ('conclude', 0.059), ('inequality', 0.059), ('argmin', 0.054), ('strictly', 0.053), ('admissible', 0.052), ('nonincreasing', 0.051), ('subsequence', 0.051), ('vector', 0.049), ('sides', 0.048), ('moreover', 0.047), ('equations', 0.046), ('nonempty', 0.045), ('regression', 0.045), ('nonzero', 0.043), ('vectors', 0.043), ('groups', 0.042), ('lin', 0.042), ('summing', 0.042), ('gower', 0.042), ('theorem', 0.041), ('every', 0.04), ('norm', 0.04), ('composite', 0.04), ('region', 0.039), ('london', 0.038), ('micchelli', 0.038), ('qj', 0.036), ('namely', 0.036), ('holds', 0.036), ('elements', 0.035), ('claim', 0.034), ('convexity', 0.033), ('family', 0.033), ('ti', 0.033), ('decreasing', 0.033), ('sparse', 0.032), ('referred', 0.032), ('inequalities', 0.032), ('squares', 0.031), ('proposition', 0.031), ('establish', 0.031), ('sets', 0.03), ('ai', 0.03), ('equation', 0.03), ('establishes', 0.03), ('dantzig', 0.029), ('merges', 0.029), ('offered', 0.029), ('continuously', 0.028), ('bi', 0.028), ('england', 0.028), ('end', 0.028), ('kong', 0.027), ('constraints', 0.027), ('formula', 0.026), ('simulations', 0.026), ('pontil', 0.026), ('ek', 0.026), ('nq', 0.026), ('proof', 0.026), ('choose', 0.026), ('step', 0.025), ('choices', 0.025), ('argyriou', 0.025), ('circumstances', 0.025), ('hong', 0.025), ('merge', 0.025), ('formed', 0.025), ('implies', 0.024), ('patterns', 0.024), ('hierarchical', 0.024), ('shall', 0.024), ('lim', 0.023), ('formulas', 0.023), ('essential', 0.023), ('absolute', 0.023), ('convergent', 0.023), ('constraint', 0.022), ('conversely', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
2 0.1856297 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
Author: Hongbo Zhou, Qiang Cheng
Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1
3 0.16363697 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
Author: Sofia Mosci, Silvia Villa, Alessandro Verri, Lorenzo Rosasco
Abstract: We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in [12], where the group lasso penalty is generalized to overlapping groups of variables. While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows to deal with high dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations. 1
4 0.15099612 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
5 0.14563917 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.14497085 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
7 0.13829274 5 nips-2010-A Dirty Model for Multi-task Learning
8 0.10761323 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
9 0.10442613 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
10 0.093194373 172 nips-2010-Multi-Stage Dantzig Selector
11 0.092413746 258 nips-2010-Structured sparsity-inducing norms through submodular functions
12 0.091165416 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
13 0.08819183 181 nips-2010-Network Flow Algorithms for Structured Sparsity
14 0.086362042 155 nips-2010-Learning the context of a category
15 0.085767739 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
16 0.083931126 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
17 0.082391769 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
18 0.081941657 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
19 0.079754494 148 nips-2010-Learning Networks of Stochastic Differential Equations
20 0.077775456 45 nips-2010-CUR from a Sparse Optimization Viewpoint
topicId topicWeight
[(0, 0.204), (1, 0.05), (2, 0.122), (3, 0.097), (4, 0.079), (5, -0.148), (6, -0.016), (7, 0.148), (8, -0.068), (9, -0.021), (10, 0.052), (11, 0.157), (12, -0.135), (13, 0.079), (14, 0.014), (15, -0.074), (16, -0.003), (17, 0.132), (18, 0.07), (19, -0.084), (20, -0.02), (21, -0.03), (22, 0.065), (23, 0.079), (24, -0.005), (25, -0.027), (26, -0.06), (27, -0.041), (28, -0.033), (29, -0.055), (30, 0.027), (31, -0.1), (32, 0.045), (33, 0.05), (34, 0.001), (35, 0.019), (36, 0.028), (37, 0.042), (38, 0.084), (39, 0.002), (40, -0.003), (41, 0.065), (42, -0.045), (43, 0.084), (44, 0.058), (45, 0.023), (46, 0.097), (47, 0.105), (48, 0.037), (49, -0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.9550758 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
2 0.85083532 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
Author: Seunghak Lee, Jun Zhu, Eric P. Xing
Abstract: To understand the relationship between genomic variations among population and complex diseases, it is essential to detect eQTLs which are associated with phenotypic effects. However, detecting eQTLs remains a challenge due to complex underlying mechanisms and the very large number of genetic loci involved compared to the number of samples. Thus, to address the problem, it is desirable to take advantage of the structure of the data and prior information about genomic locations such as conservation scores and transcription factor binding sites. In this paper, we propose a novel regularized regression approach for detecting eQTLs which takes into account related traits simultaneously while incorporating many regulatory features. We first present a Bayesian network for a multi-task learning problem that includes priors on SNPs, making it possible to estimate the significance of each covariate adaptively. Then we find the maximum a posteriori (MAP) estimation of regression coefficients and estimate weights of covariates jointly. This optimization procedure is efficient since it can be achieved by using a projected gradient descent and a coordinate descent procedure iteratively. Experimental results on simulated and real yeast datasets confirm that our model outperforms previous methods for finding eQTLs.
3 0.84590787 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
Author: Hongbo Zhou, Qiang Cheng
Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1
4 0.77992493 5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1 Introduction: Motivation and Setup High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here, 1 multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient. Block-sparse Multiple Regression. A common multiple task learning setting, and which is the focus of this paper, is that of multiple regression, where we have r > 1 response variables, and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse; and there is a large overlap of these relevant features across the different regression problems. 
Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . 
The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M ) (k) and its support by Supp(M ). Also, for any matrix M , let M 1,1 := j,k |Mj |, i.e. the sums of absolute values of the elements, and M 1,∞ := j 2 Mj ∞ where, Mj ∞ (k) := maxk |Mj |. 2 Problem Set-up and Our Method Multiple regression. We consider the following standard multiple linear regression model: ¯ y (k) = X (k) θ(k) + w(k) , k = 1, . . . , r, where y (k) ∈ Rn is the response for the k-th task, regressed on the design matrix X (k) ∈ Rn×p (possibly different across tasks), while w(k) ∈ Rn is the noise vector. We assume each w(k) is drawn independently from N (0, σ 2 ). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, ¯ we collate these quantities into matrices Y ∈ Rn×r for the responses, Θ ∈ Rp×r for the regression n×r parameters and W ∈ R for the noise. ¯ Dirty Model. In this paper we are interested in estimating the true parameter Θ from data by lever¯ aging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ would have many non-zero entries, corresponding to features shared by several tasks (“shared” rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all (“non-shared rows”), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees: Support recovery: We say an estimator Θ successfully recovers the true signed support if ¯ sign(Supp(Θ)) = sign(Supp(Θ)). We are interested in deriving sufficient conditions under which ¯ the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ, which is union of its supports for the different tasks. In particular, denoting Uk for the support of the ¯ k-th column of Θ, and U = k Uk . Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ, ¯ Θ−Θ 2.1 ∞ = max max j=1,...,p k=1,...,r (k) Θj (k) ¯ − Θj . Our Method Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding “clean” models would either just use blocksparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing to have both block-sparse and elementwise sparse component, we are ¯ able to outperform both classes of these “clean models”, for all regimes Θ. Algorithm 1 Dirty Block Sparse Solve the following convex optimization problem: (S, B) ∈ arg min S,B 1 2n r k=1 y (k) − X (k) S (k) + B (k) 2 2 + λs S 1,1 + λb B 1,∞ . (1) Then output Θ = B + S. 3 Main Results and Their Consequences We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1 /ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. 
In Theorem 1, we consider the case of deterministic design matrices X (k) , and provide sufficient conditions guaranteeing signed support recovery, and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the 3 rows of the design matrices are random from a general zero mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1 /ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two-tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α. Comparisons. Negahban and Wainwright [8] show that there is actually a “phase transition” in the scaling of the probability of successful signed support-recovery with the number of observations. n Denote a particular rescaling of the sample-size θLasso (n, p, α) = s log(p−s) . Then as Wainwright [18] show, when the rescaled number of samples scales as θLasso > 2 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θLasso < 2−δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1 /ℓ∞ -reguralized multiple linear regression, define a similar rescaled sample size n θ1,∞ (n, p, α) = s log(p−(2−α)s) . Then as Negahban and Wainwright [8] show there is again a transition in probability of success from near zero to near one, at the rescaled sample size of θ1,∞ = (4 − 3α). Thus, for α < 2/3 (“less sharing”) Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 (“more sharing”) the ℓ1 /ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ1,∞ = (2 − α), which is strictly before either the Lasso or the ℓ1 /ℓ∞ regularized method except for the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and for α = 1, i.e. full sharing, where we match ℓ1 /ℓ∞ . Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details in the experiments Section 4). 3.1 Sufficient Conditions for Deterministic Designs We first consider the case where the design matrices X (k) for k = 1, · · ·, r are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X (k) ’s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10]. (k) A0 Column Normalization Xj 2 ≤ √ 2n for all j = 1, . . . , p, k = 1, . . . , r. ¯ Let Uk denote the support of the k-th column of Θ, and U = supports for each task. 
Then we require that k r A1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Xj , XUk (k) (k) XUk , XUk Uk denote the union of −1 c We will also find it useful to define γs := 1−max1≤k≤r maxj∈Uk (k) > 0. 1 k=1 (k) Xj , XUk Note that by the incoherence condition A1, we have γs > 0. A2 Eigenvalue Condition Cmin := min λmin 1≤k≤r A3 Boundedness Condition Dmax := max 1≤k≤r 1 (k) (k) XUk , XUk n 1 (k) (k) XUk , XUk n (k) (k) XUk , XUk −1 . 1 > 0. −1 ∞,1 < ∞. Further, we require the regularization penalties be set as λs > 2(2 − γs )σ log(pr) √ γs n and 4 λb > 2(2 − γb )σ log(pr) √ . γb n (2) 1 0.9 0.8 0.8 Dirty Model L1/Linf Reguralizer Probability of Success Probability of Success 1 0.9 0.7 0.6 0.5 0.4 LASSO 0.3 0.2 0 0.5 1 1.5 1.7 2 2.5 Control Parameter θ 3 3.1 3.5 0.6 0.5 0.4 L1/Linf Reguralizer 0.3 LASSO 0.2 p=128 p=256 p=512 0.1 Dirty Model 0.7 p=128 p=256 p=512 0.1 0 0.5 4 1 1.333 (a) α = 0.3 1.5 2 Control Parameter θ (b) α = 2.5 3 2 3 1 0.9 Dirty Model Probability of Success 0.8 0.7 L1/Linf Reguralizer 0.6 0.5 LASSO 0.4 0.3 0.2 p=128 p=256 p=512 0.1 0 0.5 1 1.2 1.5 1.6 2 Control Parameter θ 2.5 (c) α = 0.8 Figure 1: Probability of success in recovering the true signed support using dirty model, Lasso and ℓ1 /ℓ∞ regularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction α is plotted. As we can see in the regimes that Lasso is better than, as good as and worse than ℓ1 /ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires less number of observations for successful recovery of the true signed support compared to Lasso and ℓ1 /ℓ∞ regularizer. Here p s = ⌊ 10 ⌋ always. Theorem 1. Suppose A0-A3 hold, and that we obtain estimate Θ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c1 exp(−c2 n) → 1, we are guaranteed that the convex program (1) has a unique optimum and (a) The estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ 4σ 2 log (pr) + λs Dmax . n Cmin ≤ bmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that min ¯ (j,k)∈Supp(Θ) ¯(k) θj > bmin . Here the positive constants c1 , c2 depend only on γs , γb , λs , λb and σ, but are otherwise independent of n, p, r, the problem dimensions of interest. Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If in addition, we require that it have no false exclusions and that recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise. 3.2 General Gaussian Designs Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task (k) k = 1, . . . , r the design matrix X (k) ∈ Rn×p is such that each row Xi ∈ Rp is a zero-mean Gaussian random vector with covariance matrix Σ(k) ∈ Rp×p , and is independent of every other (k) row. Let ΣV,U ∈ R|V|×|U | be the submatrix of Σ(k) with rows corresponding to V and columns to U . We require these covariance matrices to satisfy the following conditions: r C1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Σj,Uk , ΣUk ,Uk k=1 5 −1 >0 1 C2 Eigenvalue Condition Cmin := min λmin Σ(k),Uk Uk > 0 so that the minimum eigenvalue 1≤k≤r is bounded away from zero. C3 Boundedness Condition Dmax := (k) ΣUk ,Uk −1 ∞,1 < ∞. 
These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrix of the (randomly generated) rows of the design matrix. Further, defining s := maxk |Uk |, we require the regularization penalties be set as 1/2 λs > 1/2 4σ 2 Cmin log(pr) √ γs nCmin − 2s log(pr) and λb > 4σ 2 Cmin r(r log(2) + log(p)) . √ γb nCmin − 2sr(r log(2) + log(p)) (3) Theorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scale as n > max 2s log(pr) 2sr r log(2)+log(p) 2 2 Cmin γs , Cmin γb . Suppose we obtain estimate Θ from algorithm (3). Then, with probability at least 1 − c1 exp (−c2 (r log(2) + log(p))) − c3 exp(−c4 log(rs)) → 1 for some positive numbers c1 − c4 , we are guaranteed that the algorithm estimate Θ is unique and satisfies the following conditions: (a) the estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ ≤ 50σ 2 log(rs) + λs nCmin 4s √ + Dmax . Cmin n gmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that 3.3 min ¯ (j,k)∈Supp(Θ) ¯(k) θj > gmin . Sharp Transition for 2-Task Gaussian Designs This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ), so that C1 − C3 hold, with Cmin = Dmax = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1 /ℓ∞ -block-regularization over for all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso) and full support sharing (where it matches that of ℓ1 /ℓ∞ ). We now present our analytical results; the empirical comparisons are presented next in Section 4. The results will be in terms of a particular rescaling of the sample size n as θ(n, p, s, α) := n . (2 − α)s log (p − (2 − α)s) We will also require the assumptions that 4σ 2 (1 − F1 λs > F2 λb > s/n)(log(r) + log(p − (2 − α)s)) 1/2 (n)1/2 − (s)1/2 − ((2 − α) s (log(r) + log(p − (2 − α)s)))1/2 4σ 2 (1 − s/n)r(r log(2) + log(p − (2 − α)s)) , 1/2 (n)1/2 − (s)1/2 − ((1 − α/2) sr (r log(2) + log(p − (2 − α)s)))1/2 . Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ). 6 Suppose maxj∈B∗ ∗(1) Θj − ∗(2) Θj = o(λs ), where B ∗ is the submatrix of Θ∗ with rows where both entries are non-zero. Then the estimate Θ of the problem (1) satisfies the following: (Success) Suppose the regularization coefficients satisfy F1 − F2. Further, assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c1 exp(−c2 n) for some positive numbers c1 and c2 , we are guaranteed that Θ satisfies the support-recovery and ℓ∞ error bound conditions (a-b) in Theorem 2. ˆ ˆ (Failure) If θ(n, p, s, α) < 1 there is no solution (B, S) for any choices of λs and λb such that ¯ sign Supp(Θ) = sign Supp(Θ) . We note that we require the gap ∗(1) Θj ∗(2) − Θj to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak. 4 Empirical Results In this section, we investigate the performance of our dirty block sparse estimator on synthetic and real-world data. 
The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1 /ℓ∞ regularizer. We see that Theorem 3 is very accurate indeed. Next, we apply our method to a real world datasets containing hand-written digits for classification. Again we compare against LASSO and the ℓ1 /ℓ∞ . (a multi-task regression dataset) with r = 2 tasks. In both of this real world dataset, we show that dirty model outperforms both LASSO and ℓ1 /ℓ∞ practically. For each method, the parameters are chosen via cross-validation; see supplemental material for more details. 4.1 Synthetic Data Simulation We consider a r = 2-task regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). The design matrices X have each entry being i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of (n, s, p, α), we generate 100 instances of the problem. In each instance, ¯ given p, s, α, the locations of the non-zero entries of the true Θ are chosen at randomly; each nonzero entry is then chosen to be i.i.d. Gaussian with mean 0 and variance 1. n samples are then generated from this. We then attempt to estimate using three methods: our dirty model, ℓ1 /ℓ∞ regularizer and LASSO. In each case, and for each instance, the penalty regularizer coefficients are found by cross validation. After solving the three problems, we compare the signed support of the solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe these process in more details in this section. Performance Analysis: We ran the algorithm for five different values of the overlap ratio α ∈ 2 {0.3, 3 , 0.8} with three different number of features p ∈ {128, 256, 512}. For any instance of the ˆ ¯ problem (n, p, s, α), if the recovered matrix Θ has the same sign support as the true Θ, then we count it as success, otherwise failure (even if one element has different sign, we count it as failure). As Theorem 3 predicts and Fig 3 shows, the right scaling for the number of oservations is n s log(p−(2−α)s) , where all curves stack on the top of each other at 2 − α. Also, the number of observations required by dirty model for true signed support recovery is always less than both LASSO and ℓ1 /ℓ∞ regularizer. Fig 1(a) shows the probability of success for the case α = 0.3 (when LASSO is better than ℓ1 /ℓ∞ regularizer) and that dirty model outperforms both methods. When α = 2 3 (see Fig 1(b)), LASSO and ℓ1 /ℓ∞ regularizer performs the same; but dirty model require almost 33% less observations for the same performance. As α grows toward 1, e.g. α = 0.8 as shown in Fig 1(c), ℓ1 /ℓ∞ performs better than LASSO. Still, dirty model performs better than both methods in this case as well. 7 4 p=128 p=256 p=512 Phase Transition Threshold 3.5 L1/Linf Regularizer 3 2.5 LASSO 2 Dirty Model 1.5 1 0 0.1 0.2 0.3 0.4 0.5 0.6 Shared Support Parameter α 0.7 0.8 0.9 1 Figure 2: Verification of the result of the Theorem 3 on the behavior of phase transition threshold by changing the parameter α in a 2-task (n, p, s, α) problem for dirty model, LASSO and ℓ1 /ℓ∞ regularizer. The y-axis p n is s log(p−(2−α)s) , where n is the number of samples at which threshold was observed. Here s = ⌊ 10 ⌋. Our dirty model method shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated. 
n    Metric                          Our Model              ℓ1/ℓ∞    LASSO
10   Average Classification Error    8.6%                   9.9%     10.8%
10   Variance of Error               0.53%                  0.64%    0.51%
10   Average Row Support Size        B: 165, B+S: 171       170      123
10   Average Support Size            S: 18,  B+S: 1651      1700     539
20   Average Classification Error    3.0%                   3.5%     4.1%
20   Variance of Error               0.56%                  0.62%    0.68%
20   Average Row Support Size        B: 211, B+S: 226       217      173
20   Average Support Size            S: 34,  B+S: 2118      2165     821
40   Average Classification Error    2.2%                   3.2%     2.8%
40   Variance of Error               0.57%                  0.68%    0.85%
40   Average Row Support Size        B: 270, B+S: 299       368      354
40   Average Support Size            S: 67,  B+S: 2761      3669     2053

Table 1: Handwriting classification results for our model, ℓ1/ℓ∞ and LASSO.

Scaling Verification: To verify that the phase transition threshold changes linearly with α, as predicted by Theorem 3, we plot the phase transition threshold versus α. For five different values of α ∈ {0.05, 0.3, 2/3, 0.8, 0.95} and three different values of p ∈ {128, 256, 512}, we find the phase transition threshold for the dirty model, LASSO, and the ℓ1/ℓ∞ regularizer. We take the point where the probability of success in recovery of the signed support exceeds 50% as the phase transition threshold, and we find this point by interpolation on the two closest points (see the sketch below). Fig. 2 shows that the phase transition threshold for the dirty model is always lower than the phase transition threshold for LASSO and the ℓ1/ℓ∞ regularizer.
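The 50%-crossing rule used above can be written in a few lines. The sketch below estimates the threshold by linear interpolation between the two closest grid points, as described; the success probabilities in the example are placeholder numbers of our own, not values from the experiments.

```python
import numpy as np

def phase_transition_threshold(n_grid, success_prob, level=0.5):
    """Smallest sample size at which the empirical success probability crosses `level`,
    estimated by linear interpolation between the two closest grid points."""
    n_grid = np.asarray(n_grid, dtype=float)
    success_prob = np.asarray(success_prob, dtype=float)
    above = np.flatnonzero(success_prob >= level)
    if above.size == 0:
        return np.nan                      # never crosses within the grid
    i = above[0]
    if i == 0:
        return n_grid[0]                   # already above the level at the smallest n
    n0, n1 = n_grid[i - 1], n_grid[i]
    p0, p1 = success_prob[i - 1], success_prob[i]
    return n0 + (level - p0) * (n1 - n0) / (p1 - p0)

# Placeholder success rates (fraction of 100 instances recovered) over a grid of sample sizes.
n_grid = [100, 150, 200, 250, 300]
success = [0.02, 0.18, 0.46, 0.81, 0.97]
n_star = phase_transition_threshold(n_grid, success)

# Rescale the threshold as on the y-axis of Figure 2: n* / (s * log(p - (2 - alpha) * s)),
# with s = floor(p / 10) as in the figure.
p, alpha = 256, 2 / 3
s = p // 10
print(n_star, n_star / (s * np.log(p - (2 - alpha) * s)))
```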
4.2 Handwritten Digits Dataset

We use the handwritten digit dataset [1], containing features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable benchmark for handwritten recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes n of the training set. We measure the classification error for each digit to obtain a 10-vector of errors, and we then report the average and the variance of this error vector to show how the error is distributed over the tasks. We compare our method with the ℓ1/ℓ∞ regularizer method and LASSO. Again, for all methods, the parameters are chosen via cross-validation. For our method we separate out the B and S matrices that it finds, so as to illustrate how many features it identifies as “shared” and how many as “non-shared”. For the other methods we simply report the raw row-support and support numbers, since they do not make such a separation.

Acknowledgements

We acknowledge support from NSF grant IIS-101842 and the NSF CAREER program, grant 0954059.

References

[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, special issue on “Sparse approximations in signal and image processing”, 86:572–602, 2006.
[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.
[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
5 0.72200823 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
Author: Sofia Mosci, Silvia Villa, Alessandro Verri, Lorenzo Rosasco
Abstract: We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in [12], where the group lasso penalty is generalized to overlapping groups of variables. While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows to deal with high dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations. 1
6 0.70454621 172 nips-2010-Multi-Stage Dantzig Selector
7 0.64733011 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
8 0.62740922 265 nips-2010-The LASSO risk: asymptotic results and real world examples
9 0.59278548 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
10 0.57988322 45 nips-2010-CUR from a Sparse Optimization Viewpoint
11 0.57109976 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.53400648 148 nips-2010-Learning Networks of Stochastic Differential Equations
13 0.48300299 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
14 0.43028486 258 nips-2010-Structured sparsity-inducing norms through submodular functions
15 0.42536169 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
16 0.42228907 181 nips-2010-Network Flow Algorithms for Structured Sparsity
17 0.42064676 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
18 0.41430891 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
19 0.39756253 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
20 0.38981369 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
topicId topicWeight
[(13, 0.087), (17, 0.036), (27, 0.052), (30, 0.08), (34, 0.087), (35, 0.077), (45, 0.202), (50, 0.063), (52, 0.101), (60, 0.036), (77, 0.055), (78, 0.015), (90, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.91070753 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
2 0.90063626 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
3 0.89821541 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model. 1
4 0.89063501 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
5 0.88610333 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
Author: Hongbo Zhou, Qiang Cheng
Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1
6 0.88496673 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
7 0.87853521 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
8 0.87715113 258 nips-2010-Structured sparsity-inducing norms through submodular functions
9 0.87627792 148 nips-2010-Learning Networks of Stochastic Differential Equations
10 0.87579155 5 nips-2010-A Dirty Model for Multi-task Learning
11 0.87043887 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
12 0.86872232 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
13 0.86558878 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
14 0.86511856 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
15 0.86443484 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
16 0.86356086 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
17 0.86264861 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
18 0.86224592 280 nips-2010-Unsupervised Kernel Dimension Reduction
19 0.86216742 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication