nips nips2008 nips2008-179 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations.
Reference: text
sentIndex sentText sentNum sentScore
1 Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization Sahand Negahban Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720-1770 sahand n@eecs. [sent-1, score-0.565]
2 edu Abstract Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. [sent-6, score-0.352]
3 This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. [sent-7, score-0.352]
4 We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. [sent-9, score-0.338]
5 These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. [sent-10, score-0.243]
6 More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. [sent-12, score-0.551]
7 An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). [sent-13, score-0.298]
8 In the absence of additional structure, it is well-known that many standard procedures—among them linear regression and principal component analysis—are not consistent unless the ratio p/n converges to zero. [sent-16, score-0.26]
9 1 This paper deals with high-dimensional scaling in the context of solving multiple regression problems, where the regression vectors are assumed to have shared sparse structure. [sent-18, score-0.547]
10 More specifically, suppose that we are given a collection of r different linear regression models in p dimensions, with regression vectors β i ∈ Rp , for i = 1, . [sent-19, score-0.352]
11 We let S(β i ) = {j | β^i_j ≠ 0} denote the support set of β i . [sent-23, score-0.164]
12 In many applications—among them sparse approximation, graphical model selection, and image reconstruction—it is natural to impose a sparsity constraint, corresponding to restricting the cardinality |S(β i )| of each support set. [sent-24, score-0.293]
13 Moreover, one might expect some amount of overlap between the sets S(β i ) and S(β j ) for indices i ≠ j since they correspond to the sets of active regression coefficients in each problem. [sent-25, score-0.4]
14 Given these structural conditions of shared sparsity in these and other applications, it is reasonable to consider how this common structure can be exploited so as to increase the statistical efficiency of estimation procedures. [sent-30, score-0.272]
15 In this paper, we study the high-dimensional scaling of block ℓ1/ℓ∞ regularization. [sent-31, score-0.325]
16 Our main contribution is to obtain some precise—and arguably surprising—insights into the benefits and dangers of using block ℓ1/ℓ∞ regularization, as compared to simpler ℓ1-regularization (separate Lasso for each regression problem). [sent-32, score-0.395]
17 We begin by providing a general set of sufficient conditions for consistent support recovery for both fixed design matrices, and random Gaussian design matrices. [sent-33, score-0.814]
18 (a) First, under what structural assumptions on the data does the use of ℓ1/ℓ∞ block-regularization provide a quantifiable reduction in the scaling of the sample size n, as a function of the problem dimension p and other structural parameters, required for consistency? [sent-35, score-0.277]
19 As a representative instance of our theory, consider the special case of standard Gaussian design matrices and two regression problems (r = 2), with the supports S(β 1 ) and S(β 2 ) each of size s and overlapping in a fraction α ∈ [0, 1] of their entries. [sent-40, score-0.585]
20 For this problem, we prove that block ℓ1/ℓ∞ regularization undergoes a phase transition in terms of the rescaled sample size θ1,∞ (n, p, s, α) := n/{(4 − 3α)s log(p − (2 − α)s)}. (1) [sent-41, score-0.642]
21 By comparison to previous theory on the behavior of the Lasso (ordinary ℓ1-regularized quadratic programming), the scaling (1) has two interesting implications. [sent-43, score-0.191]
22 On the other hand, our analysis also conveys a cautionary message: if the overlap is too small—more precisely, if α < 2/3—then block ℓ1,∞ is actually harmful relative to the naive Lasso-based approach. [sent-46, score-0.44]
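As a quick numerical check of which side of the threshold a given problem falls on, the two order parameters defined in (1) and in (19) below can be evaluated directly; the snippet is purely illustrative, and the parameter values are ours rather than the paper's experimental settings.

```python
import numpy as np

def theta_block(n, p, s, alpha):
    """Rescaled sample size for l1/linf block regularization, eq. (1)."""
    return n / ((4 - 3 * alpha) * s * np.log(p - (2 - alpha) * s))

def theta_lasso(n, p, s):
    """Rescaled sample size governing the naive Lasso approach, eq. (19)."""
    return n / (2 * s * np.log(p - s))

# Example: p = 512, s = 25, n = 400, complete overlap vs. disjoint supports.
for alpha in (1.0, 0.0):
    print(alpha, theta_block(400, 512, 25, alpha), theta_lasso(400, 512, 25))
# Success is predicted once the relevant order parameter exceeds 1 + delta.
```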
23 This fact illustrates that some care is required in the application of block regularization schemes. [sent-47, score-0.296]
24 2 Problem set-up We begin by setting up the problem to be studied in this paper, including multivariate regression and a family of block-regularized programs for estimating sparse vectors. [sent-51, score-0.33]
25 1 Multivariate regression In this problem, we consider the following form of multivariate regression. [sent-53, score-0.212]
26 For each i = 1, . . . , r, let β i ∈ Rp be a regression vector, and consider the family of linear observation models y i = X i β i + wi , i = 1, 2, . . . , r. [sent-57, score-0.255]
27 (3) Here each X i ∈ Rn×p is a design matrix, possibly different for each vector β i , and wi ∈ Rn is a noise vector. [sent-61, score-0.144]
28 We assume that the noise vectors wi and wj are independent for different regression problems i ≠ j. [sent-62, score-0.217]
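A minimal sketch of how a pair of such coupled observation models can be simulated with standard Gaussian designs; the function name, the ±0.5 coefficient values, and the noise level are illustrative choices of ours, not the paper's experimental settings.

```python
import numpy as np

def make_joint_regression(n, p, s, alpha, sigma=0.5, rng=None):
    """Simulate r = 2 models y^i = X^i beta^i + w^i whose supports,
    each of size s, overlap in a fraction alpha of their entries."""
    rng = np.random.default_rng(rng)
    n_shared = int(round(alpha * s))
    shared = rng.choice(p, size=n_shared, replace=False)
    rest = np.setdiff1d(np.arange(p), shared)
    extra = rng.choice(rest, size=2 * (s - n_shared), replace=False)
    supp1 = np.concatenate([shared, extra[: s - n_shared]]).astype(int)
    supp2 = np.concatenate([shared, extra[s - n_shared:]]).astype(int)

    B = np.zeros((p, 2))
    for i, supp in enumerate((supp1, supp2)):
        B[supp, i] = 0.5 * rng.choice([-1.0, 1.0], size=s)  # signed, Bmin-style values

    X = [rng.standard_normal((n, p)) for _ in range(2)]
    Y = [X[i] @ B[:, i] + sigma * rng.standard_normal(n) for i in range(2)]
    return X, Y, B
```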
29 We note that all of these block norms are special cases of the CAP family of penalties [12]. [sent-71, score-0.257]
30 This family of block-regularizers (4) suggests a natural family of M -estimators for estimating B, based on solving the block-ℓ1/ℓq-regularized quadratic program B̂ ∈ arg min_{B ∈ Rp×r} { (1/(2n)) Σ_{i=1}^{r} ‖y i − X i β i ‖_2^2 + λn ‖B‖_{ℓ1/ℓq} }, (5) where λn > 0 is a user-defined regularization parameter. [sent-72, score-0.305]
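For q = ∞, the program (5) is a smooth quadratic objective plus a row-separable polyhedral penalty, so one simple way to solve it is proximal gradient descent, using the fact (Moreau decomposition) that the prox of the ℓ∞ norm is the residual of Euclidean projection onto an ℓ1 ball. The sketch below is one possible solver, not the implementation used in the paper; the function names, step-size rule, and iteration count are our choices.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, tau):
    """prox of tau * ||.||_inf via Moreau: v minus projection onto the l1 ball of radius tau."""
    return v - project_l1_ball(v, tau)

def block_l1_linf(X, Y, lam, n_iter=2000):
    """Proximal gradient for
       min_B (1/2n) sum_i ||y^i - X^i b^i||_2^2 + lam * sum_k max_i |B[k, i]|."""
    n, p = X[0].shape
    r = len(X)
    B = np.zeros((p, r))
    # step size 1/L, with L the largest per-problem Lipschitz constant of the smooth part
    L = max(np.linalg.norm(Xi, 2) ** 2 / n for Xi in X)
    t = 1.0 / L
    for _ in range(n_iter):
        G = np.column_stack([Xi.T @ (Xi @ B[:, i] - Y[i]) / n for i, Xi in enumerate(X)])
        V = B - t * G
        B = np.vstack([prox_linf(V[k], t * lam) for k in range(p)])
    return B
```

For other values of q, only the row-wise proximal operator needs to be swapped out.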
31 Note that the data term is separable across the different regression problems i = 1, . [sent-73, score-0.176]
32 Any coupling between the different regression problems is induced by the block-norm regularization. [sent-77, score-0.176]
33 In the special case of univariate regression (r = 1), the parameter q plays no role, and the block-regularized scheme (6) reduces to the Lasso [7, 3]. [sent-78, score-0.176]
34 If q = 1 and r ≥ 2, the block-regularization function (like the data term) is separable across the different regression problems i = 1, . [sent-79, score-0.176]
35 Another important case [9, 8], and the focus of this paper, is block ℓ1/ℓ∞ regularization. [sent-84, score-0.219]
36 The motivation for using block ℓ1/ℓ∞ regularization is to encourage shared sparsity among the columns of the regression matrix B. [sent-85, score-0.64]
37 Geometrically, like the ℓ1 norm that underlies the ordinary Lasso, the ℓ1/ℓ∞ block norm has a polyhedral unit ball. [sent-86, score-0.529]
38 However, the block norm captures potential interactions between the columns β 1 , β 2 , . . . , β r in the matrix B. [sent-87, score-0.316]
39 In this paper, we study the accuracy of the estimate B, as a function of the sample size n, regression dimensions p and r, and the sparsity index s = max_{i=1,...,r} |S(β i )|. [sent-102, score-0.402]
40 In addition, we prove results on support recovery criteria. [sent-114, score-0.439]
41 Recall that for each vector β i ∈ Rp , we use S(β i ) = {k | β^i_k ≠ 0} to denote its support set. [sent-115, score-0.164]
42 The problem of union support recovery corresponds to recovering the set J := ∪_{i=1}^{r} S(β i ), (7) corresponding to the subset J ⊆ {1, . [sent-116, score-0.61]
43 , p} of indices that are active in at least one regression problem. [sent-119, score-0.281]
44 ℓ1/ℓ∞ primal recovery: Solve the block-regularized program (6), thereby obtaining a (primal) optimal solution B ∈ Rp×r , and estimate the signed support vectors [Spri (β i )]k = sign(β^i_k ). (9) [sent-122, score-0.754]
45 ℓ1/ℓ∞ dual recovery: Solve the block-regularized program (6), thereby obtaining an optimal primal solution B ∈ Rp×r . [sent-123, score-0.351]
46 (10) As our development will clarify, this procedure corresponds to estimating the signed support on the basis of a dual optimal solution associated with the optimal primal solution. [sent-132, score-0.685]
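Given a solution B of (6), both estimates can be read off directly; the sketch below implements the primal rule (9) and the max-magnitude reading of the dual rule (10) described later in the text, with a small numerical tolerance (our choice) standing in for exact zeros.

```python
import numpy as np

def primal_signed_support(B_hat, tol=1e-8):
    """Rule (9): keep the sign of every entry whose magnitude exceeds the tolerance."""
    S = np.sign(B_hat)
    S[np.abs(B_hat) <= tol] = 0
    return S.astype(int)

def dual_signed_support(B_hat, tol=1e-8):
    """Rule (10), as described in the text: within each row, an index enters the
    estimate for column i only if |B_hat[k, i]| attains the row maximum."""
    S = np.zeros(B_hat.shape, dtype=int)
    row_max = np.abs(B_hat).max(axis=1)
    for k in range(B_hat.shape[0]):
        if row_max[k] <= tol:
            continue
        hit = np.abs(B_hat[k]) >= row_max[k] - tol
        S[k, hit] = np.sign(B_hat[k, hit]).astype(int)
    return S
```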
47 , r} as a superscript in indexing the different regression problems, or equivalently the columns of the matrix B ∈ Rp×r . [sent-137, score-0.214]
48 3 Main results and their consequences In this section, we provide precise statements of the main results of this paper. [sent-144, score-0.191]
49 Our first main result (Theorem 1) provides sufficient conditions for deterministic design matrices X 1 , . [sent-145, score-0.267]
50 This result allows for an arbitrary number r of regression problems. [sent-149, score-0.176]
51 As discussed in the introduction, we are also interested in the more refined question: can we provide necessary and sufficient conditions that are sharp enough to reveal quantitative differences between ordinary ℓ1 regularization and block regularization? [sent-151, score-0.603]
52 In order to provide precise answers to this question, our final two results concern the special case of r = 2 regression problems, both with supports of size s that overlap in a fraction α of their entries, and with design matrices drawn randomly from the standard Gaussian ensemble. [sent-152, score-0.778]
53 In this setting, our final two results (Theorems 2 and 3) show that block ℓ1/ℓ∞ regularization undergoes a phase transition specified by the rescaled sample size. [sent-153, score-0.597]
54 We assume that the columns of each design matrix X i , i = 1, . [sent-164, score-0.141]
55 We also require that the following incoherence condition on the design matrices is satisfied: there exists some γ ∈ (0, 1] such that max_{ℓ=1,...,|J^c|} Σ_{i=1}^{r} ‖(X^i_ℓ)^T X^i_J ((X^i_J)^T X^i_J)^{-1}‖_1 ≤ (1 − γ). (14) [sent-172, score-0.205]
56 We also define the support minimum value Bmin = min_{k∈J} max_{i=1,...,r} |β^i_k|. [sent-175, score-0.164]
57 For a parameter ξ > 1 (to be chosen by the user), we define the probability φ1 (ξ, p, s) := 1 − 2 exp(−(ξ − 1)[r + log p]) − 2 exp(−(ξ^2 − 1) log(rs)), (15) which specifies the precise rate with which the “high probability” statements in Theorem 1 hold. [sent-178, score-0.177]
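For given designs and a candidate support J, condition (14) and the quantity Bmin can be checked numerically. The sketch below follows the transcription of (14) written above, so it should be read as our reconstruction rather than a verified restatement of the paper's condition.

```python
import numpy as np

def incoherence_slack(X_list, J):
    """Return gamma such that the left-hand side of (14) equals 1 - gamma
    (a positive value means the incoherence condition holds)."""
    p = X_list[0].shape[1]
    Jc = np.setdiff1d(np.arange(p), J)
    lhs = 0.0
    for ell in Jc:
        total = 0.0
        for Xi in X_list:
            XJ = Xi[:, J]
            # coefficients of the regression of column ell onto the columns in J
            coef = np.linalg.solve(XJ.T @ XJ, XJ.T @ Xi[:, ell])
            total += np.abs(coef).sum()
        lhs = max(lhs, total)
    return 1.0 - lhs

def support_minimum(B, J):
    """B_min = min over k in J of max over i of |beta^i_k|."""
    return np.abs(B[J]).max(axis=1).min()
```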
58 Consider the observation model (3) with design matrices X i satisfying the column bound (13) and incoherence condition (14). [sent-180, score-0.237]
59 Then with probability greater than φ1 (ξ, p, s) → 1, we are guaranteed that: (a) the block-regularized program has a unique solution B̂ satisfying the elementwise bound max_{i=1,...,r} max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ b1 (ξ, ρn , n, s). (16) [sent-182, score-0.229]
60 (b) If in addition Bmin ≥ b1 (ξ, ρn , n, s), then the solution B̂ correctly specifies the union of supports, ∪_{i=1}^{r} S(β̂ i ) = J. [sent-189, score-0.292]
61 Remarks: To clarify the scope of the claims, part (b) guarantees that the estimator recovers the union support J correctly, whereas part (a) guarantees that for any given i = 1, . . . [sent-190, score-0.433]
62 However, within the union support J, when using the primal recovery method, it is possible to have false non-zeros—i.e. [sent-195, score-0.661]
63 Of course, this cannot occur if the support sets S(β i ) are all equal. [sent-198, score-0.164]
64 This phenomenon is related to geometric properties of the block ℓ1/ℓ∞ norm: in particular, for any given index k, when β^j_k = 0 for some j ∈ {1, . [sent-199, score-0.279]
65 The dual signed support recovery method (10) is more conservative in estimating the individual support sets. [sent-203, score-1.042]
66 , r}, it only allows an index k to enter the signed support estimate Sdua (β i ) when |β^i_k | achieves the maximum magnitude (possibly non-unique) across all indices i = 1, . . . [sent-207, score-0.604]
67 Consequently, Theorem 1 guarantees that the dual signed support method will never incorrectly include an index in the individual supports. [sent-211, score-0.699]
68 However, it may incorrectly exclude indices of some supports, but like the primal support estimator, it is always guaranteed to correctly recover the union of supports J. [sent-212, score-0.75]
69 We note that it is possible to ensure, under some conditions, that the dual support method will correctly recover each of the individual signed supports, without any incorrect exclusions. [sent-213, score-0.808]
70 However, as illustrated by Theorem 2, doing so requires additional assumptions on the size of the gap |β^j_k | − |β^i_k | for indices k ∈ B := S(β i ) ∩ S(β j ). [sent-214, score-0.194]
71 2 Sharp results for standard Gaussian ensembles Our results thus far show that under standard mutual incoherence or irrepresentability conditions, the block ℓ1/ℓ∞ method produces consistent estimators for n = Ω(s log(p − s)). [sent-216, score-0.361]
72 In qualitative terms, these results match known scaling for the Lasso, or ordinary ℓ1-regularization. [sent-217, score-0.243]
73 We consider a sequence of models indexed by the triplet (p, s, α), corresponding to the problem dimension p, support sizes s, and overlap parameter α. [sent-225, score-0.246]
74 We can then study the probability of successful recovery as a function of the model triplet, and the sample size n. [sent-230, score-0.356]
75 In order to state our main result, we define the order parameter or rescaled sample size θ1,∞ (n, p, s, α) := n / [(4 − 3α)s log(p − (2 − α)s)]. [sent-231, score-0.177]
76 We also define the support gap value Bgap = |β^1_B | − |β^2_B |, as well as the c∞-gap c∞ = (1/ρn) ‖T (Bgap )‖_∞ , where T (Bgap ) = ρn ∧ Bgap . [sent-232, score-0.251]
77 1 Sufficient conditions We begin with a result that provides sufficient conditions for support recovery using block ℓ1/ℓ∞ regularization. [sent-235, score-0.888]
78 If we solve the block-regularized program (6) with ρn = ξ √(log p / n) and c∞ → 0, then with probability greater than 1 − c1 exp(−c2 log(p − (2 − α)s)), the following properties hold: (i) The block ℓ1,∞-program (6) has a unique solution (β̂ 1 , β̂ 2 ), with supports S(β̂ 1 ) ⊆ J and S(β̂ 2 ) ⊆ J. [sent-241, score-0.586]
79 The estimates satisfy the elementwise bound max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ ξ [ √(100 log(s)/n) + ρn ( √(4s/n) + 1 ) ] =: b3 (ξ, ρn , n, s). (17) (ii) If the support minimum Bmin > 2b3 (ξ, ρn , n, s), then the primal support method successfully recovers the support union J = S(β 1 ) ∪ S(β 2 ). [sent-245, score-0.714]
80 Moreover, using the primal signed support recovery method (9), we have [Spri (β i )]k = sign(β^i_k ) for all k ∈ S(β i ). [sent-246, score-0.881]
81 2 Necessary conditions We now turn to the question of finding matching necessary conditions for support recovery. [sent-249, score-0.358]
82 (a) For problem sequences (n, p, s, α) such that θ1,∞ (n, p, s, α) < 1 − δ for some δ > 0 and for any non-increasing regularization sequence ρn > 0, no solution B = (β 1 , β 2 ) to the block-regularized program (6) has the correct support union S(β 1 ) ∪ S(β 2 ). [sent-255, score-0.489]
83 (b) Recalling the definition of Bgap , define the rescaled gap limit c2 (ρn , Bgap ) := lim sup_{(n,p,s)} ‖T (Bgap )‖_2 / (ρn √s). [sent-256, score-0.183]
84 If the sample size n is bounded as n < (1 − δ) [(4 − 3α) + (c2 (ρn , Bgap ))^2] s log[p − (2 − α)s] for some δ > 0, then the dual recovery method (10) fails to recover the individual signed supports. [sent-257, score-0.851]
85 However, note that if c2 does not go to 0, then in fact, the method could fail to recover the correct support even if θ1,∞ > 1 + δ. [sent-259, score-0.22]
86 Namely, if the gap is too large, then the sampling efficiency is greatly reduced compared to when the gap is very small. [sent-262, score-0.174]
87 It shows that if the gap is too large, then correct joint support recovery is not possible. [sent-264, score-0.526]
88 3 Illustrative simulations and some consequences In this section, we provide illustrative simulations of the phase transitions predicted by Theorems 2 and 3, and show that the theory provides an accurate description of practice even for relatively small problem sizes (e.g. [sent-266, score-0.217]
89 Figure 1 plots the probability of successful recovery of the individual signed supports using dual support recovery (10)—namely, P[Sdua (β i ) = S± (β i ) for i = 1, 2]—versus the order parameter θ1,∞ (n, p, s, α). [sent-269, score-1.347]
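A sketch of the kind of experiment behind such a plot, reusing the illustrative helpers sketched earlier (make_joint_regression, block_l1_linf, primal_signed_support, dual_signed_support); the regularization choice, tolerance, and trial count are ours, not the paper's.

```python
import numpy as np

def success_probability(p, s, alpha, theta_grid, n_trials=20, xi=2.0):
    """Empirical probability that dual recovery returns both signed supports,
    swept over the rescaled sample size theta_{1,inf}."""
    probs = []
    for theta in theta_grid:
        n = int(theta * (4 - 3 * alpha) * s * np.log(p - (2 - alpha) * s))
        hits = 0
        for trial in range(n_trials):
            X, Y, B = make_joint_regression(n, p, s, alpha, rng=trial)
            lam = xi * np.sqrt(np.log(p) / n)          # rho_n-style regularization choice
            B_hat = block_l1_linf(X, Y, lam)
            S_hat = dual_signed_support(B_hat, tol=1e-3)
            S_true = primal_signed_support(B)          # true signed supports
            hits += int(np.array_equal(S_hat, S_true))
        probs.append(hits / n_trials)
    return probs
```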
90 The stacking behavior of these curves demonstrates that we have isolated the correct order parameter, and the step-function behavior is consistent with the theoretical predictions of a sharp threshold. [sent-275, score-0.233]
91 Theorems 2 and 3 have some interesting consequences, particularly in comparison to the behavior of the “naive” Lasso-based individual decoding of signed supports—that is, the method that simply applies the Lasso (ordinary ℓ1-regularization) to each column i = 1, 2 separately. [sent-276, score-0.402]
92 By known results [10] on the Lasso, the performance of this naive approach is governed by the order parameter θLas (n, p, s) = n / [2s log(p − s)], (19) meaning that for any δ > 0, it succeeds for sequences such that θLas > 1 + δ, and conversely fails for sequences such that θLas < 1 − δ. [sent-277, score-0.177]
93 A value of R < 1 implies that the block method is more efficient, while R > 1 implies that the naive method is more efficient. [sent-279, score-0.279]
94 The relative efficiency of the block ℓ1,∞ program (6) compared to the Lasso is given by R(θ1,∞ , θLas ) = (4 − 3α) log(p − (2 − α)s) / [2 log(p − s)]. [sent-281, score-0.328]
95 Thus, for sublinear sparsity s/p → 0, the block scheme has greater statistical efficiency for all overlaps α ∈ (2/3, 1], but lower statistical efficiency for overlaps α ∈ [0, 2/3). [sent-282, score-0.388]
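The crossover at α = 2/3 can be checked directly from this ratio; the values below are illustrative.

```python
import numpy as np

def relative_efficiency(p, s, alpha):
    """R = (4 - 3*alpha) * log(p - (2 - alpha)*s) / (2 * log(p - s))."""
    return (4 - 3 * alpha) * np.log(p - (2 - alpha) * s) / (2 * np.log(p - s))

for alpha in (0.0, 1 / 3, 2 / 3, 1.0):
    print(alpha, relative_efficiency(p=10_000, s=10, alpha=alpha))
# prints roughly 2.0, 1.5, 1.0, 0.5: R < 1 (block wins) only once alpha > 2/3
```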
96 Probability of success in recovering the joint signed supports plotted against the control parameter θ1,∞ = n/[2s log(p − (2 − α)s)] for linear sparsity s = 0. [sent-296, score-0.67]
97 Each stack of graphs corresponds to a fixed overlap α, as labeled on the figure. [sent-298, score-0.221]
98 The three curves within each stack correspond to problem sizes p ∈ {128, 256, 512}; note how they all align with each other and exhibit step-like behavior, consistent with Theorems 2 and 3. [sent-299, score-0.215]
99 Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. [sent-335, score-0.502]
100 Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. [sent-368, score-0.403]
wordName wordTfidf (topN-words)
[('signed', 0.318), ('recovery', 0.275), ('bgap', 0.251), ('block', 0.219), ('supports', 0.194), ('lasso', 0.191), ('las', 0.188), ('regression', 0.176), ('support', 0.164), ('ordinary', 0.137), ('sdua', 0.126), ('primal', 0.124), ('rp', 0.122), ('overlap', 0.119), ('program', 0.109), ('scaling', 0.106), ('design', 0.103), ('designs', 0.102), ('stack', 0.102), ('union', 0.098), ('conditions', 0.097), ('rescaled', 0.096), ('sign', 0.095), ('bmin', 0.094), ('scalings', 0.094), ('berkeley', 0.088), ('gap', 0.087), ('sparsity', 0.085), ('dual', 0.079), ('consequences', 0.078), ('regularization', 0.077), ('cmin', 0.076), ('precise', 0.074), ('sharp', 0.073), ('recovering', 0.073), ('undergoes', 0.071), ('matrices', 0.067), ('incoherence', 0.067), ('log', 0.064), ('negahban', 0.063), ('perils', 0.063), ('sahand', 0.063), ('spri', 0.063), ('indices', 0.062), ('uc', 0.061), ('ciency', 0.061), ('phase', 0.06), ('naive', 0.06), ('index', 0.06), ('norm', 0.059), ('recover', 0.056), ('theorems', 0.055), ('polyhedral', 0.055), ('correctly', 0.052), ('rs', 0.052), ('elementwise', 0.05), ('dmax', 0.05), ('converges', 0.048), ('clarify', 0.047), ('xj', 0.046), ('theorem', 0.046), ('size', 0.045), ('shared', 0.045), ('structural', 0.045), ('triplet', 0.045), ('sparse', 0.044), ('active', 0.043), ('quadratic', 0.043), ('overlaps', 0.042), ('harmful', 0.042), ('illustrative', 0.042), ('individual', 0.042), ('behavior', 0.042), ('wi', 0.041), ('sequences', 0.041), ('cg', 0.041), ('disease', 0.041), ('curves', 0.04), ('statements', 0.039), ('ensembles', 0.039), ('bk', 0.039), ('notation', 0.039), ('thereby', 0.039), ('columns', 0.038), ('family', 0.038), ('transition', 0.038), ('sizes', 0.037), ('gaussian', 0.037), ('maxi', 0.037), ('begin', 0.036), ('consistent', 0.036), ('sample', 0.036), ('guarantees', 0.036), ('multivariate', 0.036), ('meaning', 0.035), ('coef', 0.035), ('max', 0.035), ('covariate', 0.035), ('donoho', 0.035), ('department', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 179 nips-2008-Phase transitions for high-dimensional joint support recovery
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations.
2 0.37169451 99 nips-2008-High-dimensional support union recovery in multivariate regression
Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan
Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2 (n, p, s) := n/[2ψ(B ∗ ) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B ∗ ) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1-approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the ℓ1-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1-regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d² log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
4 0.20395224 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.
5 0.16978638 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
6 0.15991321 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
7 0.14663361 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
8 0.13219996 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
9 0.10771056 106 nips-2008-Inferring rankings under constrained sensing
10 0.10134763 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.085814431 62 nips-2008-Differentiable Sparse Coding
12 0.085209057 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
13 0.080269255 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
14 0.078900404 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
15 0.07612934 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
16 0.075052746 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
17 0.073672734 231 nips-2008-Temporal Dynamics of Cognitive Control
18 0.071727119 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
19 0.071655095 48 nips-2008-Clustering via LP-based Stabilities
20 0.071380556 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
topicId topicWeight
[(0, -0.266), (1, -0.036), (2, -0.149), (3, 0.188), (4, 0.18), (5, -0.011), (6, -0.197), (7, -0.036), (8, -0.093), (9, 0.139), (10, -0.242), (11, -0.113), (12, -0.081), (13, -0.078), (14, -0.07), (15, 0.103), (16, -0.014), (17, -0.078), (18, 0.011), (19, 0.012), (20, 0.001), (21, -0.024), (22, 0.008), (23, 0.001), (24, 0.07), (25, 0.044), (26, 0.051), (27, 0.123), (28, 0.048), (29, 0.064), (30, 0.084), (31, 0.071), (32, -0.039), (33, -0.177), (34, 0.038), (35, -0.043), (36, -0.003), (37, 0.109), (38, 0.013), (39, -0.041), (40, 0.114), (41, 0.087), (42, 0.05), (43, 0.114), (44, 0.046), (45, -0.05), (46, 0.113), (47, -0.012), (48, -0.004), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.97299802 179 nips-2008-Phase transitions for high-dimensional joint support recovery
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations.
2 0.95393634 99 nips-2008-High-dimensional support union recovery in multivariate regression
Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan
Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2 (n, p, s) := n/[2ψ(B ∗ ) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B ∗ ) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1-approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the ℓ1-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1-regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d² log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
4 0.74032885 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.
5 0.59172189 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
Author: Pierre Garrigues, Laurent E. Ghaoui
Abstract: It has been shown that the problem of ℓ1-penalized least-square regression commonly referred to as the Lasso or Basis Pursuit DeNoising leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point.
6 0.55674934 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
7 0.48808861 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
8 0.43207204 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
9 0.42236856 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
10 0.4104816 106 nips-2008-Inferring rankings under constrained sensing
11 0.40522683 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
12 0.37150723 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
13 0.36948472 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
14 0.36898485 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
15 0.3620401 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
16 0.36135256 185 nips-2008-Privacy-preserving logistic regression
17 0.3561582 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
18 0.35379741 85 nips-2008-Fast Rates for Regularized Objectives
19 0.3440946 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
20 0.33257294 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
topicId topicWeight
[(6, 0.091), (7, 0.08), (12, 0.046), (16, 0.017), (24, 0.211), (27, 0.016), (28, 0.173), (57, 0.05), (59, 0.048), (63, 0.027), (71, 0.04), (77, 0.05), (83, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.85614562 179 nips-2008-Phase transitions for high-dimensional joint support recovery
Author: Sahand Negahban, Martin J. Wainwright
Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations.
2 0.76851779 99 nips-2008-High-dimensional support union recovery in multivariate regression
Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan
Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2 (n, p, s) := n/[2ψ(B ∗ ) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B ∗ ) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1-approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the ℓ1-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1-regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d² log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
4 0.7439658 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
5 0.73822999 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.
6 0.73289478 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
7 0.7327503 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
8 0.7321887 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
9 0.73059064 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
10 0.72809768 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.72755122 196 nips-2008-Relative Margin Machines
12 0.72676587 62 nips-2008-Differentiable Sparse Coding
13 0.72628963 194 nips-2008-Regularized Learning with Networks of Features
14 0.72605211 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
15 0.72462642 143 nips-2008-Multi-label Multiple Kernel Learning
16 0.7238819 75 nips-2008-Estimating vector fields using sparse basis field expansions
17 0.71984315 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
18 0.71964669 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
19 0.71940798 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
20 0.71795607 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach