
99 nips-2008-High-dimensional support union recovery in multivariate regression


Source: pdf

Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan

Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as the sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2(n, p, s) := n/[2ψ(B∗) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B∗) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1 approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 High-dimensional support union recovery in multivariate regression Guillaume Obozinski Department of Statistics UC Berkeley gobo@stat. [sent-1, score-0.588]

2 We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. [sent-10, score-0.276]

3 The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. [sent-11, score-0.671]

4 Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B ∗ ) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. [sent-13, score-0.494]

5 We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems. [sent-15, score-0.13]

6 1 Introduction A recent line of research in machine learning has focused on regularization based on block-structured norms. [sent-16, score-0.125]

7 Such structured norms are well motivated in various settings, among them kernel learning [3, 8], grouped variable selection [12], hierarchical model selection [13], simultaneous sparse approximation [10], and simultaneous feature selection in multi-task learning [7]. [sent-17, score-0.25]

8 The focus of this paper is the model selection consistency of block-structured regularization in the setting of multivariate regression. [sent-19, score-0.223]

9 Our goal is to perform model or variable selection, by which we mean extracting the subset of relevant covariates that are active in at least one regression. [sent-20, score-0.096]

10 We refer to this problem as the support union problem. [sent-21, score-0.295]

11 As in much recent work (e.g., [2, 9, 14, 11]), our analysis is high-dimensional in nature, meaning that we allow the model dimension p (as well as other structural parameters) to grow along with the sample size n. [sent-24, score-0.12]

12 A great deal of work has focused on the case of ordinary ℓ1-regularization (the Lasso) [2, 11, 14], showing for instance that the Lasso can recover the support of a sparse signal even when p ≫ n. [sent-25, score-0.165]

13 Some more recent work has studied consistency issues for block-regularization schemes, including classical analysis (p fixed) of the group Lasso [1], and high-dimensional analysis of the predictive risk of block-regularized logistic regression [5]. [sent-26, score-0.152]

14 In this paper, our goal is to understand the following question: under what conditions does block regularization lead to a quantifiable improvement in statistical efficiency, relative to more naive regularization schemes? [sent-28, score-0.413]

15 Here statistical efficiency is assessed in terms of the sample complexity, meaning the minimal sample size n required to recover the support union; we wish to know how this scales as a function of problem parameters. [sent-29, score-0.342]

16 More specifically, we consider the following problem of multivariate linear regression: a group of K scalar outputs are regressed on the same design matrix X ∈ Rn×p . [sent-31, score-0.257]

17 Representing the regression coefficients as a p × K matrix B∗, the regression model takes the form Y = XB∗ + W, (1) where Y ∈ R^{n×K} and W ∈ R^{n×K} are matrices of observations and zero-mean noise respectively, and B∗ has columns β∗(1), . [sent-32, score-0.361]
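
Model (1) is straightforward to simulate. The sketch below is an illustration rather than the authors' code; the problem sizes, the ±1/√2 coefficient values, and the noise level sigma are assumptions chosen for concreteness. It draws a Gaussian design and forms Y = XB∗ + W with a coefficient matrix whose row support has size s.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, K, s = 200, 512, 2, 16    # illustrative problem sizes (assumed)
sigma = 0.5                      # noise standard deviation (assumed)

# Coefficient matrix B* with row support S = {0, ..., s-1}.
B_star = np.zeros((p, K))
B_star[:s, :] = rng.choice([-1.0, 1.0], size=(s, K)) / np.sqrt(2)

X = rng.standard_normal((n, p))          # design matrix with i.i.d. N(0, 1) entries
W = sigma * rng.standard_normal((n, K))  # zero-mean noise matrix
Y = X @ B_star + W                       # observation model (1): Y = X B* + W
```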

18 , β∗(K), which are the parameter vectors of each univariate regression. [sent-35, score-0.11]

19 We are interested in recovering the union of the supports of individual regressions, more specifically ∗(k) if Sk = i ∈ {1, . [sent-36, score-0.29]

20 More generally, block-norm regularizations can be thought of as the relaxation of a non-convex regularization which counts the number of covariates i for which at least one of the univariate regression parameters β_i^{∗(k)} is non-zero. [sent-43, score-0.41]

21 In particular, the ℓ1/ℓ1 regularization is the same as the usual Lasso. [sent-48, score-0.125]

22 While ℓ1/ℓ∞ is of interest, it seems intuitively to be relevant essentially to situations where the support is exactly the same for all regressions, an assumption that we are not willing to make. [sent-50, score-0.095]

23 In the current paper, we focus on the ℓ1/ℓ2 case and consider the estimator B obtained by solving the following disguised second-order cone program: min_{B ∈ R^{p×K}} (1/(2n)) |||Y − XB|||_F² + λ_n ||B||_{ℓ1/ℓ2}, (2) where |||M|||_F := (Σ_{i,j} m_ij²)^{1/2} denotes the Frobenius norm. [sent-51, score-0.098]
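
Problem (2) can be handed to an off-the-shelf SOCP solver, but a plain proximal-gradient loop with block soft-thresholding of the rows is enough to experiment with it. The sketch below is one possible implementation of the ℓ1/ℓ2-penalized least squares above, not the authors' code; the step size, iteration count, and the 1e-6 support threshold are assumptions.

```python
import numpy as np

def block_soft_threshold(B, t):
    # Row-wise prox of t * sum_i ||B_i||_2: shrink each row's l2 norm by t.
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * B

def l1_l2_regression(X, Y, lam, n_iter=2000):
    """Proximal-gradient sketch for min_B (1/2n)||Y - XB||_F^2 + lam * sum_i ||B_i||_2."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    L = np.linalg.eigvalsh(X.T @ X / n).max()
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = block_soft_threshold(B - grad / L, lam / L)
    return B

# Estimated support union: rows of the solution with non-negligible l2 norm,
# e.g. S_hat = {i : ||B_i||_2 > 1e-6}.
```

An accelerated variant (FISTA) or block-coordinate updates would converge faster; the plain loop keeps the correspondence with objective (2) transparent.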

24 We study the support union problem under high-dimensional scaling, meaning that the number of observations n, the ambient dimension p and the size of the union of supports s can all tend to infinity. [sent-52, score-0.76]

25 More precisely, the probability of correct support union recovery converges to one for all sequences (n, p, s, B∗) such that the control parameter θℓ1/ℓ2(n, p; B∗) exceeds a fixed critical threshold θcrit < +∞. [sent-54, score-0.562]

26 Note that θℓ1/ℓ2 is a measure of the sample complexity of the problem—that is, the sample size required for exact recovery as a function of the problem parameters. [sent-55, score-0.35]

27 Section 4 illustrates with simulations the sharpness of our analysis and how quickly the asymptotic regime arises. [sent-59, score-0.1]

28 2 Main result and some consequences The analysis of this paper applies to multivariate linear regression problems of the form (1), in which the noise matrix W ∈ R^{n×K} is assumed to consist of i.i.d. entries. [sent-65, score-0.251]

29 Suppose that we partition the full set of covariates into the support set S and its complement S c , with |S| = s, |S c | = p − s. [sent-73, score-0.221]

30 Consider the following block decompositions of the regression coefficient matrix, the design matrix and its covariance matrix: B∗ = [B∗_S ; B∗_{S^c}], X = [X_S, X_{S^c}], and Σ = [Σ_SS, Σ_{S S^c} ; Σ_{S^c S}, Σ_{S^c S^c}]. [sent-74, score-0.343]

31 We use β∗_i to denote the ith row of B∗, and assume that the sparsity of B∗ is assessed as follows: (A0) Sparsity: the matrix B∗ has row support S := {i ∈ {1, . [sent-75, score-0.382]

32 In addition, we make the following assumptions about the covariance Σ of the design matrix: (A1) Bounded eigenspectrum: there exists a constant Cmin > 0 (resp. [sent-79, score-0.081]

33 Assumption A1 is a standard condition required to prevent excess dependence among elements of the design matrix associated with the support S. [sent-85, score-0.27]

34 The mutual incoherence assumption A2 is also well known from previous work on model selection with the Lasso [10, 14]. [sent-86, score-0.101]

35 More generally, it can be shown that various matrix classes satisfy these conditions [14, 11]. [sent-88, score-0.174]

36 1 Statement of main result With the goal of estimating the union of supports S, our main result is a set of sufficient conditions using the following procedure. [sent-90, score-0.412]

37 Solve the block-regularized problem (2) with regularization parameter λn > 0, thereby obtaining a solution B = B(λn ). [sent-91, score-0.193]

38 Use this solution to compute an estimate of the support union as S(B) : = i ∈ {1, . [sent-92, score-0.325]

39 This estimator is unambiguously defined if the solution B is unique, and as part of our analysis, we show that the solution B is indeed unique with high probability in the regime of interest. [sent-96, score-0.144]

40 We study the behavior of this estimator for a sequence of linear regressions indexed by the triplet (n, p, s), for which the data follows the general model presented in the previous section with defining parameters B∗(n) and Σ(n) satisfying A0-A3. [sent-97, score-0.223]

41 As (n, p, s) tends to infinity, we give conditions on the triplet and properties of B ∗ for which B is unique, and such that P[S = S] → 1. [sent-98, score-0.098]

42 The central objects in our main result are the sparsity-overlap function and the sample complexity parameter, which we define here. [sent-99, score-0.164]

43 We extend the function ζ to any matrix BS ∈ Rs×K with non-zero rows by defining the matrix ζ(BS ) ∈ Rs×K with ith row [ζ(BS )]i = ζ(βi ). [sent-101, score-0.216]

44 With this notation, we define the sparsity-overlap function ψ(B) and the sample complexity parameter θℓ1/ℓ2(n, p; B∗) as ψ(B) := ζ(B_S)^T (Σ_SS)^{-1} ζ(B_S) and θℓ1/ℓ2(n, p; B∗) := n / [2 ψ(B∗) log(p − s)]. (4) [sent-102, score-0.172]

45 Finally, we use b∗_min := min_{i∈S} ||β∗_i||_2 to denote the minimal ℓ2 row-norm of the matrix B∗_S. [sent-103, score-0.083]
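
The sketch below makes the order parameter concrete. Two assumptions are made for illustration, since the extracted text does not spell them out: ζ is taken to map each nonzero row of B_S to its unit-ℓ2-norm version (which reduces to the sign vector when K = 1, as in the special case discussed below), and ψ is scalarized as the largest eigenvalue (spectral norm) of ζ(B_S)^T (Σ_SS)^{-1} ζ(B_S), which is consistent with the interval [s/(K Cmax), s/Cmin] quoted later.

```python
import numpy as np

def zeta(B_S):
    # Assumed definition: each (nonzero) row of B_S rescaled to unit l2 norm.
    return B_S / np.linalg.norm(B_S, axis=1, keepdims=True)

def sparsity_overlap(B_S, Sigma_SS):
    # psi(B*): spectral norm of zeta(B_S)^T Sigma_SS^{-1} zeta(B_S) (assumed scalarization).
    Z = zeta(B_S)
    M = Z.T @ np.linalg.solve(Sigma_SS, Z)
    return np.linalg.eigvalsh(M).max()

def theta_l1l2(n, p, s, psi):
    # Order parameter of display (4): theta = n / (2 * psi * log(p - s)).
    return n / (2.0 * psi * np.log(p - s))
```

For K = 1 with Σ_SS = I_s this returns ψ = s, so θℓ1/ℓ2 reduces to the Lasso order parameter n/(2s log(p − s)) discussed in the special case below.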

46 N(0, Σ) row vectors, an observation matrix Y specified by model (1), and a regression matrix B∗ such that (b∗_min)² decays strictly more slowly than f(p) max{s, log(p − s)}, for any function f(p) → +∞. [sent-108, score-0.396]

47 Suppose that we solve the block-regularized program (2) with regularization parameter λ_n = Θ(√(f(p) log(p)/n)). [sent-109, score-0.192]

48 (ii) A technical condition that we require on the regularization parameter is λ_n² n / log(p − s) → ∞, (5) which is satisfied by the choice given in the statement. [sent-112, score-0.205]

49 The simplest special case is the univariate regression problem (K = 1), in which case the function ζ(β∗) outputs an s-dimensional sign vector with elements z∗_i = sign(β∗_i), so that ψ(β∗) = (z∗)^T (Σ_SS)^{-1} z∗ = Θ(s). [sent-115, score-0.237]

50 Consequently, the order parameter of block ℓ1/ℓ2 regression for univariate regression is given by Θ(n/(2s log(p − s))), which matches the scaling established in previous work on the Lasso [11]. [sent-116, score-0.208]

51 More generally, given our assumption (A1) on Σ_SS, the sparsity overlap ψ(B∗) always lies in the interval [s/(K Cmax), s/Cmin]. [sent-117, score-0.105]

52 At the most pessimistic extreme, suppose that B∗ := β∗ 1_K^T—that is, B∗ consists of K copies of the same coefficient vector β∗ ∈ R^p, with support of cardinality |S| = s. [sent-118, score-0.168]

53 This might seem a pessimistic result, since under model (1), we essentially have Kn observations of the coefficient vector β ∗ with the same design matrix but K independent noise realizations. [sent-120, score-0.174]

54 At the most optimistic extreme, consider the case where ΣSS = Is and (for s > K) suppose that B ∗ is constructed such that the columns of the s × K matrix ζ(B ∗ ) are all orthogonal and of equal length. [sent-122, score-0.198]

55 If the columns of the matrix ζ(B∗) are all orthogonal with equal length and Σ_SS = I_{s×s}, then the block-regularized problem (2) succeeds in union support recovery once the sample complexity parameter n/(2 (s/K) log(p − s)) is larger than 1. [sent-124, score-0.79]

56 Consequently, Corollary 1 shows that under suitable conditions on the regression coefficient matrix B∗, ℓ1/ℓ2 regularization can provide a K-fold reduction in the number of samples required for exact support recovery. [sent-126, score-0.357]

57 As a third illustration, consider, for ΣSS = Is×s , the case where the supports Sk of individual regression problems are all disjoint. [sent-127, score-0.207]

58 The sample complexity parameter for each of the individual Lassos is n/(2sk log(p − sk )) where |Sk | = sk , so that the sample size required to recover the support union from individual Lassos scales as n = Θ(maxk [sk log(p − sk )]). [sent-128, score-0.998]

59 However, if the supports are all disjoint, then the columns of the matrix Z∗_S = ζ(B∗_S) are orthogonal, and (Z∗_S)^T Z∗_S = diag(s_1, . [sent-129, score-0.217]

60 , s_K), so that ψ(B∗) = max_k s_k and the sample complexity is the same. [sent-132, score-0.304]
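
The three regimes just discussed (identical regressions, orthogonal columns of ζ(B∗), and disjoint supports) can be checked numerically. This is a small illustration under the same assumptions as the earlier sketch (row-normalizing ζ and spectral-norm scalarization of ψ), with Σ_SS = I_s and arbitrary sizes s = 8, K = 2.

```python
import numpy as np

def psi_identity_design(B):
    # psi(B*) with Sigma_SS = I: largest eigenvalue of zeta(B)^T zeta(B) (assumed scalarization).
    Z = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.linalg.eigvalsh(Z.T @ Z).max()

s, K = 8, 2

B_identical = np.ones((s, K))            # K copies of one vector: psi = s
B_orthogonal = np.ones((s, K))
B_orthogonal[::2, 1] = -1.0              # columns of zeta(B) orthogonal: psi = s/K
B_disjoint = np.zeros((s, K))
B_disjoint[: s // 2, 0] = 1.0            # disjoint supports of sizes s/2 each:
B_disjoint[s // 2 :, 1] = 1.0            # psi = max_k s_k = s/2

for name, B in [("identical", B_identical),
                ("orthogonal", B_orthogonal),
                ("disjoint", B_disjoint)]:
    print(name, psi_identity_design(B))   # expected: 8.0, 4.0, 4.0
```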

61 However, this is not always true if Σ_SS ≠ I_{s×s}, and in many situations ℓ1/ℓ2 regularization can have higher sample complexity than separate Lassos. [sent-134, score-0.134]

62 We then construct a primal matrix B along with a dual matrix Z such that, under the conditions of Theorem 1, with probability converging to 1: (a) The pair (B, Z) satisfies the Karush-Kuhn-Tucker (KKT) conditions of the SOCP. [sent-137, score-0.43]

63 (b) In spite of the fact that for general high-dimensional problems (with p ≫ n), the SOCP need not have a unique solution a priori, a strict feasibility condition satisfied by the dual variables Z guarantees that B is the unique optimal solution of (2). [sent-138, score-0.272]

64 (c) The support union Ŝ of B is identical to the support union S of B∗. [sent-139, score-0.59]

65 At the core of our constructive procedure is the following convex-analytic result, which characterizes an optimal primal-dual pair for which the primal solution B correctly recovers the support set S: Lemma 1. [sent-140, score-0.17]

66 1 Construction of primal-dual witness Based on Lemma 1, we construct the primal dual pair (B, Z) as follows. [sent-145, score-0.13]

67 Since s < n, the empirical covariance (sub)matrix Σ̂_SS = (1/n) X_S^T X_S is strictly positive definite with probability one, which implies that the restricted problem (7) is strictly convex and therefore has a unique optimum B_S. [sent-148, score-0.081]

68 Since any such matrix ZS is also a dual solution to the SOCP (7), it must be an element of the subdifferential ∂ BS 1 / 2 . [sent-150, score-0.153]

69 From equation (6b) and since Σ̂_SS is invertible, we may solve as follows: (B_S − B∗_S) = (Σ̂_SS)^{-1} [ (1/n) X_S^T W − λ_n Z_S ] =: U_S. (8) For any row i ∈ S, we have ||β_i||_2 ≥ ||β∗_i||_2 − ||U_S||_{∞/2}. [sent-153, score-0.091]

70 Thus, it suffices to show that the event E(U_S) := { ||U_S||_{∞/2} ≤ (1/2) b∗_min } (9) occurs with high probability, in order to show that no row of B_S is identically zero. [sent-155, score-0.085]

71 Turning to condition (6c), by substituting expression (8) for the difference (B_S − B∗_S) into equation (6c), we obtain a (p − s) × K random matrix V_{S^c}, whose row j ∈ S^c is given by V_j := X_j^T { (Π_S − I_n) W/n − λ_n (X_S/n) (Σ̂_SS)^{-1} Z_S }. (10) [sent-157, score-0.175]

72 In order for condition (6c) to hold, it is necessary and sufficient that the probability of the event E(V_{S^c}) := { ||V_{S^c}||_{∞/2} < λ_n } (11) converges to one as n tends to infinity. [sent-158, score-0.083]

73 Until now, we haven’t appealed to the sample complexity parameter θℓ1/ℓ2(n, p; B∗). [sent-171, score-0.172]

74 Conditionally on W and X_S, we have ||V_j − E[V_j | X_S, W]||_2², given (W, X_S), distributed as (Σ_{S^c|S})_{jj} ξ_j^T M_n ξ_j, where ξ_j ∼ N(0_K, I_K) and where the K × K matrix M_n = M_n(X_S, W) is given by M_n = (λ_n²/n) Z_S^T (Σ̂_SS)^{-1} Z_S + (1/n²) W^T (Π_S − I_n) W. (13) [sent-174, score-0.147]

75 But the covariance matrix M_n is itself concentrated. [sent-175, score-0.114]

76 Under the conditions (5) of Theorem 1, for any δ > 0, the following event T(δ) has probability converging to 1: T(δ) := { |||M_n|||_2 ≤ (λ_n²/n) ψ(B∗)(1 + δ) }. [sent-177, score-0.158]

77 Given that (Σ_{S^c|S})_{jj} ≤ (Σ_{S^c S^c})_{jj} ≤ Cmax for all j, on the event T(δ) we have max_{j∈S^c} (Σ_{S^c|S})_{jj} ξ_j^T M_n ξ_j ≤ Cmax |||M_n|||_2 max_{j∈S^c} ||ξ_j||_2² ≤ Cmax (λ_n²/n) ψ(B∗)(1 + δ) max_{j∈S^c} ||ξ_j||_2², so that P[T_3 ≥ γ | T(δ)] ≤ P[ max_{j∈S^c} ||ξ_j||_2² ≥ 2 t∗(n, B∗) ] with t∗(n, B∗) := γ n / (2 Cmax ψ(B∗)(1 + δ)). [sent-179, score-0.153]

78 Finally, using the union bound and a large-deviation bound for χ² variates, we get the following condition, which is equivalent to the condition of Theorem 1, θℓ1/ℓ2(n, p; B∗) > θcrit(Σ): Lemma 6. [sent-180, score-0.284]

79 Simulations: In this section, we illustrate the sharpness of Theorem 1 and furthermore ascertain how quickly the predicted behavior is observed as n, p, s grow in different regimes, for two regression tasks (i.e., K = 2). [sent-182, score-0.178]

80 In the following simulations, the matrix B∗ of regression coefficients is designed with entries β∗_ij in {−1/√2, 1/√2} to yield a desired value of ψ(B∗). [sent-185, score-0.2]

81 The design matrix X is sampled from the standard Gaussian ensemble. [sent-186, score-0.133]

82 From our analysis, the sample complexity parameter θ 1 / 2 is controlled by the “interference” of irrelevant covariates, and not by the variance of a noise component. [sent-189, score-0.172]

83 We consider linear sparsity with s = αp, for α = 1/8, for various ambient model dimensions p ∈ {32, 256, 1024}. [sent-190, score-0.162]

84 For each value of p, we perform simulations varying the sample size n to match corresponding values of the basic Lasso sample complexity parameter, given by θLas : = n/(2s log(p − s)), in the interval [0. [sent-191, score-0.264]

85 In each case, we solve the blockregularized problem (2) with sample size n = 2θLas s log(p − s) using the regularization parameter λn = log(p − s) (log s)/n. [sent-194, score-0.254]

86 For our construction of matrices B ∗ , we choose both p and the scalings for the sparsity so that the obtained values for s that are multiples of four, and construct the columns Z (1)∗ and Z (2)∗ of the matrix B ∗ = ζ(B ∗ ) from copies of vectors of length 4. [sent-197, score-0.235]

87 Denoting by ⊗ the usual matrix tensor product, we consider: Identical regressions: We set Z^(1)∗ = Z^(2)∗ = (1/√2) 1_s, so that the sparsity-overlap is ψ(B∗) = s. [sent-198, score-0.083]
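
A sketch of one cell of this experiment is below: it builds B∗ for p = 256 from length-4 blocks as described above, sets n and λ_n from the quoted formulas, and generates data from model (1). This is illustrative only: the specific block vectors, the unit-variance noise, and the square root in the λ_n formula (the extracted text drops the radical) are assumptions, and the recovered support would be computed with a group-Lasso solver such as the proximal-gradient sketch given earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 256
s = p // 8                    # linear sparsity s = p/8 (so s is a multiple of four)
theta_las = 1.0               # one grid point of the Lasso control parameter

n = int(round(2 * theta_las * s * np.log(p - s)))   # sample size for this grid point
lam = np.sqrt(np.log(p - s) * np.log(s) / n)        # lambda_n (sqrt form assumed)

# Columns of zeta(B*) built from copies of length-4 blocks (entries +/- 1/sqrt(2)).
block_a = np.ones(4) / np.sqrt(2)
block_b = np.array([1.0, 1.0, -1.0, -1.0]) / np.sqrt(2)

Z1 = np.tile(block_a, s // 4)
Z2 = np.tile(block_b, s // 4)     # orthogonal to Z1 with equal length: psi(B*) = s/2
# (Using Z2 = Z1 instead gives the identical-regressions case, psi(B*) = s.)

B_star = np.zeros((p, 2))
B_star[:s, 0], B_star[:s, 1] = Z1, Z2

X = rng.standard_normal((n, p))                   # standard Gaussian ensemble
Y = X @ B_star + rng.standard_normal((n, 2))      # unit-variance noise (assumed)
# Next step: solve (2) at (X, Y, lam), e.g. with the earlier proximal-gradient
# sketch, and record whether the recovered row support equals {0, ..., s-1}.
```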

88 Figure 1 shows plots of all three cases and the reference Lasso case for the three different values of the ambient dimension and the two types of sparsity described above. [sent-203, score-0.162]

89 Plots of support recovery probability P[Ŝ = S] versus the basic ℓ1 control parameter θLas = n/[2s log(p − s)] for linear sparsity s = p/8, and for increasing values of p ∈ {32, 256, 1024} from left to right. [sent-224, score-0.334]

90 Each graph shows four curves corresponding to the case of independent ℓ1 regularization (pluses), and for ℓ1/ℓ2 regularization, the cases of identical regression (crosses), intermediate angles (nablas), and orthogonal regressions (squares). [sent-225, score-0.5]

91 5 Discussion We studied support union recovery under high-dimensional scaling with the ℓ1/ℓ2 regularization, and showed that its sample complexity is determined by the function ψ(B∗). [sent-230, score-0.59]

92 The latter integrates the sparsity of each univariate regression with the overlap of all the supports and the discrepancies among the parameter vectors being estimated. [sent-231, score-0.422]

93 In favorable cases, for K regressions, the sample complexity for ℓ1/ℓ2 is K times smaller than that of the Lasso. [sent-232, score-0.134]

94 Moreover, this gain is not obtained at the expense of an assumption of shared support over the data. [sent-233, score-0.095]

95 In fact, for standard Gaussian designs, the regularization seems “adaptive” in the sense that it doesn’t perform worse than the Lasso for disjoint supports. [sent-234, score-0.125]

96 Stable recovery of sparse overcomplete representations in the presence of noise. [sent-255, score-0.125]

97 On the asymptotic properties of the group lasso estimator for linear models. [sent-273, score-0.26]

98 Joint covariate selection and joint subspace selection for multiple classification problems. [sent-279, score-0.094]

99 Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. [sent-306, score-0.201]

100 Model selection and estimation in regression with grouped variables. [sent-311, score-0.197]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bs', 0.462), ('ss', 0.367), ('xs', 0.288), ('union', 0.2), ('lasso', 0.191), ('zs', 0.19), ('regressions', 0.153), ('cmax', 0.143), ('sk', 0.134), ('crit', 0.128), ('regularization', 0.125), ('recovery', 0.125), ('regression', 0.117), ('las', 0.115), ('mn', 0.107), ('covariates', 0.096), ('support', 0.095), ('supports', 0.09), ('ambient', 0.086), ('matrix', 0.083), ('dmax', 0.082), ('cmin', 0.082), ('norms', 0.076), ('sparsity', 0.076), ('complexity', 0.073), ('lemma', 0.073), ('univariate', 0.072), ('orthogonal', 0.071), ('jj', 0.064), ('log', 0.064), ('conditions', 0.062), ('uc', 0.062), ('block', 0.062), ('sharpness', 0.061), ('socp', 0.061), ('vs', 0.061), ('sample', 0.061), ('converging', 0.055), ('incoherence', 0.054), ('multivariate', 0.051), ('unique', 0.05), ('design', 0.05), ('row', 0.05), ('theorem', 0.048), ('sign', 0.048), ('selection', 0.047), ('coef', 0.046), ('primal', 0.045), ('berkeley', 0.045), ('harms', 0.045), ('lassos', 0.045), ('witness', 0.045), ('columns', 0.044), ('ces', 0.043), ('condition', 0.042), ('rs', 0.042), ('pessimistic', 0.041), ('event', 0.041), ('dual', 0.04), ('ip', 0.04), ('naive', 0.039), ('exceeds', 0.039), ('simulations', 0.039), ('obozinski', 0.038), ('xb', 0.038), ('info', 0.038), ('regressed', 0.038), ('parameter', 0.038), ('recover', 0.038), ('vj', 0.037), ('triplet', 0.036), ('maxk', 0.036), ('scaling', 0.036), ('group', 0.035), ('min', 0.035), ('intermediate', 0.034), ('estimator', 0.034), ('ensemble', 0.033), ('grouped', 0.033), ('designs', 0.033), ('critical', 0.033), ('ordinary', 0.032), ('copies', 0.032), ('correct', 0.032), ('covariance', 0.031), ('main', 0.03), ('ij', 0.03), ('solution', 0.03), ('zhao', 0.03), ('feasibility', 0.03), ('complement', 0.03), ('size', 0.03), ('schemes', 0.029), ('program', 0.029), ('overlap', 0.029), ('cone', 0.029), ('satisfy', 0.029), ('meaning', 0.029), ('max', 0.028), ('assessed', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 99 nips-2008-High-dimensional support union recovery in multivariate regression

Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan

Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as the sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2(n, p, s) := n/[2ψ(B∗) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B∗) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1 approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

2 0.37169451 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

3 0.25711033 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

4 0.17933299 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

5 0.17078024 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

6 0.14168848 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

7 0.098426759 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

8 0.092963181 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

9 0.089060903 214 nips-2008-Sparse Online Learning via Truncated Gradient

10 0.080146633 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

11 0.078898042 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

12 0.074097253 106 nips-2008-Inferring rankings under constrained sensing

13 0.073656395 194 nips-2008-Regularized Learning with Networks of Features

14 0.073621869 62 nips-2008-Differentiable Sparse Coding

15 0.073423482 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

16 0.069767572 75 nips-2008-Estimating vector fields using sparse basis field expansions

17 0.068880834 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

18 0.068319686 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

19 0.064226419 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

20 0.063840151 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.231), (1, -0.028), (2, -0.16), (3, 0.188), (4, 0.155), (5, 0.009), (6, -0.155), (7, -0.001), (8, -0.092), (9, 0.128), (10, -0.208), (11, -0.138), (12, -0.081), (13, -0.077), (14, -0.078), (15, 0.103), (16, 0.003), (17, -0.062), (18, -0.01), (19, 0.013), (20, -0.022), (21, -0.019), (22, 0.026), (23, 0.003), (24, 0.075), (25, 0.05), (26, 0.058), (27, 0.148), (28, 0.075), (29, 0.05), (30, 0.145), (31, 0.069), (32, -0.061), (33, -0.171), (34, 0.061), (35, -0.005), (36, 0.001), (37, 0.096), (38, -0.006), (39, -0.036), (40, 0.115), (41, 0.149), (42, 0.038), (43, 0.086), (44, -0.023), (45, -0.015), (46, 0.152), (47, 0.019), (48, 0.036), (49, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96861213 99 nips-2008-High-dimensional support union recovery in multivariate regression

Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan

Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as the sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2(n, p, s) := n/[2ψ(B∗) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B∗) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1 approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

2 0.93392372 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

3 0.80042028 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

4 0.69994444 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

5 0.52099508 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

Author: Pierre Garrigues, Laurent E. Ghaoui

Abstract: It has been shown that the problem of 1 -penalized least-square regression commonly referred to as the Lasso or Basis Pursuit DeNoising leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

6 0.49725065 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

7 0.48914093 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

8 0.40564558 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

9 0.3851997 106 nips-2008-Inferring rankings under constrained sensing

10 0.34475473 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

11 0.34088004 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

12 0.33249786 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

13 0.33171102 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

14 0.32424411 85 nips-2008-Fast Rates for Regularized Objectives

15 0.32165152 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling

16 0.32158968 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

17 0.31627357 185 nips-2008-Privacy-preserving logistic regression

18 0.31282485 126 nips-2008-Localized Sliced Inverse Regression

19 0.31245366 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

20 0.29937559 111 nips-2008-Kernel Change-point Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.074), (7, 0.104), (12, 0.056), (15, 0.012), (16, 0.02), (24, 0.045), (27, 0.226), (28, 0.144), (57, 0.05), (59, 0.03), (63, 0.026), (71, 0.028), (77, 0.035), (83, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85779262 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression

Author: Mauricio Alvarez, Neil D. Lawrence

Abstract: We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network. 1

2 0.82586324 137 nips-2008-Modeling Short-term Noise Dependence of Spike Counts in Macaque Prefrontal Cortex

Author: Arno Onken, Steffen Grünewälder, Matthias Munk, Klaus Obermayer

Abstract: Correlations between spike counts are often used to analyze neural coding. The noise is typically assumed to be Gaussian. Yet, this assumption is often inappropriate, especially for low spike counts. In this study, we present copulas as an alternative approach. With copulas it is possible to use arbitrary marginal distributions such as Poisson or negative binomial that are better suited for modeling noise distributions of spike counts. Furthermore, copulas place a wide range of dependence structures at the disposal and can be used to analyze higher order interactions. We develop a framework to analyze spike count data by means of copulas. Methods for parameter inference based on maximum likelihood estimates and for computation of mutual information are provided. We apply the method to our data recorded from macaque prefrontal cortex. The data analysis leads to three findings: (1) copula-based distributions provide significantly better fits than discretized multivariate normal distributions; (2) negative binomial margins fit the data significantly better than Poisson margins; and (3) the dependence structure carries 12% of the mutual information between stimuli and responses. 1

same-paper 3 0.82162505 99 nips-2008-High-dimensional support union recovery in multivariate regression

Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan

Abstract: We study the behavior of block ℓ1/ℓ2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as the sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θℓ1/ℓ2(n, p, s) := n/[2ψ(B∗) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B∗) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to a naive ℓ1 approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

4 0.78671783 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling

Author: Ping Li, Kenneth W. Church, Trevor J. Hastie

Abstract: Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2 , l1 ) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all.” In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances. We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval. 1

5 0.68318397 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

6 0.67185163 194 nips-2008-Regularized Learning with Networks of Features

7 0.66924858 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

8 0.66908228 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

9 0.66641349 62 nips-2008-Differentiable Sparse Coding

10 0.66536105 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.66396624 202 nips-2008-Robust Regression and Lasso

12 0.66258079 143 nips-2008-Multi-label Multiple Kernel Learning

13 0.66106945 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

14 0.66093451 75 nips-2008-Estimating vector fields using sparse basis field expansions

15 0.65937406 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

16 0.65782076 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

17 0.65758848 45 nips-2008-Characterizing neural dependencies with copula models

18 0.65750045 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

19 0.65665138 196 nips-2008-Relative Margin Machines

20 0.65459538 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text