nips nips2009 nips2009-20 knowledge-graph by maker-knowledge-mining

20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers


Source: pdf

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A unified framework for high-dimensional analysis of M -estimators with decomposable regularizers Sahand Negahban Department of EECS UC Berkeley sahand n@eecs. [sent-1, score-0.371]

2 In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. [sent-14, score-0.499]

3 The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. [sent-15, score-0.42]

4 We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. [sent-16, score-0.19]

5 Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. [sent-17, score-0.621]

6 In such settings, a common approach to estimating model parameters is through the use of a regularized M-estimator, in which some loss function (e. [sent-21, score-0.282]

7 Such estimators may also be interpreted from a Bayesian perspective as maximum a posteriori estimates, with the regularizer reflecting prior information. [sent-24, score-0.314]

8 In this paper, we study such regularized M -estimation procedures, and attempt to provide a unifying framework that both recovers some existing results and provides 1 new results on consistency and convergence rates under high-dimensional scaling. [sent-25, score-0.335]

9 The first class is that of sparse vector models; we consider both the case of “hard-sparse” models which involve an explicit constraint on the number of non-zero model parameters, and also a class of “weak-sparse” models in which the ordered coefficients decay at a certain rate. [sent-27, score-0.17]

10 For the case of sparse regression, a popular regularizer is the ℓ1 norm of the parameter vector, which is the sum of the absolute values of the parameters. [sent-31, score-0.485]

11 A number of researchers have studied the Lasso [15, 3] as well as the closely related Dantzig selector [2] and provided conditions on various aspects of its behavior, including ℓ2-error bounds [7, 1, 21, 2] and model selection consistency [22, 19, 6, 16]. [sent-32, score-0.293]

12 For generalized linear models (GLMs) and exponential family models, estimators based on ℓ1-regularized maximum likelihood have also been studied, including results on risk consistency [18] and model selection consistency [11]. [sent-33, score-0.355]

13 A body of work has focused on the case of estimating Gaussian graphical models, including convergence rates in Frobenius and operator norm [14], and results on operator norm and model selection consistency [12]. [sent-34, score-0.699]

14 Motivated by inference problems involving block-sparse matrices, other researchers have proposed block-structured regularizers [17, 23], and more recently, high-dimensional consistency results have been obtained for model selection and parameter consistency [4, 8]. [sent-35, score-0.371]

15 In this paper, we derive a single main theorem, and show how we are able to rederive a wide range of known results on high-dimensional consistency, as well as some novel ones, including estimation error rates for low-rank matrices, sparse matrices, and “weakly”-sparse vectors. [sent-36, score-0.231]

16 2 Problem formulation and some key properties In this section, we begin with a precise formulation of the problem, and then develop some key properties of the regularizer and loss function. [sent-38, score-0.365]

17 We consider the problem of estimating θ∗ from the data Z_1^n, and in order to do so, we consider the following class of regularized M-estimators. [sent-49, score-0.236]

18 We then consider the regularized M-estimator given by θ̂ ∈ arg min_{θ∈Rp} { L(θ; Z_1^n) + λn r(θ) }, (1) where λn > 0 is a user-defined regularization penalty. [sent-52, score-0.238]
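As an illustration of how an estimator of the form (1) might be computed in practice, here is a minimal proximal-gradient sketch in Python/NumPy (not part of the paper); the function names, step size, iteration count, and the soft-thresholding example are all illustrative assumptions.

```python
import numpy as np

def proximal_gradient(grad_loss, prox_reg, p, lam, step=0.1, n_iters=1000):
    """Sketch of a solver for min_theta L(theta) + lam * r(theta): alternate a
    gradient step on the loss with the proximal map of the regularizer."""
    theta = np.zeros(p)
    for _ in range(n_iters):
        theta = prox_reg(theta - step * grad_loss(theta), step * lam)
    return theta

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (the l1 regularizer)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Toy instance: squared loss (1/2n)||y - X theta||_2^2 with an l1 penalty.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

grad = lambda th: X.T @ (X @ th - y) / n    # gradient of the squared loss
theta_hat = proximal_gradient(grad, soft_threshold, p, lam=0.05)
```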

19 Throughout the paper, we assume that the loss function L is convex and differentiable, and that the regularizer r is a norm. [sent-54, score-0.365]

20 As discussed earlier, high-dimensional parameter estimation is made possible by structural constraints on θ∗ such as sparsity, and we will see that the behavior of the error is determined by how well these constraints are captured by the regularization function r(·). [sent-57, score-0.312]

21 We now turn to the properties of the regularizer r and the loss function L that underlie our analysis. [sent-58, score-0.365]

22 This notion is a formalization of the manner in which the regularization function imposes constraints on possible parameter vectors θ∗ ∈ Rp . [sent-61, score-0.205]

23 Take some arbitrary inner product space H, and let ‖·‖₂ denote the norm induced by the inner product. [sent-63, score-0.218]

24 For a given subspace A and vector u ∈ H, we let πA(u) := argmin_{v∈A} ‖u − v‖₂ denote the orthogonal projection of u onto A. [sent-65, score-0.252]

25 We let V = {(A, B) | A ⊆ B ⊥ } be a collection of subspace pairs. [sent-66, score-0.374]

26 For a given statistical model, our goal is to construct subspace collections V such that for any given θ∗ from our model class, there exists a pair (A, B) ∈ V with ‖πA(θ∗)‖₂ ≈ ‖θ∗‖₂, and ‖πB(θ∗)‖₂ ≈ 0. [sent-67, score-0.341]

27 Of most interest to us are subspace pairs (A, B) in which this property holds but the subspace A is relatively small and B is relatively large. [sent-68, score-0.536]

28 As a first concrete (but toy) example, consider the model class of all vectors θ∗ ∈ Rp , and the subspace collection T that consists of a single subspace pair (A, B) = (Rp , 0). [sent-71, score-0.789]

29 We refer to this choice (V = T ) as the trivial subspace collection. [sent-72, score-0.3]

30 For any given subset S and its complement S^c, let us define the subspaces A(S) = {θ ∈ Rp | θ_{S^c} = 0} and B(S) = {θ ∈ Rp | θ_S = 0}, and the s-sparse subspace collection S = {(A(S), B(S)) | S ⊂ {1, . [sent-79, score-0.472]

31 With this set-up, we say that the regularizer r is decomposable with respect to a given subspace pair (A, B) if r(u + z) = r(u) + r(z) for all u ∈ A and z ∈ B. [sent-85, score-0.78]

32 The regularizer r is decomposable with respect to a given subspace collection V, meaning that it is decomposable for each subspace pair (A, B) ∈ V. [sent-87, score-1.365]

33 Note that any regularizer is decomposable with respect to the trivial subspace collection T = {(Rp , 0)}. [sent-88, score-0.892]

34 It will be of more interest to us when the regularizer decomposes with respect to a larger collection V that includes subspace pairs (A, B) in which A is relatively small and B is relatively large. [sent-89, score-0.633]

35 Consider a model involving s-sparse regression vectors θ∗ ∈ Rp , and recall the definition of the s-sparse subspace collection S discussed above. [sent-92, score-0.496]

36 We claim that the ℓ1-norm regularizer r(u) = ‖u‖₁ is decomposable with respect to S. [sent-93, score-0.47]
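A quick numerical sanity check of this decomposability claim (a sketch, not from the paper; the support set and random draws are arbitrary choices): take u supported on S and z supported on the complement S^c and verify r(u + z) = r(u) + r(z) for the ℓ1 norm.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
S = np.array([0, 3, 7])                    # an arbitrary support set, |S| = 3
on_S = np.zeros(p, dtype=bool); on_S[S] = True

u = rng.standard_normal(p) * on_S          # u in A(S): zero off the support
z = rng.standard_normal(p) * ~on_S         # z in B(S): zero on the support

lhs = np.abs(u + z).sum()                  # r(u + z) for r = l1 norm
rhs = np.abs(u).sum() + np.abs(z).sum()    # r(u) + r(z)
assert np.isclose(lhs, rhs)                # equality holds exactly
```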

37 We can define an inner product on such matrices via ⟨⟨Θ, Σ⟩⟩ = trace(Θ^T Σ), and the induced (Frobenius) norm is |||Θ|||_F = √(Σ_{i=1}^k Σ_{j=1}^m Θ_{ij}²). [sent-97, score-0.283]

38 For a given subset S, we can define the subspace pair A(S) = {Θ ∈ Rk×m | Θ_i = 0 for all i ∈ S^c} and B(S) = (A(S))^⊥. For some fixed s ≤ k, we then consider the collection V = {(A(S), B(S)) | S ⊂ {1, . [sent-102, score-0.462]

39 For any q ∈ [1, ∞], now suppose that the regularizer is the ℓ1/ℓq matrix norm, given by r(Θ) = Σ_{i=1}^k [ Σ_{j=1}^m |Θ_{ij}|^q ]^{1/q}, corresponding to applying the ℓq norm to each row and then taking the ℓ1 norm of the result. [sent-106, score-0.591]

40 It can be seen that the regularizer r(Θ) = |||Θ|||1,q is decomposable with respect to the collection V. [sent-107, score-0.592]
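The same check goes through numerically for the ℓ1/ℓq norm when two matrices have disjoint row supports (a sketch, not from the paper; the dimensions, the choice q = 2, and the row sets are arbitrary):

```python
import numpy as np

def l1q_norm(Theta, q):
    """|||Theta|||_{1,q}: apply the l_q norm to each row, then sum (l1) the results."""
    return np.sum(np.sum(np.abs(Theta) ** q, axis=1) ** (1.0 / q))

rng = np.random.default_rng(2)
k, m, q = 6, 4, 2
S = [0, 2]                                                       # rows allowed to be nonzero in A(S)
U = np.zeros((k, m)); U[S, :] = rng.standard_normal((2, m))      # U in A(S)
Z = np.zeros((k, m)); Z[[1, 3, 4, 5], :] = rng.standard_normal((4, m))  # Z in B(S)

assert np.isclose(l1q_norm(U + Z, q), l1q_norm(U, q) + l1q_norm(Z, q))
```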

41 The estimation of low-rank matrices arises in various contexts, including principal component analysis, spectral clustering, collaborative filtering, and matrix completion. [sent-109, score-0.252]

42 In particular, consider the class of matrices Θ ∈ Rk×m that have rank r ≤ min{k, m}. [sent-110, score-0.175]

43 For a given pair of r-dimensional subspaces U ⊆ Rk and V ⊆ Rm, we define a pair of subspaces A(U, V) and B(U, V) of Rk×m as follows: A(U, V) := {Θ ∈ Rk×m | row(Θ) ⊆ V, col(Θ) ⊆ U}, and B(U, V) := {Θ ∈ Rk×m | row(Θ) ⊆ V^⊥, col(Θ) ⊆ U^⊥}. (3a) [sent-112, score-0.312]

44 Now suppose that we regularize with the nuclear norm r(Θ) = |||Θ|||1 , corresponding to the sum of the singular values of the matrix Θ. [sent-115, score-0.463]

45 It can be shown that the nuclear norm is decomposable with respect to V. [sent-116, score-0.483]

46 Indeed, since any pair of matrices M ∈ A(U, V) and M′ ∈ B(U, V) have orthogonal row and column spaces, we have |||M + M′|||₁ = |||M|||₁ + |||M′|||₁ (e. [sent-117, score-0.205]
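This additivity is easy to confirm numerically (a sketch, not from the paper; the dimensions and random subspaces are arbitrary): build two matrices with mutually orthogonal row and column spaces and compare nuclear norms.

```python
import numpy as np

def nuclear_norm(A):
    return np.linalg.svd(A, compute_uv=False).sum()     # sum of singular values

rng = np.random.default_rng(3)
k, m, r = 8, 6, 2
Qk = np.linalg.qr(rng.standard_normal((k, 2 * r)))[0]   # orthonormal columns in R^k
Qm = np.linalg.qr(rng.standard_normal((m, 2 * r)))[0]   # orthonormal columns in R^m

# M has column/row spaces inside the first r directions; M2 inside the complementary ones.
M  = Qk[:, :r] @ rng.standard_normal((r, r)) @ Qm[:, :r].T
M2 = Qk[:, r:] @ rng.standard_normal((r, r)) @ Qm[:, r:].T

assert np.isclose(nuclear_norm(M + M2), nuclear_norm(M) + nuclear_norm(M2))
```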

47 Thus, we have demonstrated various models and regularizers in which decomposability is satisfied with interesting subspace collections V. [sent-120, score-0.569]

48 We now show that decomposability has important consequences for the error ∆ = θ̂ − θ∗, where θ̂ ∈ Rp is any optimal solution of the regularized M-estimation procedure (1). [sent-121, score-0.323]

49 In order to state a lemma that captures this fact, we need to define the dual norm of the regularizer, given by r∗(v) := sup_{u ≠ 0} ⟨u, v⟩ / r(u). [sent-122, score-0.175]

50 For the regularizers of interest, the dual norm can be obtained via some easy calculations. [sent-123, score-0.276]

51 Similarly, given a matrix Θ ∈ Rk×m and the nuclear norm regularizer r(Θ) = |||Θ|||1 , we have r∗ (Θ) = |||Θ|||2 , corresponding to the operator norm (or maximal singular value). [sent-125, score-0.833]
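Both dual-norm facts can be confirmed numerically (a sketch, not from the paper; the sizes and random draws are arbitrary): the supremum defining r∗ is attained at a signed coordinate vector in the ℓ1 case, and at the rank-one matrix built from the top singular vectors in the nuclear-norm case.

```python
import numpy as np

rng = np.random.default_rng(4)

# Dual of the l1 norm is the l_infinity norm: <u, v>/||u||_1 is maximized by a
# signed coordinate vector placed at the largest entry of v.
v = rng.standard_normal(12)
j = np.argmax(np.abs(v))
u = np.zeros(12); u[j] = np.sign(v[j])
assert np.isclose(u @ v / np.abs(u).sum(), np.abs(v).max())

# Dual of the nuclear norm is the operator norm: <<M, Theta>>/|||M|||_1 is
# maximized by the rank-one matrix built from the top singular vectors of Theta.
Theta = rng.standard_normal((5, 7))
U, s, Vt = np.linalg.svd(Theta)
M = np.outer(U[:, 0], Vt[0])                 # |||M|||_1 = 1 for this rank-one matrix
assert np.isclose(np.trace(M.T @ Theta), s[0])
```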

52 This property plays an essential role in our definition of restricted strong convexity and subsequent analysis. [sent-130, score-0.268]

53 (As a trivial example, consider a loss function that is identically zero. [sent-134, score-0.184]

54 In the high-dimensional setting, where the number of parameters p may be much larger than the sample size, the strong convexity assumption need not be satisfied. [sent-136, score-0.179]

55 As a simple example, consider the usual linear regression model y = Xθ∗ + w, where y ∈ Rn is the response vector, θ∗ ∈ Rp is the unknown parameter vector, X ∈ Rn×p is the design matrix, and w ∈ Rn is a noise vector, with i. [sent-137, score-0.182]

56 Herein lies the utility of Lemma 1: it guarantees that the error ∆ must lie within a restricted set, so that we only need the loss function to be strongly convex for a limited set of directions. [sent-143, score-0.227]

57 Given some subset C ⊆ Rp and error norm d(·), we say that the loss function L satisfies restricted strong convexity (RSC) (with respect to d(·)) with parameter γ(L) > 0 over C if L(θ∗ + ∆) − L(θ∗) − ⟨∇L(θ∗), ∆⟩ ≥ γ(L) d²(∆) for all ∆ ∈ C. [sent-145, score-0.552]

58 (4) In the statement of our results, we will be interested in loss functions that satisfy RSC over sets C(A, B, ε) that are indexed by a subspace pair (A, B) and a tolerance ε ≥ 0 as follows: C(A, B, ε) := {∆ ∈ Rp | r(πB(∆)) ≤ 3 r(π_{B⊥}(∆)) + 4 r(π_{A⊥}(θ∗)), d(∆) ≥ ε}. [sent-146, score-0.515]

59 (5) In the special case of least-squares regression with hard sparsity constraints, the RSC condition corresponds to a lower bound on the sparse eigenvalues of the Hessian matrix X^T X, and is essentially equivalent to a restricted eigenvalue condition introduced by Bickel et al. [sent-147, score-0.579]
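The least-squares specialization is easy to see numerically (a sketch with simulated data, not from the paper): for L(θ) = (1/2n)‖y − Xθ‖₂², the left-hand side of the RSC inequality equals ‖X∆‖₂²/(2n) exactly, so RSC is purely a statement about the design matrix over the set C.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 30
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def loss(theta):                              # L(theta) = (1/2n) ||y - X theta||_2^2
    r = y - X @ theta
    return r @ r / (2 * n)

grad_star = -X.T @ (y - X @ theta_star) / n   # gradient of the loss at theta*
Delta = rng.standard_normal(p)                # an arbitrary error direction

gap = loss(theta_star + Delta) - loss(theta_star) - grad_star @ Delta
assert np.isclose(gap, np.linalg.norm(X @ Delta) ** 2 / (2 * n))
```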

60 3 Convergence rates We are now ready to state a general result that provides bounds and hence convergence rates for the error d(θ̂ − θ∗). [sent-149, score-0.248]

61 For a given subspace collection V, suppose that the regularizer r is decomposable, and consider the regularized M-estimator (1) with λn ≥ 2 r∗(∇L(θ∗)). [sent-154, score-0.873]

62 Then, for any pair of subspaces (A, B) ∈ V and tolerance ε ≥ 0 such that the loss function L satisfies restricted strong convexity over C(A, B, ε), we have d(θ̂ − θ∗) ≤ max{ ε, (2/γ(L)) Ψ(B⊥) λn + √( (2 λn/γ(L)) r(π_{A⊥}(θ∗)) ) }. [sent-155, score-0.579]

63 1 Bounds for linear regression Consider the standard linear regression model y = Xθ∗ + w, where θ∗ ∈ Rp is the regression vector, X ∈ Rn×p is the design matrix, and w ∈ Rn is a noise vector. [sent-165, score-0.32]

64 Without any structural constraints on θ∗, we can apply Theorem 1 with the trivial subspace collection T = {(Rp, 0)} to establish a rate ‖θ̂ − θ∗‖₂ = O(σ √(p/n)) for ridge regression, which holds as long as X is full-rank (and hence requires n > p). [sent-167, score-0.542]

65 1 Lasso estimates of hard sparse models More precisely, let us consider estimating an s-sparse regression vector θ∗ by solving the Lasso program θ̂ ∈ arg min_{θ∈Rp} { (1/2n) ‖y − Xθ‖₂² + λn ‖θ‖₁ }. [sent-171, score-0.271]
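A minimal simulation of this program (a sketch, not from the paper), assuming scikit-learn is available; its Lasso objective (1/(2n))‖y − Xθ‖₂² + α‖θ‖₁ matches the program above with α = λn, and the particular λn below simply follows the σ√(log p / n) scaling discussed next.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p, s, sigma = 200, 500, 5, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0     # hard-sparse target with support size s
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 4 * sigma * np.sqrt(np.log(p) / n)           # regularization on the sigma*sqrt(log p / n) scale
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("l2 error:          ", np.linalg.norm(theta_hat - theta_star))
print("sqrt(s log p / n): ", np.sqrt(s * np.log(p) / n))
```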

66 Recall the definition of the s-sparse subspace collection S from Section 2. [sent-173, score-0.374]

67 For this problem, let us set ε = 0 so that the restricted strong convexity set (5) reduces to C(A, B, 0) = {∆ ∈ Rp | ‖∆_{S^c}‖₁ ≤ 3 ‖∆_S‖₁}. [sent-175, score-0.268]

68 Establishing restricted strong convexity for the least-squares loss is equivalent to ensuring the following bound on the design matrix: ‖Xθ‖₂²/n ≥ γ(L) ‖θ‖₂² for all θ ∈ Rp such that ‖θ_{S^c}‖₁ ≤ 3 ‖θ_S‖₁. [sent-176, score-0.473]
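Because condition (8) only involves directions in a cone, a crude Monte Carlo check is possible (a sketch, not from the paper; the sampling scheme explores only part of the cone, so the printed value is an optimistic estimate of γ(L)):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, s = 100, 300, 5
X = rng.standard_normal((n, p))
S, Sc = np.arange(s), np.arange(s, p)       # support and its complement

ratios = []
for _ in range(2000):
    theta = np.zeros(p)
    theta[S] = rng.standard_normal(s)
    v = rng.standard_normal(p - s)
    budget = 3 * np.abs(theta[S]).sum() * rng.uniform()  # enforce ||theta_Sc||_1 <= 3 ||theta_S||_1
    theta[Sc] = v * budget / np.abs(v).sum()
    ratios.append(np.linalg.norm(X @ theta) ** 2 / (n * theta @ theta))

print("smallest sampled value of ||X theta||^2 / (n ||theta||^2):", min(ratios))
```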

69 (8) As mentioned previously, this condition is essentially the same as the restricted eigenvalue condition developed by Bickel et al. [sent-177, score-0.249]

70 [10] show that condition (8) holds with high probability for various random ensembles of Gaussian matrices with non-i. [sent-180, score-0.249]

71 Suppose that the true vector θ∗ ∈ Rp is exactly s-sparse with support S, and that the design matrix X satisfies condition (8). [sent-195, score-0.207]

72 As noted previously, the ℓ1-regularizer is decomposable for the sparse subspace collection S, while condition (8) ensures that RSC holds for all sets C(A, B, 0) with (A, B) ∈ S. [sent-198, score-0.777]

73 Under the column normalization condition on the design matrix X and the sub-Gaussian nature of the noise, it follows that ‖X^T w/n‖_∞ ≤ 4σ √(log p / n) with high probability. [sent-201, score-0.25]

74 2 Lasso estimates of weak sparse models We now consider models that satisfy a weak sparsity assumption. [sent-207, score-0.296]

75 Accordingly, we consider the s-sparse subspace collection S with subsets of size s = Rq λn^{−q}. [sent-212, score-0.404]

76 We also assume that the design matrix X satisfies a condition of the form ‖Xv‖₂/√n ≥ κ1 ‖v‖₂ − κ2 √(log p / n) ‖v‖₁ for constants κ1, κ2 > 0. [sent-214, score-0.182]

77 We consider estimating θ∗ from observations {(xi, yi)}_{i=1}^n by ℓ1-regularized maximum likelihood θ̂ ∈ arg min_{θ∈Rp} { −(1/n) ⟨θ, Σ_{i=1}^n yi xi⟩ + (1/n) Σ_{i=1}^n a(⟨θ, xi⟩) + λn ‖θ‖₁ }. [sent-224, score-0.289]
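As one concrete GLM instance (a sketch, not from the paper; the cumulant a(t) = log(1 + e^t) gives logistic regression, and the function names, step size, and iteration count are illustrative assumptions), this program can be solved with the same proximal-gradient idea used earlier:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_logistic(X, y, lam, step=0.1, n_iters=2000):
    """Proximal-gradient sketch for the l1-regularized logistic GLM:
    -(1/n) sum_i y_i <theta, x_i> + (1/n) sum_i a(<theta, x_i>) + lam ||theta||_1,
    with cumulant a(t) = log(1 + exp(t))."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))   # a'(<theta, x_i>): the conditional mean
        theta = soft_threshold(theta - step * X.T @ (mu - y) / n, step * lam)
    return theta

rng = np.random.default_rng(8)
n, p = 300, 100
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:4] = 1.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))
theta_hat = l1_logistic(X, y, lam=0.05)
```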

78 For analysis, we again use the s-sparse subspace collection S and ε = 0. [sent-227, score-0.374]

79 With these choices, it can be verified that an appropriate version of the RSC will hold if the second derivative a is strongly convex, and the design matrix X satisfies a version of the condition (8). [sent-228, score-0.267]

80 3 Bounds for sparse matrices In this section, we consider some extensions of our results to estimation of regression matrices. [sent-235, score-0.351]

81 Various authors have proposed extensions of the Lasso based on regularizers that have more structure than the 1 norm (e. [sent-236, score-0.247]

82 As a loss function, we use the squared Frobenius norm L(Θ) = (1/2n) |||Y − XΘ|||²_F, and as a regularizer, we use the ℓ1/ℓq matrix norm for some q ≥ 1, which takes the form |||Θ|||_{1,q} = Σ_{i=1}^k ‖(Θ_{i1}, .

83 , k} where |S| = s, and that the design matrix X ∈ Rn×k satisfies condition (8). [sent-252, score-0.207]

84 For q = 1, solving the group Lasso is identical to solving a Lasso problem with sparsity sm and ambient dimension km, and the resulting upper bound (8σ/γ(L)) √( (s m log(km))/n ) reflects this fact (compare to Corollary 1). [sent-258, score-0.191]
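For general q, proximal methods for this family of estimators rely on the row-wise shrinkage map of the ℓ1/ℓq norm; a minimal sketch for the common q = 2 case (not from the paper; the function name and test values are illustrative assumptions) is:

```python
import numpy as np

def group_soft_threshold(Theta, tau):
    """Proximal map of tau * |||Theta|||_{1,2}: shrink each row toward zero in l2
    norm; rows whose norm is below tau become exactly zero."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * Theta

rng = np.random.default_rng(9)
Theta = rng.standard_normal((6, 4))
shrunk = group_soft_threshold(Theta, tau=2.0)
print("rows set to zero:", np.where(np.linalg.norm(shrunk, axis=1) == 0)[0])
```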

85 In related work, [4] established a bound of the form O( (c/γ(L)) √( (m log k + sm)/n ) ), which is equivalent to the bound stated above apart from a term of order √m. [sent-261, score-0.201]

86 4 Bounds for estimating low rank matrices Finally, we consider the implications of our main result for the problem of estimating low-rank matrices. [sent-264, score-0.269]

87 To illustrate our main theorem in this context, let us consider the following instance of low-rank matrix learning. [sent-266, score-0.173]

88 Given a low-rank matrix Θ∗ ∈ Rk×m, suppose that we are given n noisy observations of the form Yi = ⟨⟨Xi, Θ∗⟩⟩ + Wi, where Wi ∼ N(0, 1) and ⟨⟨A, B⟩⟩ := trace(A^T B). [sent-267, score-0.17]

89 The following regularized M-estimator can be considered in order to estimate the desired low-rank matrix Θ∗: min_{Θ∈Rk×m} { (1/2n) Σ_{i=1}^n (Yi − ⟨⟨Xi, Θ⟩⟩)² + λn |||Θ|||₁ }, (14) where the regularizer, |||Θ|||₁, is the nuclear norm, or the sum of the singular values of Θ. [sent-269, score-0.365]
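Proximal methods for a program like (14) rely on the proximal map of the nuclear norm, which soft-thresholds the singular values; a minimal sketch (not from the paper; the dimensions and threshold are illustrative assumptions) is:

```python
import numpy as np

def svd_soft_threshold(Theta, tau):
    """Proximal map of tau * |||Theta|||_1: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(10)
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))  # a rank-3 matrix
noisy = low_rank + 0.1 * rng.standard_normal((8, 6))
denoised = svd_soft_threshold(noisy, tau=0.5)
print("rank after thresholding:", np.linalg.matrix_rank(denoised))
```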

90 Recall the rank-r collection V defined for low-rank matrices in Section 2. [sent-270, score-0.223]

91 Thus, for restricted strong convexity to hold it can be shown that the design matrices Xi must satisfy (1/n) Σ_{i=1}^n ⟨⟨Xi, ∆⟩⟩² ≥ γ(L) |||∆|||²_F for all ∆ such that |||πB(∆)|||₁ ≤ 3 |||π_{B⊥}(∆)|||₁, (15) and satisfy the appropriate analog of the column-normalization condition. [sent-274, score-0.572]

92 Suppose that the true matrix Θ∗ has rank r ≪ min(k, m), and that the design matrices {Xi} satisfy condition (15). [sent-283, score-0.402]

93 If we solve the regularized M-estimator (14) with λn = 16 √( (k + m)/n ), then with probability at least 1 − c1 exp(−c2 (k + m)), we have |||Θ̂ − Θ∗|||_F ≤ (16/γ(L)) √( (rk + rm)/n ). [sent-284, score-0.354]

94 Note that if rank(Θ∗) = r, then |||Θ∗|||₁ ≤ √r |||Θ∗|||_F so that Ψ(B⊥) = √(2r), since the subspace B(U, V)⊥ consists of matrices with rank at most 2r. [sent-286, score-0.397]

95 Standard analysis gives that the dual norm to ||| · |||1 is the operator norm, ||| · |||2 . [sent-288, score-0.221]

96 Applying this observation, we may construct a bound on the operator norm of ∇L(Θ∗) = −(1/n) Σ_{i=1}^n Wi Xi by taking a supremum over unit vectors u ∈ Rk and v ∈ Rm and using the fact that |||vu^T|||₂ = 1. [sent-289, score-0.223]

97 Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. [sent-358, score-0.203]

98 Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. [sent-382, score-0.331]

99 Model selection and estimation in regression with grouped variables. [sent-426, score-0.202]

100 Model selection consistency of the lasso selection in high-dimensional linear regression. [sent-431, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rp', 0.406), ('regularizer', 0.259), ('subspace', 0.252), ('rsc', 0.214), ('decomposable', 0.211), ('lasso', 0.175), ('rq', 0.173), ('rk', 0.171), ('raskutti', 0.155), ('norm', 0.146), ('regularized', 0.129), ('nuclear', 0.126), ('collection', 0.122), ('decomposability', 0.119), ('convexity', 0.113), ('loss', 0.106), ('corollary', 0.105), ('consistency', 0.104), ('regularizers', 0.101), ('matrices', 0.101), ('subspaces', 0.098), ('satis', 0.09), ('restricted', 0.089), ('uc', 0.087), ('sm', 0.084), ('regression', 0.084), ('suppose', 0.081), ('sparse', 0.08), ('condition', 0.08), ('regularization', 0.079), ('col', 0.078), ('bickel', 0.077), ('sparsity', 0.076), ('annals', 0.074), ('rn', 0.071), ('design', 0.068), ('strong', 0.066), ('wainwright', 0.066), ('rates', 0.063), ('berkeley', 0.062), ('selection', 0.062), ('xi', 0.059), ('lounici', 0.059), ('nlog', 0.059), ('sahand', 0.059), ('matrix', 0.059), ('pair', 0.058), ('dantzig', 0.058), ('bq', 0.058), ('constraints', 0.057), ('estimation', 0.056), ('estimators', 0.055), ('rm', 0.054), ('ravikumar', 0.054), ('es', 0.054), ('eecs', 0.052), ('bounds', 0.051), ('singular', 0.051), ('satisfy', 0.05), ('tolerance', 0.049), ('wi', 0.049), ('trivial', 0.048), ('estimating', 0.047), ('theorem', 0.047), ('operator', 0.046), ('row', 0.046), ('km', 0.046), ('rank', 0.044), ('consequences', 0.043), ('frobenius', 0.043), ('log', 0.043), ('selector', 0.04), ('department', 0.04), ('convergence', 0.039), ('vectors', 0.038), ('concrete', 0.037), ('statistics', 0.037), ('illustrate', 0.037), ('various', 0.036), ('inner', 0.036), ('meinshausen', 0.036), ('analog', 0.035), ('exp', 0.033), ('yi', 0.032), ('error', 0.032), ('rows', 0.032), ('uni', 0.032), ('unless', 0.032), ('holds', 0.032), ('bound', 0.031), ('recovery', 0.031), ('collections', 0.031), ('structural', 0.031), ('notion', 0.031), ('models', 0.03), ('observations', 0.03), ('consider', 0.03), ('establishing', 0.03), ('version', 0.03), ('dual', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.

2 0.22130913 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model.

3 0.20996012 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f∗ : Rp → R that has an additive decomposition of the form f∗(X1, . . . , Xp) = Σ_{j∈S} h∗_j(Xj), where each component function h∗_j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f∗(X) corrupted with additive white Gaussian noise where the covariate vectors (X1, X2, X3, ..., Xp) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

4 0.16037875 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

5 0.14981289 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

6 0.14355926 147 nips-2009-Matrix Completion from Noisy Entries

7 0.13742086 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

8 0.13286042 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

9 0.12792525 238 nips-2009-Submanifold density estimation

10 0.12710157 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

11 0.12138502 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

12 0.12070522 157 nips-2009-Multi-Label Prediction via Compressed Sensing

13 0.11321653 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

14 0.096045077 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

15 0.09528438 55 nips-2009-Compressed Least-Squares Regression

16 0.094480403 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

17 0.090787202 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

18 0.090491213 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

19 0.088980421 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

20 0.087460615 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.273), (1, 0.211), (2, -0.003), (3, 0.145), (4, -0.016), (5, -0.077), (6, 0.121), (7, -0.232), (8, 0.148), (9, 0.056), (10, 0.021), (11, -0.007), (12, -0.002), (13, -0.013), (14, 0.094), (15, 0.051), (16, 0.008), (17, -0.09), (18, 0.015), (19, 0.158), (20, -0.059), (21, 0.105), (22, 0.057), (23, 0.028), (24, -0.001), (25, -0.025), (26, -0.011), (27, 0.005), (28, -0.009), (29, -0.022), (30, -0.006), (31, 0.067), (32, -0.041), (33, 0.002), (34, 0.022), (35, -0.065), (36, -0.06), (37, -0.008), (38, 0.041), (39, -0.02), (40, -0.025), (41, -0.003), (42, 0.07), (43, -0.001), (44, -0.014), (45, 0.074), (46, -0.021), (47, 0.041), (48, -0.038), (49, 0.127)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97011256 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.

2 0.86279935 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model.

3 0.75660074 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

4 0.7188068 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f∗ : Rp → R that has an additive decomposition of the form f∗(X1, . . . , Xp) = Σ_{j∈S} h∗_j(Xj), where each component function h∗_j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f∗(X) corrupted with additive white Gaussian noise where the covariate vectors (X1, X2, X3, ..., Xp) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

5 0.71368659 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

Author: Han Liu, Xi Chen

Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1

6 0.66069257 55 nips-2009-Compressed Least-Squares Regression

7 0.64928472 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

8 0.63740069 147 nips-2009-Matrix Completion from Noisy Entries

9 0.63678849 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

10 0.62494361 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

11 0.61781752 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

12 0.61562836 238 nips-2009-Submanifold density estimation

13 0.61252958 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

14 0.61007202 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

15 0.5977425 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

16 0.59544891 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

17 0.59096009 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

18 0.56368542 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

19 0.55522186 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

20 0.54155481 76 nips-2009-Efficient Learning using Forward-Backward Splitting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.112), (24, 0.115), (25, 0.131), (35, 0.026), (36, 0.164), (39, 0.036), (58, 0.189), (61, 0.016), (71, 0.054), (81, 0.017), (86, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96054751 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.

2 0.90789223 147 nips-2009-Matrix Completion from Noisy Entries

Author: Raghunandan Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. 1

3 0.89915532 94 nips-2009-Fast Learning from Non-i.i.d. Observations

Author: Ingo Steinwart, Andreas Christmann

Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1

4 0.89493978 180 nips-2009-On the Convergence of the Concave-Convex Procedure

Author: Gert R. Lanckriet, Bharath K. Sriperumbudur

Abstract: The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms like sparse support vector machines (SVMs), transductive SVMs, sparse principal component analysis, etc. Though widely used in many applications, the convergence behavior of CCCP has not gotten a lot of specific attention. Yuille and Rangarajan analyzed its convergence in their original paper, however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), its proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwill’s global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwill’s theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization, generalized alternating minimization, etc. In this paper, we provide a rigorous analysis of the convergence of CCCP by addressing these questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.

5 0.89421028 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

Author: Han Liu, Xi Chen

Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1

6 0.8930741 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

7 0.89246064 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

8 0.89210588 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

9 0.89196259 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

10 0.89077216 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

11 0.88968241 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

12 0.88437641 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

13 0.88141435 55 nips-2009-Compressed Least-Squares Regression

14 0.88139069 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

15 0.87926716 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

16 0.87890971 122 nips-2009-Label Selection on Graphs

17 0.87805593 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

18 0.87804484 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

19 0.87713391 157 nips-2009-Multi-Label Prediction via Compressed Sensing

20 0.87617213 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem