nips nips2008 nips2008-202 knowledge-graph by maker-knowledge-mining

202 nips-2008-Robust Regression and Lasso


Source: pdf

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider robust least-squares regression with feature-wise disturbance. [sent-7, score-0.479]

2 We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). [sent-8, score-0.908]

3 This provides an interpretation of Lasso from a robust optimization perspective. [sent-9, score-0.331]

4 We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. [sent-10, score-0.663]

5 Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. [sent-11, score-0.271]

6 1 Introduction In this paper we consider linear regression problems with least-square error. [sent-13, score-0.203]

7 These methods minimize a weighted sum of the residual norm and a certain regularization term, ‖x‖_2 for Tikhonov regularization and ‖x‖_1 for Lasso. [sent-20, score-0.287]

8 In many of these approaches, the choice of regularization parameters often has no fundamental connection to an underlying noise model [2]. [sent-23, score-0.167]

9 In [11], the authors propose an alternative approach to reducing sensitivity of linear regression, by considering a robust version of the regression problem: they minimize the worst-case residual for the observations under some unknown but bounded disturbances. [sent-24, score-0.516]

10 They show that their robust least squares formulation is equivalent to ℓ2-regularized least squares, and they explore computational aspects of the problem. [sent-25, score-0.423]
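
The equivalence from [11] is easy to sanity-check numerically: with a Frobenius-norm-bounded disturbance, the worst case for a fixed x is attained by a rank-one perturbation aligned with the residual. The sketch below is our own illustration with random data and our own variable names, not code from either paper.

```python
# Hedged sketch: verify that max over ||DA||_F <= rho of ||b - (A + DA)x||_2
# equals ||b - A x||_2 + rho * ||x||_2 for a fixed x (assumes b - Ax != 0 and x != 0).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = rng.standard_normal(5)
rho = 0.3

residual = b - A @ x
u = residual / np.linalg.norm(residual)            # unit vector along the residual
DA = -rho * np.outer(u, x) / np.linalg.norm(x)     # rank-one worst-case disturbance
assert np.isclose(np.linalg.norm(DA, "fro"), rho)

print(np.linalg.norm(b - (A + DA) @ x))            # worst-case residual norm
print(np.linalg.norm(residual) + rho * np.linalg.norm(x))   # l2-regularized objective; matches
```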

11 In that paper, and in most of the subsequent research in this area and the more general area of Robust Optimization (see [12, 13] and references therein) the disturbance is taken to be either row-wise and uncorrelated [14], or given by bounding the Frobenius norm of the disturbance matrix [11]. [sent-26, score-0.695]

12 In this paper we investigate the robust regression problem under more general uncertainty sets, focusing in particular on the case where the uncertainty set is defined by feature-wise constraints, and also the case where features are meaningfully correlated. [sent-27, score-0.789]

13 We prove that all our formulations are computationally tractable. [sent-29, score-0.136]

14 Unlike much of the previous literature, we focus on structural properties of the robust solution. [sent-30, score-0.318]

15 • We formulate the robust regression problem with feature-wise independent disturbances, and show that this formulation is equivalent to a least-square problem with a weighted ℓ1-norm regularization term. [sent-33, score-0.789]

16 Hence, we provide an interpretation for Lasso from a robustness perspective. [sent-34, score-0.262]

17 We generalize the robust regression formulation to loss functions given by an arbitrary norm, and uncertainty sets that allow correlation between disturbances of different features. [sent-36, score-0.968]

18 • We investigate the sparsity properties for the robust regression problem with feature-wise independent disturbances, showing that such formulations encourage sparsity. [sent-37, score-0.753]

19 We thus easily recover standard sparsity results for Lasso using a robustness argument. [sent-38, score-0.382]

20 This also implies a fundamental connection between the feature-wise independence of the disturbance and the sparsity. [sent-39, score-0.365]

21 This allows us to re-prove consistency in a statistical learning setup, using the new robustness tools and formulation we introduce. [sent-41, score-0.503]

22 Throughout the paper, a_i and r_j denote the i-th column and the j-th row of the observation matrix A, respectively; a_{ij} is the (i, j) element of A, hence it is the j-th element of r_i and the i-th element of a_j. [sent-45, score-0.719]

23 2 Robust Regression with Feature-wise Disturbance We show that our robust regression formulation recovers Lasso as a special case. [sent-47, score-0.644]

24 The regression formulation we consider differs from the standard Lasso formulation, as we minimize the norm of the error, rather than the squared norm. [sent-48, score-0.398]
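
To make this difference concrete, the sketch below (ours, with assumed random data; requires cvxpy) compares the two objectives. For a fixed weight c in the norm-of-error formulation, the textbook squared-loss Lasso with lambda = 2·c·‖b − Ax*‖_2 shares the same minimizer (assuming a nonzero optimal residual), so the two formulations trace out the same family of solutions.

```python
# Hedged sketch: relate the norm-loss formulation used here to the squared-loss Lasso.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
c = 0.7

x1 = cp.Variable(5)                                # norm-of-error formulation
cp.Problem(cp.Minimize(cp.norm(b - A @ x1, 2) + c * cp.norm1(x1))).solve()

lam = 2 * c * np.linalg.norm(b - A @ x1.value)     # matching tradeoff for the squared loss
x2 = cp.Variable(5)                                # standard squared-loss Lasso
cp.Problem(cp.Minimize(cp.sum_squares(b - A @ x2) + lam * cp.norm1(x2))).solve()

print(np.round(x1.value, 4))
print(np.round(x2.value, 4))                       # should essentially coincide
```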

25 In addition to more flexible and potentially powerful robust formulations, we prove new results, and give new insight into known results. [sent-51, score-0.325]

26 In Section 3, we show the robust formulation gives rise to new sparsity results. [sent-52, score-0.507]

27 These results (e.g., Theorem 4) fundamentally depend on (and follow from) the robustness argument, which is not found elsewhere in the literature. [sent-55, score-0.235]

28 Then in Section 4, we establish consistency of Lasso directly from the robustness properties of our formulation, thus explaining consistency from a more physically motivated and perhaps more general perspective. [sent-56, score-0.587]

29 1 Formulation Robust linear regression considers the case that the observed matrix A is corrupted by some disturbance. [sent-58, score-0.203]

30 We consider the following min-max formulation: Robust Linear Regression: min_{x∈R^m} max_{ΔA∈U} ‖b − (A + ΔA)x‖_2. (1) [sent-60, score-0.121]

31 Here, U is the set of admissible disturbances of the matrix A. [sent-61, score-0.219]

32 In this section, we consider the specific setup where the disturbance is feature-wise uncorrelated, and norm-bounded for each feature: U ≜ {(δ_1, · · · , δ_m) : ‖δ_i‖_2 ≤ c_i, i = 1, · · · , m}, (2) for given c_i ≥ 0. [sent-62, score-0.543]

33 This formulation recovers the well-known Lasso: Theorem 1. [sent-63, score-0.165]

34 The robust regression problem (1) with the uncertainty set (2) is equivalent to the following ℓ1-regularized regression problem: min_{x∈R^m} ‖b − Ax‖_2 + Σ_{i=1}^m c_i |x_i|. (3) [sent-64, score-1.065]

35 Showing that the robust regression is a lower bound for the regularized regression follows from the standard triangle inequality. [sent-67, score-0.734]

36 If we take c_i = c and normalize a_i for all i, Problem (3) is the well-known Lasso [4, 5]. [sent-69, score-0.286]
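
A quick numerical check of Theorem 1 (our illustration with assumed random data, not the authors' code): for any fixed x, the inner maximum over the feature-wise uncertainty set (2) is attained by pushing each column against x, which reproduces the weighted ℓ1 objective in (3).

```python
# Hedged sketch: for fixed x, max over ||delta_i||_2 <= c_i of ||b - (A + DA)x||_2
# equals ||b - A x||_2 + sum_i c_i |x_i| (assumes b - Ax != 0).
import numpy as np

rng = np.random.default_rng(2)
n, m = 30, 6
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
c = rng.uniform(0.1, 0.5, size=m)                  # per-feature budgets c_i

residual = b - A @ x
u = residual / np.linalg.norm(residual)
DA = -np.outer(u, c * np.sign(x))                  # column i is -c_i * sign(x_i) * u, norm c_i
print(np.linalg.norm(b - (A + DA) @ x))            # worst-case residual
print(np.linalg.norm(residual) + np.sum(c * np.abs(x)))   # weighted l1 objective; matches
```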

37 2 Arbitrary norm and correlated disturbance It is possible to generalize this result to the case where the ℓ2-norm is replaced by an arbitrary norm, and where the uncertainty is correlated from feature to feature. [sent-71, score-0.577]

38 Let ‖·‖_a denote an arbitrary norm. [sent-74, score-0.153]

39 Then the robust regression problem min_{x∈R^m} max_{ΔA∈U_a} ‖b − (A + ΔA)x‖_a, with U_a ≜ {(δ_1, · · · , δ_m) : ‖δ_i‖_a ≤ c_i, i = 1, · · · , m}, is equivalent to the regularized regression problem min_{x∈R^m} ‖b − Ax‖_a + Σ_{i=1}^m c_i |x_i|. [sent-75, score-1.02]

40 Using feature-wise uncorrelated disturbance may lead to overly conservative results. [sent-76, score-0.334]

41 We relax this, allowing the disturbances of different features to be correlated. [sent-77, score-0.22]

42 Consider the following uncertainty set: U ≜ {(δ_1, · · · , δ_m) : f_j(‖δ_1‖_a, · · · , ‖δ_m‖_a) ≤ 0, j = 1, · · · , k}, where the f_j(·) are convex functions. [sent-78, score-0.332]

43 Notice that both k and the f_j can be arbitrary, hence this is a very general formulation and provides us with significant flexibility in designing uncertainty sets and, equivalently, new regression algorithms. [sent-79, score-0.587]

44 The following theorem converts this formulation to a convex and tractable optimization problem. [sent-80, score-0.312]

45 The robust regression problem min_{x∈R^m} max_{ΔA∈U} ‖b − (A + ΔA)x‖_a is equivalent to the following regularized regression problem: min_{λ∈R^k_+, κ∈R^m_+, x∈R^m} ‖b − Ax‖_a + v(λ, κ, x), (4) where v(λ, κ, x) ≜ max_c [ (κ + |x|)^T c − Σ_{j=1}^k λ_j f_j(c) ]. [sent-83, score-1.09]

46 Suppose U = {(δ_1, · · · , δ_m) : ‖(‖δ_1‖_a, · · · , ‖δ_m‖_a)‖_s ≤ l} for a symmetric norm ‖·‖_s; then the resulting regularized regression problem is min_{x∈R^m} ‖b − Ax‖_a + l ‖x‖_s*, where ‖·‖_s* is the dual norm of ‖·‖_s. [sent-85, score-0.462]
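
The dual-norm regularized problems described just above are small convex programs, so they are easy to prototype. A minimal sketch (ours, with assumed data; requires cvxpy): coupling the feature-wise disturbance norms by ‖·‖_s yields the dual-norm regularizer, e.g. an ℓ1 penalty for s = ∞ and an ℓ2 penalty for s = 2.

```python
# Hedged sketch of the dual-norm regularized regression problems.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 8))
b = rng.standard_normal(40)
l = 0.5                                            # overall disturbance budget

x = cp.Variable(8)
loss = cp.norm(b - A @ x, 2)
for name, reg in (("s = inf -> l1 penalty", cp.norm1(x)),
                  ("s = 2   -> l2 penalty", cp.norm(x, 2))):
    cp.Problem(cp.Minimize(loss + l * reg)).solve()
    print(name, np.round(x.value, 3))
```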

47 The robust regression formulation (1) considers disturbances that are bounded in a set, while in practice, often the disturbance is a random variable with unbounded support. [sent-86, score-1.064]

48 In such cases, it is not possible to simply use an uncertainty set that includes all admissible disturbances, and we need to construct a meaningful U based on probabilistic information. [sent-87, score-0.157]

49 In the full version [15] we consider computationally efficient ways to use chance constraints to construct uncertainty sets. [sent-88, score-0.145]

50 3 Sparsity In this section, we investigate the sparsity properties of robust regression (1), and equivalently Lasso. [sent-89, score-0.666]

51 Lasso’s ability to recover sparse solutions has been extensively discussed (cf. [6, 7, 8, 9]); this line of work takes one of two approaches. [sent-90, score-0.132]

52 That is, it assumes that the observations are generated by a (sparse) linear combination of the features, and investigates the asymptotic or probabilistic conditions required for Lasso to correctly recover the generative model. [sent-92, score-0.122]

53 The second approach treats the problem from an optimization perspective, and studies under what conditions a pair (A, b) defines a problem with sparse solutions (e. [sent-93, score-0.125]

54 Substantial research regarding sparsity properties of Lasso can be found in the literature (cf [6, 7, 8, 9, 17, 18, 19, 20] and many others). [sent-99, score-0.154]

55 Such results appear in, e.g., [16], and are used as standard tools in investigating sparsity of Lasso from a statistical perspective. [sent-102, score-0.14]

56 However, a proof exploiting robustness and properties of the uncertainty is novel. [sent-103, score-0.438]

57 Indeed, such a proof shows a fundamental connection between robustness and sparsity, and implies that robustifying with respect to [sent-104, score-0.357]

58 a feature-wise independent uncertainty set might be a plausible way to achieve sparsity for other problems. [sent-107, score-0.231]

59 Given (A, b), let x* be an optimal solution of the robust regression problem: min_{x∈R^m} max_{ΔA∈U} ‖b − (A + ΔA)x‖_2. [sent-109, score-0.664]

60 Now let Ũ ≜ {(δ_1, · · · , δ_m) : ‖δ_j‖_2 ≤ c_j, j ∉ I; ‖δ_i‖_2 ≤ c_i + ℓ_i, i ∈ I}, where I ⊆ {1, · · · , m} indexes features with x*_i = 0 and the ℓ_i ≥ 0 are given. [sent-111, score-0.129]

61 Then, x* is an optimal solution of min_{x∈R^m} max_{ΔA∈Ũ} ‖b − (Ã + ΔA)x‖_2 for any Ã that satisfies ‖a_i − ã_i‖_2 ≤ ℓ_i for i ∈ I, and ã_j = a_j for j ∉ I. [sent-112, score-0.781]

62 We have max_{ΔA∈Ũ} ‖b − (Ã + ΔA)x*‖_2 = max_{ΔA∈Ũ} ‖b − (A + ΔA)x*‖_2 = max_{ΔA∈U} ‖b − (A + ΔA)x*‖_2. [sent-115, score-0.198]

63 ˜ ˜ ˜ For i ∈ I, ai −˜ i ≤ li , and aj = aj for j ∈ I. [sent-116, score-0.439]

64 Therefore, for any fixed x, the following holds: max_{ΔA∈Ũ} ‖b − (Ã + ΔA)x*‖_2 = max_{ΔA∈U} ‖b − (A + ΔA)x*‖_2 ≤ max_{ΔA∈U} ‖b − (A + ΔA)x‖_2 ≤ max_{ΔA∈Ũ} ‖b − (Ã + ΔA)x‖_2. [sent-118, score-0.396]

65 Theorem 4 is established using the robustness argument, and is a direct result of the feature-wise independence of the uncertainty set. [sent-122, score-0.354]
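
Theorem 4 can also be exercised numerically. The sketch below is our own illustration (assumed data and helper names; requires cvxpy): solve the equivalent weighted ℓ1 problem, pick a feature whose weight is (numerically) zero, perturb its column by at most ℓ while enlarging its budget to c + ℓ, and observe that the solution is unchanged up to solver tolerance.

```python
# Hedged illustration of Theorem 4 (not the authors' code).
import cvxpy as cp
import numpy as np

def weighted_lasso(A, b, c):
    x = cp.Variable(A.shape[1])
    obj = cp.norm(b - A @ x, 2) + cp.sum(cp.multiply(c, cp.abs(x)))
    cp.Problem(cp.Minimize(obj)).solve()
    return x.value

rng = np.random.default_rng(4)
n, m = 50, 8
A = rng.standard_normal((n, m))
b = A[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
c = np.full(m, 0.8)                                # feature-wise budgets c_i

x_star = weighted_lasso(A, b, c)
i = int(np.argmin(np.abs(x_star)))                 # a feature with (numerically) zero weight
ell = 0.3
pert = rng.standard_normal(n)
pert *= ell / np.linalg.norm(pert)                 # ||a_i - a~_i||_2 = ell
A_t = A.copy(); A_t[:, i] += pert
c_t = c.copy(); c_t[i] += ell                      # enlarged budget c_i + ell_i
x_new = weighted_lasso(A_t, b, c_t)
print(np.max(np.abs(x_star - x_new)))              # small: x_star remains optimal
```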

66 Consider a generative model¹ b = Σ_{i∈I} w_i a_i + ξ, where I ⊆ {1, · · · , m} and ξ is a random variable. [sent-124, score-0.258]

67 In this case, for a feature i ∈ I, Lasso would assign zero weight as long as there exists a perturbed value of this feature, such that the optimal regression assigned it zero weight. [sent-127, score-0.422]

68 This is also shown in the next corollary, in which we apply Theorem 4 to show that the problem has a sparse solution as long as an incoherence-type property is satisfied (this result is more in line with the traditional sparsity results). [sent-128, score-0.198]

69 If there exists I ⊂ {1, · · · , m} such that for all v ∈ span({a_i, i ∈ I} ∪ {b}) with ‖v‖_2 = 1, we have v^T a_j ≤ c ∀ j ∉ I, then any optimal solution x* satisfies x*_j = 0, ∀ j ∉ I. [sent-131, score-0.301]

70 For j ∉ I, let a_j^= denote the projection of a_j onto the span of {a_i, i ∈ I} ∪ {b}, and let a_j^+ ≜ a_j − a_j^=. [sent-133, score-0.326]

71 Let Â be given by â_i = a_i for i ∈ I and â_i = a_i^+ for i ∉ I. Now let Û ≜ {(δ_1, · · · , δ_m) : ‖δ_i‖_2 ≤ c, i = 1, · · · , m}, and consider the robust regression problem min_{x∈R^m} max_{ΔA∈Û} ‖b − (Â + ΔA)x‖_2, equivalently min_{x∈R^m} ‖b − Âx‖_2 + c Σ_i |x_i|. [sent-135, score-0.856]

72 This is because the columns â_j, j ∉ I, are orthogonal to the span of {â_i, i ∈ I} ∪ {b}. [sent-139, score-0.185]

73 Since ‖a_j − â_j‖_2 = ‖a_j^=‖_2 ≤ c ∀ j ∉ I (and recall that Û = {(δ_1, · · · , δ_m) : ‖δ_i‖_2 ≤ c}), by Theorem 4 we establish the corollary. [sent-141, score-0.272]

74 Suppose there exists I ⊆ {1, · · · , m}, such that for all i ∈ I, ‖a_i‖_2 < c_i. [sent-144, score-0.338]
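
A minimal numerical illustration (ours, with assumed data; requires cvxpy), consistent with this corollary's conclusion: when a column's norm is below the common disturbance budget c, the robust/Lasso solution assigns it zero weight.

```python
# Hedged sketch: columns with ||a_i||_2 < c receive zero weight in the solution of (3).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
n, m, c = 40, 6, 1.0
A = rng.standard_normal((n, m))
A[:, -2:] *= 0.1 / np.linalg.norm(A[:, -2:], axis=0)   # last two columns: norm 0.1 < c
b = A[:, 0] - 2 * A[:, 1] + 0.05 * rng.standard_normal(n)

x = cp.Variable(m)
cp.Problem(cp.Minimize(cp.norm(b - A @ x, 2) + c * cp.norm1(x))).solve()
print(np.round(x.value, 4))                        # last two coordinates come out (numerically) zero
```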

75 ¹ While we are not assuming generative models to establish the results, it is still interesting to see how these results can help in a generative model setup. [sent-146, score-0.154]

76 5 The next theorem shows that sparsity is achieved when a set of features are “almost” linearly dependent. [sent-147, score-0.233]

77 Given I ⊆ {1, · · · , m} such that there exists a non-zero vector (w_i)_{i∈I} satisfying ‖Σ_{i∈I} w_i a_i‖_2 ≤ min_{σ_i∈{−1,+1}} |Σ_{i∈I} σ_i c_i w_i|, there exists an optimal solution x* such that ∃ i ∈ I: x*_i = 0. [sent-150, score-0.625]

78 Given I ⊆ {1, · · · , m}, let A_I ≜ (a_i)_{i∈I} and t ≜ rank(A_I); then there exists an optimal solution x* such that x*_I has at most t non-zero entries. [sent-153, score-0.436]

79 4 Density Estimation and Consistency In this section, we investigate the robust linear regression formulation from a statistical perspective and rederive, using only robustness properties, the fact that Lasso is asymptotically consistent. [sent-158, score-0.938]

80 In the full version [15] we use some of the intermediate results from the consistency proof to show that regularization can be identified with the so-called maxmin expected utility (MMEU) framework, thus tying regularization to a fundamental tenet of decision theory. [sent-160, score-0.288]

81 We restrict our discussion to the case where the magnitude of the allowable uncertainty for all features equals c (i.e., the standard Lasso). [sent-161, score-0.158]

82 We then establish the statistical consistency of Lasso from a distributional robustness argument. [sent-163, score-0.424]

83 Throughout, we use c_n to represent c where there are n samples (we take c_n to zero). [sent-165, score-0.57]

84 Define x(c_n, S_n) ≜ arg min_x { √((1/n) Σ_{i=1}^n (b_i − r_i^T x)^2) + c_n ‖x‖_1 } = arg min_x { √(Σ_{i=1}^n (b_i − r_i^T x)^2) + √n c_n ‖x‖_1 }, and x(P) ≜ arg min_x √( ∫_{b,r} (b − r^T x)^2 dP(b, r) ). [sent-171, score-0.943]

85 In words, x(c_n, S_n) is the solution to Lasso with the tradeoff parameter set to c_n √n, and x(P) is the “true” optimal solution. [sent-172, score-0.349]
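
The estimator x(c_n, S_n) is again a small convex program. A sketch (ours; assumed data, and the schedule c_n = n^{-1/(m+2)}, which satisfies the conditions of the theorem below; requires cvxpy) shows the estimates approaching the generating coefficients as n grows.

```python
# Hedged sketch of x(c_n, S_n): square-root loss plus an l1 penalty of weight c_n.
import cvxpy as cp
import numpy as np

def x_hat(R, b, c_n):
    n, m = R.shape
    x = cp.Variable(m)
    loss = cp.norm(b - R @ x, 2) / np.sqrt(n)      # sqrt of the average squared residual
    cp.Problem(cp.Minimize(loss + c_n * cp.norm1(x))).solve()
    return x.value

rng = np.random.default_rng(6)
x_true = np.array([1.0, 0.0, -0.5, 0.0])
m = x_true.size
for n in (100, 1000, 10000):
    R = rng.standard_normal((n, m))
    b = R @ x_true + 0.2 * rng.standard_normal(n)
    c_n = n ** (-1.0 / (m + 2))                    # c_n -> 0 while n * c_n**(m+1) -> infinity
    print(n, np.round(x_hat(R, b, c_n), 3))        # estimates approach x_true as n grows
```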

86 This technique is of interest because the standard techniques to establish consistency in statistical learning, including VC dimension and algorithm stability, often work only for a limited range of algorithms. [sent-176, score-0.189]

87 In contrast, a much wider range of algorithms have robustness interpretations, allowing a unified approach to prove their consistency. [sent-179, score-0.284]

88 Let {c_n} be such that c_n ↓ 0 and lim_{n→∞} n (c_n)^{m+1} = ∞. [sent-181, score-0.285]

89 The key to the proof is establishing a connection between robustness and kernel density estimation. [sent-185, score-0.359]

90 Step 1: For a given x, we show that the robust regression loss over the training data is equal to the worst-case expected generalization error. [sent-186, score-0.479]

91 Step 2: Next we show that robust regression has a form like that in the left-hand side above. [sent-190, score-0.479]

92 Indeed, consider the following kernel estimator: given samples (b_i, r_i)_{i=1}^n, define h_n(b, r) ≜ (n c^{m+1})^{-1} Σ_{i=1}^n K((b − b_i, r − r_i)/c), where K(x) ≜ I_{[−1,+1]^{m+1}}(x) / 2^{m+1}. (5) [sent-192, score-0.36]

93 Step 3: Combining the last two steps, and using the fact that ∫_{b,r} |h_n(b, r) − h(b, r)| d(b, r) goes to zero almost surely when c_n ↓ 0 and n c_n^{m+1} ↑ ∞ (since h_n(·) is a kernel density estimate of the true density), the consistency result follows. [sent-194, score-0.442]
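
For concreteness, here is a sketch (ours; variable names assumed) of the box-kernel estimator in (5): a uniform kernel on [−1, +1]^{m+1} with bandwidth c applied to the joint samples (b_i, r_i).

```python
# Hedged sketch of the kernel density estimator h_n from (5).
import numpy as np

def h_n(point, samples, c):
    """Box-kernel density estimate at `point`; `samples` has shape (n, m+1)."""
    n, d = samples.shape                           # d = m + 1
    inside = np.all(np.abs((point - samples) / c) <= 1.0, axis=1)
    return inside.sum() / (n * (2.0 * c) ** d)

rng = np.random.default_rng(7)
samples = rng.standard_normal((5000, 3))           # e.g. m = 2 features plus the response b
print(h_n(np.zeros(3), samples, c=0.3))            # roughly the N(0, I_3) density at 0, ~0.063
```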

94 5 Conclusion In this paper, we consider robust regression with a least-squares-error loss, and extend the results of [11] to a broader range of disturbance sets and hence regularization schemes. [sent-203, score-0.479]

95 (In [11], Tikhonov regularization is shown to be equivalent to a robust formulation for a Frobenius-norm-bounded disturbance set.) [sent-205, score-1.197]

96 A special case of our formulation recovers the well-known Lasso algorithm, and we obtain an interpretation of Lasso from a robustness perspective. [sent-206, score-0.427]

97 We consider more general robust regression formulations, allowing correlation between the feature-wise noise, and we show that this too leads to tractable convex optimization problems. [sent-207, score-0.59]

98 We exploit the new robustness formulation to give direct proofs of sparseness and consistency for Lasso. [sent-208, score-0.523]

99 As our results follow from robustness properties, this suggests that they may be far more general than Lasso, and that, in particular, consistency and sparseness may be properties one can obtain more generally from robustified algorithms. [sent-209, score-0.446]

100 Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. [sent-250, score-0.119]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lasso', 0.415), ('cn', 0.285), ('disturbance', 0.285), ('robust', 0.276), ('robustness', 0.235), ('regression', 0.203), ('disturbances', 0.181), ('ai', 0.157), ('ax', 0.148), ('aj', 0.141), ('ci', 0.129), ('consistency', 0.121), ('formulation', 0.119), ('uncertainty', 0.119), ('rm', 0.116), ('sparsity', 0.112), ('sn', 0.108), ('ri', 0.104), ('bi', 0.102), ('tikhonov', 0.095), ('corollary', 0.087), ('regularization', 0.087), ('formulations', 0.087), ('fj', 0.086), ('theorem', 0.082), ('norm', 0.076), ('establish', 0.068), ('max', 0.066), ('dp', 0.064), ('pn', 0.063), ('minx', 0.063), ('transactions', 0.06), ('caramanis', 0.059), ('ncm', 0.059), ('wi', 0.058), ('aij', 0.055), ('min', 0.055), ('uncertain', 0.054), ('exists', 0.052), ('regularized', 0.052), ('ua', 0.052), ('sparse', 0.051), ('hn', 0.05), ('prove', 0.049), ('uncorrelated', 0.049), ('sparseness', 0.048), ('qc', 0.047), ('recovers', 0.046), ('solutions', 0.046), ('zi', 0.045), ('investigates', 0.044), ('montreal', 0.044), ('austin', 0.044), ('texas', 0.044), ('span', 0.044), ('generative', 0.043), ('proof', 0.042), ('cf', 0.042), ('properties', 0.042), ('tractable', 0.042), ('connection', 0.041), ('convex', 0.041), ('density', 0.041), ('mcgill', 0.04), ('features', 0.039), ('ith', 0.039), ('weight', 0.039), ('fundamental', 0.039), ('letters', 0.038), ('admissible', 0.038), ('generalize', 0.038), ('residual', 0.037), ('electrical', 0.036), ('zero', 0.036), ('ij', 0.035), ('recover', 0.035), ('solution', 0.035), ('frobenius', 0.035), ('vc', 0.034), ('therein', 0.034), ('investigate', 0.033), ('arbitrary', 0.032), ('ieee', 0.03), ('element', 0.03), ('almost', 0.03), ('designing', 0.03), ('notice', 0.03), ('perspective', 0.03), ('hence', 0.03), ('optimal', 0.029), ('column', 0.029), ('outline', 0.028), ('tools', 0.028), ('equivalent', 0.028), ('optimization', 0.028), ('lim', 0.027), ('interpretation', 0.027), ('feature', 0.027), ('full', 0.026), ('bhattacharyya', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

2 0.24891238 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.20395224 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

4 0.19787848 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

Author: Pierre Garrigues, Laurent E. Ghaoui

Abstract: It has been shown that the problem of 1 -penalized least-square regression commonly referred to as the Lasso or Basis Pursuit DeNoising leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

5 0.18135171 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

6 0.17933299 99 nips-2008-High-dimensional support union recovery in multivariate regression

7 0.15937367 214 nips-2008-Sparse Online Learning via Truncated Gradient

8 0.13157827 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

9 0.12548488 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

10 0.12469028 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

11 0.10243719 106 nips-2008-Inferring rankings under constrained sensing

12 0.10142195 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

13 0.098253384 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

14 0.091150559 216 nips-2008-Sparse probabilistic projections

15 0.090618469 62 nips-2008-Differentiable Sparse Coding

16 0.088644773 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation

17 0.088139638 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

18 0.082137898 118 nips-2008-Learning Transformational Invariants from Natural Movies

19 0.080957592 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

20 0.079470508 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.253), (1, -0.039), (2, -0.188), (3, 0.165), (4, 0.141), (5, 0.028), (6, -0.199), (7, -0.058), (8, -0.135), (9, 0.145), (10, -0.152), (11, -0.113), (12, -0.063), (13, -0.02), (14, -0.009), (15, 0.049), (16, -0.08), (17, -0.046), (18, -0.042), (19, 0.001), (20, 0.01), (21, 0.007), (22, -0.093), (23, 0.035), (24, 0.048), (25, -0.047), (26, 0.088), (27, 0.072), (28, 0.022), (29, 0.104), (30, 0.117), (31, 0.031), (32, 0.057), (33, 0.054), (34, 0.053), (35, -0.093), (36, -0.005), (37, -0.102), (38, 0.03), (39, -0.026), (40, 0.055), (41, 0.058), (42, -0.024), (43, -0.044), (44, 0.029), (45, 0.033), (46, 0.005), (47, -0.009), (48, -0.062), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97503597 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

2 0.76751989 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.76246661 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of 1 / ∞ -regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of 1 / ∞ -regularized quadratic programming, considering both consistency rates in ∞ -norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ∞ error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of 1 / ∞ -regularization is qualitatively similar to that of ordinary 1 -regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the 1 / ∞ -regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of 1,∞ -regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

4 0.75184858 99 nips-2008-High-dimensional support union recovery in multivariate regression

Author: Guillaume R. Obozinski, Martin J. Wainwright, Michael I. Jordan

Abstract: We study the behavior of block 1 / 2 regularization for multivariate regression, where a K-dimensional response vector is regressed upon a fixed set of p covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as sample size n tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by θ 1 / 2 (n, p, s) : = n/[2ψ(B ∗ ) log(p − s)] exceeds a critical threshold. Here n is the sample size, p is the ambient dimension of the regression model, s is the size of the union of supports, and ψ(B ∗ ) is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the K-regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block 1 / 2 regularization for multivariate regression never harms performance relative to a naive 1 -approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems. 1

5 0.66764057 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Author: Tong Zhang

Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1

6 0.63984114 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

7 0.62019378 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

8 0.5759297 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

9 0.54587847 214 nips-2008-Sparse Online Learning via Truncated Gradient

10 0.50934088 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

11 0.4441126 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

12 0.44203594 216 nips-2008-Sparse probabilistic projections

13 0.42028117 62 nips-2008-Differentiable Sparse Coding

14 0.4022994 106 nips-2008-Inferring rankings under constrained sensing

15 0.40226293 185 nips-2008-Privacy-preserving logistic regression

16 0.39793131 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

17 0.3964929 194 nips-2008-Regularized Learning with Networks of Features

18 0.39582616 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

19 0.39526942 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

20 0.39235261 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.131), (7, 0.093), (12, 0.03), (15, 0.018), (28, 0.161), (34, 0.202), (57, 0.039), (59, 0.044), (63, 0.04), (71, 0.035), (77, 0.06), (83, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83484685 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

2 0.80519456 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning

Author: Duy Nguyen-tuong, Jan R. Peters, Matthias Seeger

Abstract: Learning in real-time applications, e.g., online approximation of the inverse dynamics model for model-based robot control, requires fast online regression techniques. Inspired by local learning, we propose a method to speed up standard Gaussian process regression (GPR) with local GP models (LGP). The training data is partitioned in local regions, for each an individual GP model is trained. The prediction for a query point is performed by weighted estimation using nearby local models. Unlike other GP approximations, such as mixtures of experts, we use a distance based measure for partitioning of the data and weighted prediction. The proposed method achieves online learning and prediction in real-time. Comparisons with other non-parametric regression methods show that LGP has higher accuracy than LWPR and close to the performance of standard GPR and ν-SVR. 1

3 0.7930603 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

Author: David Sontag, Amir Globerson, Tommi S. Jaakkola

Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1

4 0.74582827 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

5 0.74477321 75 nips-2008-Estimating vector fields using sparse basis field expansions

Author: Stefan Haufe, Vadim V. Nikulin, Andreas Ziehe, Klaus-Robert Müller, Guido Nolte

Abstract: We introduce a novel framework for estimating vector fields using sparse basis field expansions (S-FLEX). The notion of basis fields, which are an extension of scalar basis functions, arises naturally in our framework from a rotational invariance requirement. We consider a regression setting as well as inverse problems. All variants discussed lead to second-order cone programming formulations. While our framework is generally applicable to any type of vector field, we focus in this paper on applying it to solving the EEG/MEG inverse problem. It is shown that significantly more precise and neurophysiologically more plausible location and shape estimates of cerebral current sources from EEG/MEG measurements become possible with our method when comparing to the state-of-the-art. 1

6 0.73945671 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

7 0.7386803 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

8 0.73689139 143 nips-2008-Multi-label Multiple Kernel Learning

9 0.73653269 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

10 0.73478913 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

11 0.73402381 196 nips-2008-Relative Margin Machines

12 0.73337615 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

13 0.73318118 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

14 0.73286599 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

15 0.7328552 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

16 0.73261064 85 nips-2008-Fast Rates for Regularized Objectives

17 0.73189574 228 nips-2008-Support Vector Machines with a Reject Option

18 0.73177826 194 nips-2008-Regularized Learning with Networks of Features

19 0.72982085 179 nips-2008-Phase transitions for high-dimensional joint support recovery

20 0.72606319 239 nips-2008-Tighter Bounds for Structured Estimation