jmlr jmlr2012 jmlr2012-2 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space where recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulation results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
Reference: text
sentIndex sentText sentNum sentScore
1 But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. [sent-10, score-0.316]
2 Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. [sent-11, score-0.358]
3 The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. [sent-12, score-0.316]
4 In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. [sent-13, score-0.673]
5 In recent years, the lasso (Tibshirani, 1996; Chen et al. [sent-37, score-0.316]
6 In the regression problem, the lasso estimator is defined by βlasso = argmin_β ||Y − Xβ||_2^2 + λ ||β||_1, (2) where ||β||_1 = ∑_j |βj| and λ ≥ 0 is a tuning parameter that must be specified. [sent-39, score-0.445]
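To make (2) concrete, here is a minimal coordinate-descent sketch in Python for the lasso objective above; the function names lasso_cd and soft_threshold are ours (not from the paper), the columns of X are assumed standardized, and this is meant as an illustration for small examples rather than the very-large-p regimes discussed in the paper.

import numpy as np

def soft_threshold(z, gamma):
    # soft-thresholding operator: sign(z) * max(|z| - gamma, 0)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    # cyclic coordinate descent for min_beta ||Y - X beta||_2^2 + lam * ||beta||_1 (eq. (2))
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # ||x_j||_2^2 for each column
    resid = Y.astype(float)              # residual Y - X beta (beta starts at zero)
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]   # partial residual with coordinate j removed
            beta[j] = soft_threshold(X[:, j] @ resid, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta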
7 The lasso gives rise to a convex optimization problem and thus is computationally tractable even for moderately large problems. [sent-40, score-0.316]
8 Gradient descent algorithms for the lasso are faster in practice, but have the same computational complexity. [sent-43, score-0.316]
9 The motivation for our study is that when p is very large, finding the lasso solutions is computationally demanding. [sent-44, score-0.316]
10 To compute the marginal regression estimates for variable selection, we begin by computing the marginal regression coefficients which, assuming X has been standardized, are α ≡ X^T Y. (3) [sent-46, score-0.59]
11 The procedure requires O(np) operations, two orders faster than the lasso for p ≫ n. [sent-48, score-0.316]
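The selector in (3) is equally short; the sketch below (function name ours) computes α = X^T Y and keeps either the s coordinates with the largest |αj| or those exceeding a threshold t; both variants appear in the discussion below.

import numpy as np

def marginal_regression_select(X, Y, s=None, t=None):
    alpha = X.T @ Y                              # marginal regression coefficients, eq. (3); O(np) work
    if s is not None:
        return np.argsort(-np.abs(alpha))[:s]    # indices of the s largest |alpha_j|
    return np.flatnonzero(np.abs(alpha) > t)     # otherwise keep {j : |alpha_j| > t}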
12 Our goal is to reopen the case for marginal regression as a plausible alternative to the lasso for large problems. [sent-53, score-0.59]
13 The lasso has the advantage of providing exact reconstruction for a somewhat larger class of coefficients, but marginal regression has a better tolerance for collinearity and is easier to tune. [sent-58, score-0.718]
14 • In the fixed design, noise-free, random effects (β random) case, we give conditions under which marginal regression gives exact reconstruction of |sgn β| with overwhelming probability. [sent-60, score-0.545]
15 We show that both the lasso and marginal regression, properly calibrated, perform similarly in each region. [sent-69, score-0.507]
16 Our simulations show that marginal regression and the lasso perform similarly over a range of parameters in realistic models. [sent-72, score-0.59]
17 We will begin by specifying conditions on C, ρ, λ, and t such that in the noise-free case, exact reconstruction of β is possible for the lasso or marginal regression, for all coefficient vectors for which the (non-zero) signal coefficients βS ∈ Mρ. [sent-97, score-0.7]
18 For the noise-free case, Wainwright (2006, Lemma 1) proves that Conditions E, I, and J are necessary and sufficient for exact reconstruction of the sign vector, that is, for the existence of a lasso solution β̂ such that sgn β̂ = sgn β. [sent-112, score-0.668]
19 Theorem 1 In the noise-free case, Conditions E’ (or E), I’ (or I), and J’ imply that for all βS ∈ Mρ, there exists a lasso solution β̂ with sgn(β̂) = sgn(β). [sent-128, score-0.316]
20 Theorem 2 Condition F is necessary and sufficient for exact reconstruction to be possible for some t > 0 with marginal regression. [sent-145, score-0.319]
21 3 Comparison of the Conditions for Exact Recovery of Sign Vector in the Noise-Free Case In this subsection, we use several simple examples to get insight into how the exact-recovery conditions for the lasso and marginal regression compare. [sent-154, score-0.621]
22 • (Example 3) When CSS = I, the lasso conditions are relatively weaker. [sent-156, score-0.347]
23 • (Example 4) Although the conditions for marginal regression do not hold uniformly over any Mρ , they have the advantage that they do not require invertibility of CSS and hence are less sensitive to small eigenvalues. [sent-157, score-0.305]
24 The interior of the parallelogram and that of the hexagon are the regions of (a1 , a2 ) satisfying the conditions of the lasso and marginal regression, respectively; see Example 1. [sent-170, score-0.648]
25 For each row of CNS , the conditions for the lasso and marginal regression are similar to those above, though more complicated. [sent-174, score-0.658]
26 In the special case where βS ∝ 1S, Condition I for the lasso becomes |CNS CSS^{-1} 1S| ≤ 1N, where the inequality should be interpreted as holding component-wise, and the condition for marginal regression (Condition F) is max{|CNS 1S|} ≤ min{|CSS 1S|}. [sent-176, score-0.672]
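For concreteness, both special-case conditions quoted in the last sentence can be checked numerically; the sketch below (function name ours) takes the p × p Gram matrix C and the index set S of signal coordinates, and glosses over strict versus non-strict inequalities.

import numpy as np

def check_special_case_conditions(C, S):
    p = C.shape[0]
    S = np.asarray(S)
    N = np.setdiff1d(np.arange(p), S)            # indices of the noise coordinates
    C_SS, C_NS = C[np.ix_(S, S)], C[np.ix_(N, S)]
    ones_S = np.ones(len(S))
    # Condition I for the lasso (beta_S proportional to 1_S): |C_NS C_SS^{-1} 1_S| <= 1 componentwise
    lasso_ok = bool(np.all(np.abs(C_NS @ np.linalg.solve(C_SS, ones_S)) <= 1.0))
    # Condition F for marginal regression: max |C_NS 1_S| <= min |C_SS 1_S|
    mr_ok = bool(np.max(np.abs(C_NS @ ones_S)) <= np.min(np.abs(C_SS @ ones_S)))
    return lasso_ok, mr_ok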
27 Under these conditions, the lasso gives exact reconstruction, but Condition F can fail. [sent-182, score-0.389]
28 Choosing λ < ρ, we have Conditions E’, I, and J’ satisfied, showing by Theorem 1 that the lasso gives exact reconstruction. [sent-191, score-0.389]
29 It follows that the conditions for the lasso are weaker in this case. [sent-192, score-0.347]
30 Respectively, the conditions for the lasso and marginal regression require (6) and |(a, 1S)| ≤ |∑_{i=1}^{s} ci λi ξi|. [sent-209, score-0.743]
31 The condition for the lasso is roughly |(a, ξ1 )| ≤ c1 λ1 , which is rather restrictive since λ1 is small. [sent-216, score-0.398]
32 The Figures suggest that as λ1 gets smaller, the region corresponding to the lasso shrinks substantially, while that corresponding to marginal regression remains the same. [sent-221, score-0.628]
33 As a result, the study of the exact reconstruction conditions for the lasso and marginal regression in this more complicated model can be reduced to a low-dimensional setting, like those discussed here. [sent-225, score-0.749]
34 As c varies, the regions for marginal regression remain the same, while the regions for the lasso get substantially smaller. [sent-239, score-0.682]
35 The following lemma says that, if s is known and Condition F’ holds, then marginal regression is able to fully recover the support S with high probability. [sent-257, score-0.322]
36 As c varies, the regions for marginal regression remain the same, but those for the lasso get substantially smaller as λ1(c) decreases. [sent-261, score-0.636]
37 We estimate s by sn = sn(X, Y, p) = max{1 ≤ k ≤ p : δn(k) ≥ σn √(2 log n)} (in the case where δn(k) < σn √(2 log n) for all k, we define sn = 1). [sent-277, score-0.407]
38 Theorem 5 says that the tuning parameter for marginal regression (i. [sent-306, score-0.343]
39 In comparison, how to set the tuning parameter λ for the lasso has been a longstanding open problem in the literature. [sent-309, score-0.362]
40 The estimator was shown to be consistent with σn^2 in rather general situations, but unfortunately it is computationally more expensive than either the lasso or marginal regression. [sent-316, score-0.507]
41 With that being said, we conclude this section by mentioning that both the lasso and marginal regression have their strengths and weaknesses, and there is no clear winner between the two in general settings. [sent-318, score-0.618]
42 Introduce gij(t) = Eπ[e^{t u·(xi, xj)}] − 1 and ḡi(t) = ∑_{j≠i} gij(t), where the random variable u ∼ π and (xi, xj) denotes the inner product of xi and xj. [sent-342, score-0.404]
43 • The lasso yields exact variable selection if s < (1 + max_{i≠j} |Cij|) / (2 max_{i≠j} |Cij|). [sent-373, score-0.463]
44 • Marginal regression yields exact variable selection if s < c / (2 max_{i≠j} |Cij|) for some constant c ∈ (0, 1), and the nonzero coordinates of β have comparable magnitudes (i. [sent-374, score-0.338]
45 In fact, in order for either the lasso or marginal regression to achieve exact variable selection, it is required that max_{i≠j} |Cij| ≤ O(1/s). In other words, all off-diagonal coordinates of the Gram matrix C need to be no greater than O(1/s). [sent-386, score-0.829]
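A quick empirical version of this requirement (function name ours; the normalization C = X^T X / n assumes standardized columns): compute the Gram matrix, take its largest off-diagonal entry in absolute value, and compare it with a constant multiple of 1/s.

import numpy as np

def max_offdiagonal_gram(X):
    n, _ = X.shape
    C = X.T @ X / n                            # Gram matrix (unit diagonal if columns are standardized)
    off = np.abs(C - np.diag(np.diag(C)))      # zero out the diagonal
    return off.max()

# usage sketch: the requirement above amounts to max_offdiagonal_gram(X) <= const / s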
46 If there are constants c1 > 0 and c2 ∈ (0, 1/2) such that (bn/an) · max_{i≠j} |Cij| ≤ c1 / log(p^(n)), lim_{n→∞} (µn^(1)(πn)/an) mn(X^(n)) ≤ c2, and lim_{n→∞} (µn^(2)(πn)/an^2) vn^2(X^(n)) log(p^(n)) = 0, (12) then lim_{n→∞} An(an/2, εn, ḡ^(n); X^(n), πn) = 0, and Condition F″ holds. [sent-401, score-0.437]
47 In conclusion, if we shift our attention from the worst-case scenario to the average scenario, and relax our aim from exact variable selection to exact variable selection with probability ≈ 1, then the condition required for success, Condition F″, is much more relaxed than the Incoherence Condition. [sent-425, score-0.376]
48 With εn calibrated as above, the most interesting range for τn is O(√(2 log p)): when τn ≫ √(2 log p), exact variable selection can be easily achieved by either the lasso or marginal regression. [sent-453, score-0.738]
49 With these calibrations, we can rewrite dn^*(β; ε, π) = dn^*(β; εn, τn). [sent-458, score-0.33]
50 In the region of Exact Recovery, both the lasso and marginal regression yield exact recovery with high probability. [sent-461, score-0.798]
51 In the region of Almost Full Recovery, it is impossible to achieve exact variable selection with high probability, but the Hamming distances of both the lasso and marginal regression are ≪ pεn. [sent-462, score-0.743]
52 Let βmr be the estimate of β using marginal regression with threshold tn = ((ϑ + r)/(2√r)) √(2 log p) if r > ϑ, and tn = √(2r log p) if r < ϑ, (20) where r is some constant ∈ (ϑ, 1) (note that in the case of r < ϑ, the choice of tn is not necessarily unique). [sent-471, score-1.045]
53 As p → ∞, the Hamming distance of marginal regression with the threshold tn given in (20) satisfies dn^*(βmr^(n); εn, τn) ≤ L(n) p^{1−(ϑ+r)^2/(4r)} if r ≥ ϑ, and ≤ (1 + o(1)) · p^{1−ϑ} if 0 < r < ϑ. [sent-475, score-0.668]
54 As p → ∞, the Hamming distance of the lasso with the tuning parameter λn = 2tn, where tn is given in (20), satisfies dn^*(βlasso^(n); εn, τn) ≤ L(n) p^{1−(ϑ+r)^2/(4r)} if r ≥ ϑ, and ≤ (1 + o(1)) · p^{1−ϑ} if 0 < r < ϑ. [sent-484, score-0.756]
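To make these two statements concrete in simulations, the sketch below (helper names ours) computes the threshold tn of (20), the corresponding lasso tuning parameter λn = 2tn, and the Hamming sign error used as the loss; the piecewise form of tn should be read as our reconstruction of (20) and treated as an assumption.

import numpy as np

def mr_threshold(theta, r, p):
    # threshold t_n from (20), as reconstructed above (treated here as an assumption)
    if r > theta:
        return (theta + r) / (2.0 * np.sqrt(r)) * np.sqrt(2.0 * np.log(p))
    return np.sqrt(2.0 * r * np.log(p))

def hamming_sign_error(beta_hat, beta):
    # number of coordinates with an incorrect sign (the loss in the two bounds above)
    return int(np.sum(np.sign(beta_hat) != np.sign(beta)))

# lasso tuning parameter used in the second bound: lam_n = 2.0 * mr_threshold(theta, r, p)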
55 In the Region of Exact Recovery, the Hamming distances for both marginal regression and the lasso are algebraically small. [sent-490, score-0.621]
56 Therefore, except for a probability that is algebraically small, both marginal regression and the lasso give exact recovery. [sent-491, score-0.694]
57 In the Region of Almost Full Recovery, the Hamming distances of both marginal regression and the lasso are much smaller than the number of relevant variables (which is ≈ pεn). [sent-492, score-0.59]
58 In this region, the optimal Hamming distance is algebraically large, so for any variable selection procedure, the probability of exact recovery is algebraically small. [sent-495, score-0.306]
60 Table 1: Comparison of the lasso and marginal regression for different choices of (a1, a2) and k. [sent-519, score-0.8]
61 It was shown in Wainwright (2006) that there are constants c2 > c1 > 0 such that in the region of {0 < ϑ < 1, r > c2}, the lasso yields exact variable selection with overwhelming probability, and that in the region of {0 < ϑ < 1, r < c1}, no procedure could yield exact variable selection. [sent-523, score-0.654]
62 Simulations and Examples We conducted a small-scale simulation study to compare the performance of the lasso and marginal regression. [sent-533, score-0.507]
63 In this experiment, we compare the performance of the lasso and marginal regression with the noiseless linear model Y = Xβ. [sent-542, score-0.59]
64 Table 2: Comparison of the lasso and marginal regression for different choices of (c, a2, a3). [sent-587, score-0.624]
65 Figure 5: Critical values of exact recovery for the lasso (dashed) and marginal regression (solid). [sent-619, score-0.76]
66 For each combination of these parameters, we generate data and compare the Hamming errors of the lasso and marginal regression, where for each method, the tuning parameters are set ideally. [sent-641, score-0.553]
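Operationally, "tuning parameters are set ideally" can be read as minimizing the Hamming error over a grid of tuning values with knowledge of the true β; the sketch below (function names and the simple data generator are ours) illustrates that protocol for a generic selector select_fn(X, Y, t) that returns the selected indices (for example, a small wrapper around marginal_regression_select above).

import numpy as np

def simulate_noiseless(n, p, eps, tau, rng):
    # stand-in data generator: Gaussian design, eps-sparse beta with common strength tau
    X = rng.standard_normal((n, p)) / np.sqrt(n)     # columns roughly standardized
    beta = tau * rng.binomial(1, eps, size=p)
    return X, X @ beta, beta                         # noiseless model Y = X beta

def ideal_hamming(select_fn, grid, X, Y, beta):
    # "ideally tuned": smallest Hamming error over the grid of tuning parameters
    true_support = set(np.flatnonzero(beta).tolist())
    best = len(beta)
    for t in grid:
        selected = set(np.asarray(select_fn(X, Y, t)).tolist())
        best = min(best, len(selected ^ true_support))   # support errors = sign errors (signals here are positive)
    return best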
67 The lasso yields exact recovery, while marginal regression, in each of the four blocks where the corresponding δi is 1, correctly recovers the stronger signal and mistakenly kills the weaker one. [sent-651, score-0.614]
68 The primary goal of this experiment is to investigate how different choices of c affect the performance of the lasso and marginal regression. [sent-658, score-0.588]
69 3 predict that the performance of the lasso gets increasingly unsatisfactory as c increases, while that of marginal regression stays more or less the same. [sent-661, score-0.59]
70 For each combination of these parameters, we apply both the lasso and marginal regression and obtain the Hamming errors of both methods, where similarly, the tuning parameters for each method are set ideally. [sent-673, score-0.636]
71 For any ik, 1 ≤ k ≤ s/2, if we restrict Y to {ik − 1, ik, ik + 1, ik + 2} and call the resulting vector y, then y = Aα, where A is the 4 × 4 matrix satisfying A(i, j) = 1{i = j} + a · 1{|i − j| = 1}, 1 ≤ i, j ≤ 4, and α is the 4 × 1 vector such that α1 = α4 = 0, α2 = τ, and α3 = dτ. [sent-705, score-0.355]
72 In this simple model, the performance of the lasso and marginal regression can be similarly analyzed as in Section 2. [sent-706, score-0.59]
73 Now, for each combination of (d, ϑ), we use exhaustive search to determine the smallest r such that the lasso or marginal regression, respectively, yields exact recovery in 50 repetitions of the simulation (similarly, the tuning parameters of each method are set ideally). [sent-708, score-0.806]
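The exhaustive search over r described here is a short loop; the sketch below uses a placeholder callback recovers_exactly(r) (our own name) that is assumed to run one simulated data set at parameter r, apply the ideally tuned procedure, and report whether the support was recovered exactly.

def smallest_r_with_exact_recovery(recovers_exactly, r_grid, n_rep=50):
    # return the smallest r in the grid for which all n_rep repetitions give exact recovery
    for r in sorted(r_grid):
        if all(recovers_exactly(r) for _ in range(n_rep)):
            return r
    return None   # no value in the grid achieved exact recovery in every repetition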
74 Figure 5 suggests that the parameters (a, d) play an important role in determining the performance of the lasso and marginal regression. [sent-713, score-0.507]
75 For example, when a · d < 0, it is known that marginal regression faces the so-called challenge of signal cancellation (see, for example, Wasserman and Roeder, 2009). [sent-717, score-0.308]
76 It seems that the lasso handles signal cancellation better than does marginal regression. [sent-718, score-0.541]
77 In Experiment 3a, we investigate how the choices of parameters (a, b) and the signal strength affect the performance of the lasso and marginal regression. [sent-730, score-0.575]
78 1), we compare the lasso and marginal regression for τ = 2, 3, . [sent-739, score-0.59]
79 The results suggest that the parameters (a, b) play a key role in the performance of both the lasso and marginal regression. [sent-744, score-0.507]
80 Also, for relatively small a, it seems that marginal regression outperforms the lasso (see Panels 1 and 2 of Figure 6). [sent-746, score-0.59]
81 In Experiment 3b, we take a different angle and investigate how the level of signal sparsity affects the performance of the lasso and marginal regression. [sent-747, score-0.541]
82 When ε gets larger (so the signals get denser), marginal regression tends to outperform the lasso. [sent-757, score-0.299]
83 Generally, this is a hard problem, and there is no clear winner between the lasso and marginal regression. [sent-762, score-0.535]
84 Suppose that the Gram matrix is sparse in the sense that each row has relatively few large coordinates. Figure 6: Comparison of Hamming errors by the lasso (dashed) and marginal regression (solid). [sent-765, score-0.65]
85 Second, it seems that the lasso is comparatively more vulnerable to extreme correlation, as discussed in Section 2. [sent-776, score-0.316]
86 Figure 7: Comparison of Hamming errors by the lasso (dashed) and marginal regression (solid). [sent-779, score-0.59]
87 Note that the event {sn > s | Dn} is contained in the event ∪_{k=s}^{p−1} {δn(k) ≥ tn | Dn}. [sent-814, score-0.319]
88 Recalling P(Dn^c) = o(1), P(sn > s) ≤ ∑_{k=s}^{p−1} P(δn(k) ≥ tn | Dn) ≲ ∑_{k=s}^{p−1} P(δn(k) ≥ tn), (23) where we say two positive sequences an ≲ bn if lim_{n→∞}(an/bn) ≤ 1. [sent-815, score-0.516]
89 It follows that n n p−1 p−1 ∑ P(δn (k) ≥ tn ) = k=s k=s ∑ P(δn (k) ≥ tn |Vn (k) 1 p−1 = o( ) ∑ P(Vn (k) n k=s Vn (k + 1))P(Vn (k) Vn (k + 1)) Vn (k + 1)). [sent-820, score-0.458]
90 By the definition of sn, the event {sn < s | Dn} is contained in the event {δn(s − 1) < tn | Dn}. [sent-829, score-0.419]
91 Last, split xis into two terms, xis = xis^(1) + xis^(2), such that xis^(1) ∈ Vn(s−1) and xis^(2) ∈ Vn(s) ∩ (Vn(s−1))^⊥. (29) [sent-850, score-1.27]
92 Recall that xis^(2) is contained in the one-dimensional linear space Vn(s) ∩ Vn(s − 1)^⊥, so without loss of generality, assume (xis^(2), q1) = ||xis^(2)||. [sent-864, score-0.508]
93 Over the event Dn, it follows from the construction of Q and basic algebra that ||(H(s) − H(s − 1))(βis xis^(2) + z)||^2 = (βis ||xis^(2)|| + z1)^2. [sent-871, score-0.553]
94 P(||(H(s) − H(s − 1))Y|| < tn | Dn) = P((βis ||xis^(2)|| + z1)^2 < tn | Dn). [sent-873, score-0.712]
95 So by the definition of ∆n^* = ∆n^*(β, X, p), βis ||xis^(2)|| ≥ ∆n^*, and P((βis ||xis^(2)|| + z1)^2 < tn | Dn) ≤ P(βis ||xis^(2)|| + z1 < tn | Dn) ≤ P(∆n^* + z1 < tn | Dn). Recalling that z1 ∼ N(0, σn^2) and that P(Dn^c) = o(1), [sent-875, score-1.449]
96 n P(∆∗ + z1 < tn |Dn ) ≤ P(∆∗ + z1 < tn ) + o(1). [sent-876, score-0.458]
97 Combining this with (34)-(35) gives P((βis ||xis^(2)|| + z1)^2 < tn | Dn) = o(1). [sent-878, score-0.483]
98 Since ex − 1 ≤ x + ex x2 /2, it follows from Taylor expansion that εn gi (tn ) = εn [etn u(xi ,x j ) − 1] ≤ εn ∑ Eπn [tn u(xi , x j ) + ¯ j=i By definitions mn (X) of 2 εn ∑ j=i Eπn [tn u2 (xi , x j )2 ] 2 (2) = tn µn v2 (X). [sent-902, score-0.366]
99 n (1) v2 (X), n and εn ∑ j=i Eπ [tn u(xi , x j )] = tn µn mn (X), and It follows from (12) that (2) (1) εn gi (tn ) ≤ q log(p) · [ ¯ ec1 q 2 2 t u (xi , x j )2 ]. [sent-903, score-0.366]
100 Since dn(β|X) ≤ p for any variable selection procedure β, Lemma 12 implies that the overall contribution of Dn^c to the Hamming distance dn^*(β) is o(1/p). [sent-928, score-0.404]
wordName wordTfidf (topN-words)
[('css', 0.389), ('lasso', 0.316), ('cns', 0.307), ('xis', 0.254), ('vn', 0.235), ('tn', 0.229), ('marginal', 0.191), ('dn', 0.165), ('arginal', 0.152), ('enovese', 0.152), ('yao', 0.145), ('hamming', 0.143), ('asso', 0.137), ('wasserman', 0.135), ('gi', 0.113), ('sgn', 0.112), ('donoho', 0.111), ('incoherence', 0.111), ('omparison', 0.106), ('sn', 0.1), ('faithfulness', 0.098), ('recovery', 0.097), ('egression', 0.085), ('regression', 0.083), ('ik', 0.083), ('condition', 0.082), ('coordinates', 0.078), ('fix', 0.076), ('exact', 0.073), ('ey', 0.07), ('mr', 0.07), ('ci', 0.061), ('nn', 0.06), ('antn', 0.059), ('bn', 0.058), ('reconstruction', 0.055), ('xs', 0.053), ('cnscss', 0.049), ('gram', 0.047), ('experiment', 0.047), ('tuning', 0.046), ('regions', 0.046), ('event', 0.045), ('jin', 0.043), ('variable', 0.042), ('calibrates', 0.042), ('log', 0.042), ('lim', 0.04), ('csub', 0.04), ('bi', 0.039), ('region', 0.038), ('qk', 0.037), ('cand', 0.037), ('idealized', 0.037), ('row', 0.037), ('cients', 0.037), ('dy', 0.035), ('iid', 0.035), ('zi', 0.035), ('elad', 0.035), ('signal', 0.034), ('choices', 0.034), ('hexagon', 0.034), ('dim', 0.034), ('selection', 0.032), ('dc', 0.032), ('wainwright', 0.032), ('coef', 0.032), ('conditions', 0.031), ('algebraically', 0.031), ('diagram', 0.031), ('nonzero', 0.03), ('etn', 0.03), ('etzi', 0.03), ('jiashun', 0.03), ('parallelogram', 0.03), ('diagonal', 0.029), ('efron', 0.028), ('winner', 0.028), ('dii', 0.028), ('hs', 0.028), ('meinshausen', 0.027), ('maxi', 0.027), ('procedures', 0.027), ('di', 0.026), ('limn', 0.026), ('genovese', 0.025), ('huo', 0.025), ('signals', 0.025), ('lemma', 0.025), ('critical', 0.025), ('mn', 0.024), ('spirtes', 0.023), ('correlation', 0.023), ('coordinate', 0.023), ('matrix', 0.023), ('max', 0.023), ('noisy', 0.023), ('xi', 0.023), ('says', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
Author: Jian Huang, Cun-Hui Zhang
Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity
3 0.11831932 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m ≫ n) and a noisy observation vector y ∈ Rn satisfying y = Xβ∗ + ε where ε is the noise vector following a Gaussian distribution N(0, σ2 I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗ . We show that if X obeys a certain condition, then with a large probability ˆ the difference between the solution β estimated by the proposed method and the true solution β∗ measured in terms of the ℓ p norm (p ≥ 1) is bounded as ˆ β − β∗ p ≤ C(s − N)1/p log m + ∆ σ, where C is a constant, s is the number of nonzero entries in β∗ , the risk of the oracle estimator ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger √ than a certain value in the order of O (σ log m). The proposed √ method improves the estimation √ bound of the standard Dantzig selector approximately from Cs1/p log mσ to C(s − N)1/p log mσ where the value N depends on the number of large entries in β∗ . When N = s, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector y onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case. Keywords: multi-stage, Dantzig selector, LASSO, sparse signal recovery
4 0.11425628 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
5 0.085502625 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R
Author: Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman
Abstract: We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortan, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency. Keywords: high-dimensional undirected graph estimation, glasso, huge, semiparametric graph estimation, data-dependent model selection, lossless screening, lossy screening 1. Overview Undirected graphs is a natural approach to describe the conditional independence among many variables. Each node of the graph represents a single variable and no edge between two variables implies that they are conditional independent given all other variables. In the past decade, significant progress has been made on designing efficient algorithms to learn undirected graphs from high-dimensional observational data sets. Most of these methods are based on either the penalized maximum-likelihood estimation (Friedman et al., 2007) or penalized regression methods (Meinshausen and B¨ hlmann, 2006). Existing packages include glasso, Covpath and CLIME. In particuu ∗. Also in the Department of Biostatistics. †. Also in the Department of Machine Learning. c 2012 Zhao, Liu, Roeder, Lafferty and Wasserman. Z HAO , L IU , ROEDER , L AFFERTY AND WASSERMAN lar, the glasso package has been widely adopted by statisticians and computer scientists due to its friendly user-inference and efficiency. In this paper1 we describe a newly developed R package named huge (High-dimensional Undirected Graph Estimation) coded in C. The package includes a wide range of functional modules and addresses some drawbacks of the graphical lasso algorithm. To gain more scalability, the package supports two modes of screening, lossless (Witten et al., 2011) and lossy screening. When using lossy screening, the user can select the desired screening level to scale up for high-dimensional problems, but this introduces some estimation bias. 2. Software Design and Implementation The package huge aims to provide a general framework for high-dimensional undirected graph estimation. The package includes Six functional modules (M1-M6) facilitate a flexible pipeline for analysis (Figure 1). M1. Data Generator: The function huge.generator() can simulate multivariate Gaussian data with different undirected graphs, including hub, cluster, band, scale-free, and Erd¨ s-R´ nyi o e random graphs. The sparsity level of the obtained graph and signal-to-noise ratio can also be set up by users. M2. Semiparametric Transformation: The function huge.npn() implements the nonparanormal method (Liu et al., 2009, 2012) for estimating a semiparametric Gaussian copula model.The nonparanormal family extends the Gaussian distribution by marginally transforming the variables. 
Computationally, the nonparanormal transformation only requires one pass through the data matrix. M3. Graph Screening: The scr argument in the main function huge() controls the use of largescale correlation screening before graph estimation. The function supports the lossless screening (Witten et al., 2011) and the lossy screening. Such screening procedures can greatly reduce the computational cost and achieve equal or even better estimation by reducing the variance at the expense of increased bias. Figure 1: The graph estimation pipeline. M4. Graph Estimation: Similar to the glasso package, the method argument in the huge() function supports two estimation methods: (i) the neighborhood pursuit algorithm (Meinshausen and B¨ hlmann, 2006) and (ii) the graphical lasso algorithm (Friedman et al., 2007). We apply u the coordinate descent with active set and covariance update, as well as other tricks suggested in Friedman et al. (2010). We modified the warm start trick to address the potential divergence problem of the graphical lasso algorithm (Mazumder and Hastie, 2011). The code is also memory-optimized using the sparse matrix data structure when estimating and storing full regularization paths for large 1. This paper is only a summary of the package huge. For more details please refer to the online vignette. 1060 H IGH - DIMENSIONAL U NDIRECTED G RAPH E STIMATION data sets. we also provide a complementary graph estimation method based on thresholding the sample correlation matrix, which is computationally efficient and widely applied in biomedical research. M5. Model Selection: The function huge.select() provides two regularization parameter selection methods: the stability approach for regularization selection (StARS) (Liu et al., 2010); and rotation information criterion (RIC). We also provide a likelihood-based extended Bayesian information criterion. M6. Graph Visualization: The plotting functions huge.plot() and plot() provide visualizations of the simulated data sets, estimated graphs and paths. The implementation is based on the igraph package. 3. User Interface by Example We illustrate the user interface by analyzing a stock market data which we contribute to the huge package. We acquired closing prices from all stocks in the S&P; 500 for all the days that the market was open between Jan 1, 2003 and Jan 1, 2008. This gave us 1258 samples for the 452 stocks that remained in the S&P; 500 during the entire time period. > > > > > library(huge) data(stockdata) # Load the data x = log(stockdata$data[2:1258,]/stockdata$data[1:1257,]) # Preprocessing x.npn = huge.npn(x, npn.func=
6 0.081985265 68 jmlr-2012-Minimax Manifold Estimation
7 0.076310806 111 jmlr-2012-Structured Sparsity and Generalization
8 0.07568635 20 jmlr-2012-Analysis of a Random Forests Model
9 0.069872178 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
10 0.06873031 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
11 0.063767232 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
12 0.06220077 13 jmlr-2012-Active Learning via Perfect Selective Classification
13 0.059676055 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
14 0.058113161 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
15 0.055168513 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
16 0.050717406 59 jmlr-2012-Linear Regression With Random Projections
17 0.049548291 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
18 0.048861776 82 jmlr-2012-On the Necessity of Irrelevant Variables
19 0.046666872 97 jmlr-2012-Regularization Techniques for Learning with Matrices
20 0.044820115 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
topicId topicWeight
[(0, -0.235), (1, 0.158), (2, -0.154), (3, -0.093), (4, -0.037), (5, 0.187), (6, -0.09), (7, -0.181), (8, -0.021), (9, 0.009), (10, 0.109), (11, 0.004), (12, -0.033), (13, -0.028), (14, 0.057), (15, 0.007), (16, -0.09), (17, 0.083), (18, -0.044), (19, -0.124), (20, -0.129), (21, 0.01), (22, 0.159), (23, -0.159), (24, 0.017), (25, -0.119), (26, -0.005), (27, 0.174), (28, 0.102), (29, -0.064), (30, -0.001), (31, -0.067), (32, 0.025), (33, -0.108), (34, -0.004), (35, -0.014), (36, -0.022), (37, -0.05), (38, 0.011), (39, 0.099), (40, -0.04), (41, -0.012), (42, -0.023), (43, -0.01), (44, -0.018), (45, 0.135), (46, 0.013), (47, -0.047), (48, -0.183), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.9490726 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
Author: Jian Huang, Cun-Hui Zhang
Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity
3 0.56086355 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
4 0.50126618 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m ≫ n) and a noisy observation vector y ∈ Rn satisfying y = Xβ∗ + ε where ε is the noise vector following a Gaussian distribution N(0, σ2 I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗ . We show that if X obeys a certain condition, then with a large probability ˆ the difference between the solution β estimated by the proposed method and the true solution β∗ measured in terms of the ℓ p norm (p ≥ 1) is bounded as ˆ β − β∗ p ≤ C(s − N)1/p log m + ∆ σ, where C is a constant, s is the number of nonzero entries in β∗ , the risk of the oracle estimator ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger √ than a certain value in the order of O (σ log m). The proposed √ method improves the estimation √ bound of the standard Dantzig selector approximately from Cs1/p log mσ to C(s − N)1/p log mσ where the value N depends on the number of large entries in β∗ . When N = s, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector y onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case. Keywords: multi-stage, Dantzig selector, LASSO, sparse signal recovery
5 0.39368179 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
6 0.3654682 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R
7 0.35257685 68 jmlr-2012-Minimax Manifold Estimation
8 0.3353326 20 jmlr-2012-Analysis of a Random Forests Model
9 0.31855938 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
10 0.31435335 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
11 0.28404495 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
12 0.26926479 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
13 0.26617342 59 jmlr-2012-Linear Regression With Random Projections
14 0.26158598 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
15 0.25942352 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
16 0.2546674 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
17 0.24232177 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
18 0.23745701 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
19 0.23357026 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
20 0.21682163 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
topicId topicWeight
[(21, 0.082), (26, 0.033), (29, 0.047), (35, 0.014), (49, 0.012), (57, 0.014), (64, 0.011), (75, 0.033), (77, 0.394), (79, 0.015), (92, 0.142), (96, 0.096)]
simIndex simValue paperId paperTitle
1 0.8955934 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER
same-paper 2 0.8173725 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
3 0.78937984 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
4 0.51137304 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
5 0.49461824 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi
Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression
6 0.48340985 82 jmlr-2012-On the Necessity of Irrelevant Variables
7 0.47736406 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
8 0.47430968 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
9 0.46968025 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
10 0.46115255 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
11 0.45838642 20 jmlr-2012-Analysis of a Random Forests Model
12 0.45721346 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
13 0.45168573 111 jmlr-2012-Structured Sparsity and Generalization
14 0.4501949 73 jmlr-2012-Multi-task Regression using Minimal Penalties
15 0.44543093 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
16 0.44433337 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
17 0.4435412 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
18 0.44278133 68 jmlr-2012-Minimax Manifold Estimation
19 0.43778419 109 jmlr-2012-Stability of Density-Based Clustering
20 0.43745977 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming