jmlr jmlr2006 jmlr2006-66 knowledge-graph by maker-knowledge-mining

66 jmlr-2006-On Model Selection Consistency of Lasso


Source: pdf

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. [sent-10, score-0.072]

2 Keywords: Lasso, regularization, sparsity, model selection, consistency 1. [sent-13, score-0.106]

3 We make this assessment by investigating Lasso’s model selection consistency. [sent-24, score-0.14]

4 The model is assumed to be “sparse”, that is, some of the regression coefficients βn are exactly zero, corresponding to predictors that are irrelevant to the response. [sent-42, score-0.09]

5 Under some regularity conditions on the design, Knight and Fu (2000) have shown estimation consistency for Lasso for fixed p and fixed βn (i. [sent-55, score-0.132]

6 In addition, it is shown in the work that for λn ∝ √n (on the same order of √n), as n → ∞ there is a non-vanishing positive probability for lasso to select the true model. [sent-59, score-0.566]

7 On the model selection consistency front, Meinshausen and Buhlmann (2006) have shown that under a set of conditions, Lasso is consistent in estimating the dependency between Gaussian variables even when the number of variables p grows faster than n. [sent-60, score-0.174]

8 In general, as we will show, if an irrelevant predictor is highly correlated with the predictors in the true model, Lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization. [sent-65, score-0.146]

9 In particular, in one example our condition coincides with the “Coherence” condition in Donoho et al. [sent-69, score-0.128]

10 Our analytical approach is direct and we thoroughly explain through special cases and simulations the meaning of this condition in various cases. [sent-73, score-0.08]

11 We then elaborate on the condition by extending to other sufficient conditions that are more intuitive and verifiable to relate to previous theoretical and simulation studies of Lasso. [sent-81, score-0.141]

12 Section 3 contains simulation results to illustrate our result and to build a heuristic sense of how strong the condition is. [sent-82, score-0.187]

13 To conclude, Section 4 compares Lasso with thresholding and discusses alternatives and possible modifications of Lasso to achieve selection consistency when Irrepresentable Condition fails. [sent-83, score-0.131]

14 However, to separate the selection aspect of the consistency from the parameter estimation aspect, we make the following definitions of Sign Consistency, which do not assume the estimates to be estimation consistent. [sent-88, score-0.109]

15 Definition 1 An estimate β̂n is equal in sign with the true model βn, written β̂n =s βn, if and only if sign(β̂n) = sign(βn), where sign(·) maps a positive entry to 1, a negative entry to -1 and zero to zero; that is, β̂n matches the zeros and signs of βn. [sent-89, score-0.207]
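As a concrete illustration of Definition 1, a minimal Python sketch of the sign-equality check; the function name is ours, not the paper's:

```python
import numpy as np

def sign_equal(beta_hat, beta):
    """Definition 1: beta_hat =_s beta iff sign(beta_hat) == sign(beta) entry-wise,
    i.e. the estimate matches both the zeros and the signs of the true vector."""
    return np.array_equal(np.sign(beta_hat), np.sign(beta))
```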

16 Sign consistency is stronger than the usual selection consistency which only requires the zeros to be matched, but not the signs. [sent-90, score-0.184]

17 We also argue that an estimated model with reversed signs can be misleading and hardly qualifies as a correctly selected model. [sent-93, score-0.074]

18 Now we define two kinds of sign consistencies for Lasso depending on how the amount of regularization is determined. [sent-94, score-0.191]

19 Strong Sign Consistency implies one can use a preselected λ to achieve consistent model selection via Lasso. [sent-97, score-0.099]

20 Surprisingly, as implied by our results, the two kinds of sign consistencies are almost equivalent to one condition. [sent-100, score-0.166]

21 The Irrepresentable Condition closely resembles a regularization constraint on the regression coefficients of the irrelevant covariates (Xn(2)) on the relevant covariates (Xn(1)). [sent-134, score-0.249]
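To make the condition concrete, here is a small Python sketch checking the element-wise inequality |C21ⁿ (C11ⁿ)⁻¹ sign(β(1))| ≤ 1 − η from a covariance matrix and a coefficient vector; the function name, the eta argument, and the support-based partition are our own illustrative choices, not code from the paper:

```python
import numpy as np

def strong_irrepresentable_holds(C, beta, eta=0.0):
    """Check |C21 C11^{-1} sign(beta(1))| <= 1 - eta element-wise, partitioning C
    by the support of beta (relevant vs. irrelevant predictors).  eta > 0 gives the
    Strong condition with margin eta; eta = 0 is the non-strict (Weak) boundary."""
    beta = np.asarray(beta, dtype=float)
    support = beta != 0
    C11 = C[np.ix_(support, support)]
    C21 = C[np.ix_(~support, support)]
    lhs = np.abs(C21 @ np.linalg.solve(C11, np.sign(beta[support])))
    return bool(np.all(lhs <= 1.0 - eta))
```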

22 As a preparatory result, the following proposition puts a lower bound on the probability of Lasso picking the true model, quantitatively relating the probability of Lasso selecting the correct model to how well Strong Irrepresentable Condition holds: Proposition 1. [sent-137, score-0.121]

23 Our main results relate Strong and Weak Irrepresentable Conditions with strong and general sign consistency. [sent-147, score-0.231]

24 n 1≤i≤n i (4) In practice, the covariates are usually scaled so that the diagonal elements of C n are all 1’s. [sent-157, score-0.103]

25 For fixed q, p and βn = β, under regularity conditions (3) and (4), Lasso is strongly sign consistent if Strong Irrepresentable Condition holds. [sent-169, score-0.224]

26 Therefore Strong Irrepresentable Condition allows for consistent model selection and parameter estimation simultaneously. [sent-174, score-0.099]

27 On the other hand, Theorem 2 shows that Weak Irrepresentable Condition is also necessary even for the weaker general sign consistency. [sent-175, score-0.133]

28 For fixed p, q and βn = β, under regularity conditions (3) and (4), Lasso is general sign consistent only if there exists N so that Weak Irrepresentable Condition holds for n > N. [sent-177, score-0.252]

29 Therefore, Strong Irrepresentable Condition implies strong sign consistency implies general sign consistency implies Weak Irrepresentable Condition. [sent-179, score-0.5]

30 So except for the technical difference between the two conditions, Irrepresentable Condition is almost necessary and sufficient for both strong sign consistency and general sign consistency. [sent-180, score-0.425]

31 Furthermore, under additional regularity conditions on the noise terms εiⁿ, this “small” p result can be extended to the “large” p case. [sent-181, score-0.075]

32 2 Model Selection Consistency for Large p and q In the large p and q case, we allow the dimension of the designs Cn and the model parameters βn to grow as n grows, that is, p = pn and q = qn are allowed to grow with n. [sent-184, score-0.171]

33 (6) requires the design of the relevant covariates to have eigenvalues bounded from below so that the inverse of C11ⁿ behaves well. [sent-192, score-0.154]

34 Condition (7) is a sparsity assumption which requires the square root of the size of the true model, √qn, to grow at a rate slower than the rate gap, which consequently prevents the estimation bias of the Lasso solutions from dominating the model parameters. [sent-197, score-0.107]

35 Under conditions (5), (6), (7) and (8), Strong Irrepresentable Condition implies that Lasso has strong sign consistency for pn = o(n^((c2−c1)k)). [sent-203, score-0.337]

36 Theorem 3 states that Lasso can select the true model consistently given that Strong Irrepresentable Condition holds and the noise terms have some finite moments. [sent-206, score-0.077]

37 If all moments of the noise exist then, by Theorem 3, p can grow at any polynomial rate and the probability of Lasso selecting the true model converges to 1 at a faster rate than any polynomial rate. [sent-208, score-0.11]

38 Under conditions (5), (6), (7) and (8), if there exists 0 ≤ c3 < c2 − c1 for which pn = O(e^(n^c3)), then Strong Irrepresentable Condition implies that Lasso has strong sign consistency. [sent-214, score-0.346]

39 It is an encouraging result that using Lasso we can allow p to grow much faster than n (up to exponentially fast) while still allowing for fast convergence of the probability of correct model selection to 1. [sent-220, score-0.092]

40 In general, the result of Theorem 3 is tight in the sense that if higher moments of the noise distribution do not exist then the tail probability of the noise terms does not vanish quickly enough to allow p to grow at higher degree polynomial rates. [sent-222, score-0.105]

41 A closer look discloses that (2) does not depend on C22ⁿ, that is, the covariance of the covariates that are not in the true model. [sent-230, score-0.123]

42 It linearly depends on C21ⁿ, the correlations between the covariates that are in the model and the ones that are not. [sent-231, score-0.134]

43 For the C11ⁿ part, except for special cases (Corollary 1) we also want the correlations between covariates that are in the model to be small, otherwise C11ⁿ may contain small eigenvalues, which leads to large eigenvalues for (C11ⁿ)⁻¹ and results in the violation of (2). [sent-232, score-0.196]

44 All diagonal elements of Cn are assumed to be 1 which is equivalent to normalizing all covariates in the model to the same scale since Strong Irrepresentable Condition is invariant under any common scaling of C n . [sent-234, score-0.134]

45 Corollary 1 has particularly strong implications for applications of Lasso where the covariates of the regression are designed with a symmetry so that the covariates share a constant correlation. [sent-253, score-0.29]

46 Under such a design, this result implies that Strong Irrepresentable Condition holds even for p growing with n as long as q remains fixed and consequently ensures that Lasso selects the true model asymptotically. [sent-254, score-0.075]

47 With r ≤ −1/(2q − 1) and the nonzero βi’s all positive, |C21(C11)⁻¹ sign(β(1))| ≥ 1 holds element-wise, which fails Strong Irrepresentable Condition. [sent-278, score-0.069]
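A small numeric illustration of this constant-correlation design, assuming Python/NumPy; the helper name and the example values of p, q and r are ours (note the negative-r failure case needs p < 2q for the full p×p matrix to remain positive definite):

```python
import numpy as np

def equicorrelation_lhs(p, q, r):
    """Constant-correlation design: C_ii = 1, C_ij = r for i != j.  Returns the
    entries of |C21 C11^{-1} sign(beta(1))| with all nonzero coefficients positive."""
    C = np.full((p, p), r) + (1.0 - r) * np.eye(p)
    C11, C21 = C[:q, :q], C[q:, :q]
    return np.abs(C21 @ np.linalg.solve(C11, np.ones(q)))

# e.g. p=5, q=3: r = 0.4 gives entries around 0.67 (condition holds), while
# r = -0.22 <= -1/(2q-1) pushes the entries above 1 (condition fails).
```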

48 Although this design introduces a more sophisticated correlation structure between the predictors and does not seem restrictive, the following corollary states that under this design Strong Irrepresentable Condition holds for any q. [sent-288, score-0.191]

49 In addition, as instances of Corollary 2, under some simplified designs which are often used for theoretical studies, Lasso is consistent for model selection. [sent-294, score-0.112]

50 If the design is orthogonal, or q = 1 and the predictors are normalized with correlations bounded away from 1, or p = 2 and the predictors are normalized with correlations bounded away from 1, then Strong Irrepresentable Condition holds. [sent-296, score-0.156]

51 As commonly assumed in practice, this scenario is a hybrid between the most highly structured designs, like the orthogonal design, and a general design. [sent-298, score-0.077]

52 Taking b1ⁿ, ..., bkⁿ to correspond to different blocks, Strong Irrepresentable Condition holds if and only if there exists a common 0 < η ≤ 1 for which Strong Irrepresentable Condition holds for all Bjⁿ and bjⁿ, j = 1, ..., k. [sent-318, score-0.228]
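A quick way to sanity-check this block-wise statement numerically, reusing the strong_irrepresentable_holds sketch above and SciPy's block_diag; the helper is our own illustration, not code from the paper:

```python
import numpy as np
from scipy.linalg import block_diag

def blockwise_equivalence(blocks, betas, eta):
    """With C = diag(B_1, ..., B_k) and the coefficient vector split the same way,
    the condition (with margin eta) holds for C exactly when it holds for every
    block; this helper just verifies that equivalence numerically."""
    C = block_diag(*blocks)
    beta = np.concatenate(betas)
    per_block = all(strong_irrepresentable_holds(B, b, eta)
                    for B, b in zip(blocks, betas))
    whole = strong_irrepresentable_holds(C, beta, eta)
    return per_block == whole
```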

53 Through Corollaries 1 - 5, we have shown that under specific designs, which are commonly used or assumed in previous works, Irrepresentable Condition holds, which leads to Lasso’s consistency in model selection. [sent-323, score-0.134]

54 Next, we demonstrate Lasso’s model selection consistency and the Irrepresentable Conditions using simulations. [sent-324, score-0.14]

55 The second simulation quantitatively relates the consistency (inconsistency) of Lasso to how well the Strong Irrepresentable Condition holds (fails) by counting the percentages of Lasso selecting the true model and comparing them to η∞ = 1 − ‖C21ⁿ(C11ⁿ)⁻¹ sign(βn)‖∞. [sent-330, score-0.206]
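A rough Monte Carlo sketch of this kind of experiment, assuming Python with NumPy and scikit-learn's lars_path to trace the Lasso path; the sample size, noise level and function name are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from sklearn.linear_model import lars_path

def sign_match_rate(C, beta, n=100, sigma=1.0, n_rep=1000, seed=0):
    """Draw X with rows ~ N(0, C) (C must be positive definite), set y = X beta + noise,
    and record how often some point on the Lasso path matches the signs of beta."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    L = np.linalg.cholesky(C)
    hits = 0
    for _ in range(n_rep):
        X = rng.standard_normal((n, len(beta))) @ L.T
        y = X @ beta + sigma * rng.standard_normal(n)
        _, _, coefs = lars_path(X, y, method="lasso")  # coefs: (p, number of path knots)
        if any(np.array_equal(np.sign(coefs[:, k]), np.sign(beta))
               for k in range(coefs.shape[1])):
            hits += 1
    return hits / n_rep
```

The returned frequency can then be plotted against η∞ computed from C and β to reproduce the qualitative pattern described here.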

56 In the last simulation, we establish a heuristic sense of how strong our Strong Irrepresentable Condition is for different values of p and q by observing how often the condition holds when C is sampled from a Wishart(p, p) distribution. [sent-331, score-0.176]
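A hedged sketch of this heuristic check, assuming SciPy's Wishart sampler; the normalization to unit diagonal, the choice of the first q predictors as relevant, and the all-ones sign vector (cf. the assumption below that the entries of β(1) are positive) are our reading of the set-up:

```python
import numpy as np
from scipy.stats import wishart

def condition_frequency(p, q, n_rep=1000, seed=0):
    """Sample C ~ Wishart(p, I_p), rescale to unit diagonal, and count how often
    |C21 C11^{-1} 1_q| < 1 element-wise, i.e. the condition with all nonzero
    coefficients taken positive and the first q predictors treated as relevant."""
    samples = wishart.rvs(df=p, scale=np.eye(p), size=n_rep, random_state=seed)
    count = 0
    for W in samples:
        d = np.sqrt(np.diag(W))
        Cn = W / np.outer(d, d)          # normalize to unit diagonal
        C11, C21 = Cn[:q, :q], Cn[q:, :q]
        lhs = np.abs(C21 @ np.linalg.solve(C11, np.ones(q)))
        count += bool(np.all(lhs < 1.0))
    return count / n_rep
```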

57 Now we investigate how these two set-ups lead to Lasso’s sign consistency and inconsistency respectively. [sent-350, score-0.232]

58 Therefore, for Lasso to be sign consistent, the signs of β1 and β2 have to disagree, which happens in setting (b) but not in setting (a). [sent-369, score-0.219]

59 To evaluate how well the Irrepresentable Condition holds we calculate η∞ = 1 − ‖C21ⁿ(C11ⁿ)⁻¹ sign(βn)‖∞. [sent-381, score-0.092]

60 For each design, we run the simulation 1000 times and examine general sign consistencies. [sent-392, score-0.172]

61 The entire path is examined to see if there exists a model estimate that matches the signs of the true model. [sent-396, score-0.074]

62 In general, this is consistent with our results (Proposition 1 and Theorems 1 to 4): for η∞ > 0, if n is large enough, the probability that Lasso selects the true model gets close to 1, which does not happen if η∞ < 0. [sent-403, score-0.081]

63 This quantitatively illustrates the importance of Strong Irrepresentable Condition for Lasso’s model selection performance. [sent-404, score-0.083]

64 In this simulation, we establish some heuristic sense of how strong our Strong Irrepresentable Condition is for different values of p and q. [sent-411, score-0.084]

65 Since Strong Irrepresentable Condition holds for designs that are close to orthogonal... [Table: percentage of sampled designs for which the condition holds, by q = p/8, 2p/8, ..., 7p/8 and several values of p; only the p = 23 column partially survives extraction, starting 100%, 72.x%.] [sent-414, score-0.075]

66 The entries of β(1) are assumed to be positive, otherwise a sign flip of the corresponding Xi ’s can make the corresponding βi positive. [sent-444, score-0.146]

67 Discussions In this paper, we have provided Strong and Weak Irrepresentable Conditions that are almost necessary and sufficient for model selection consistency of Lasso under both small p and large p settings. [sent-450, score-0.14]

68 Thus, alternative computationally feasible methods that lead to selection consistency when the condition fails are of interest. [sent-455, score-0.197]

69 In particular, for small p cases, if consistency is the only concern then thresholding (either hard or soft) is an obvious choice that guarantees consistent selection. [sent-456, score-0.131]
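A minimal sketch of such a hard-thresholding rule on OLS estimates, assuming Python/NumPy; the function name and the threshold parameter tau are illustrative, and choosing tau is left to the user:

```python
import numpy as np

def hard_threshold_ols(X, y, tau):
    """Hard-thresholded least squares: fit OLS, then zero out every coefficient
    with absolute value at most tau.  With small p and a threshold tau shrinking
    to zero more slowly than the OLS estimation error, this recovers the signs."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.where(np.abs(beta_ols) > tau, beta_ols, 0.0)
```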

70 However, as emphasized earlier, consistency does not mean good performance in finite samples, which is what matters in many applications where Lasso-type techniques are used. [sent-458, score-0.075]

71 2 and examined the sign matching rate of thresholding and compared it to Lasso’s performance. [sent-462, score-0.155]

72 This is loosely justified since, for instance, from Knight and Fu (2000) we know Lasso is consistent for λ = o(n) and therefore can pick up all the true predictors if the amount of data is sufficient (although it may over-select). [sent-471, score-0.092]

73 Finally, we think it is possible to directly construct an alternative regularization to Lasso that selects models consistently under much weaker conditions and at the same time remains computationally feasible. [sent-472, score-0.098]

74 When Strong Irrepresentable Condition fails, the irrelevant covariates are correlated with the relevant covariates strongly enough to be picked up by Lasso to compensate for the over-shrinkage of the nonzero β’s. [sent-474, score-0.253]

75 Let ûn = β̂n − βn, and define Vn(un) = Σ_{i=1}^{n} [(εi − Xi un)² − εi²] + λn ‖un + β‖₁; we have ûn = arg min_{un} [Σ_{i=1}^{n} (εi − Xi un)² + λn ‖un + β‖₁]. (9) [sent-497, score-0.456]

76 n (9) u The first summation in Vn (un ) can be simplified as follows: n ∑ [(εi − Xi un )2 − ε2 ] i i=1 n = ∑ [−2εi Xi un + (un )T XiT Xi un ], i=1 √ √ √ = −2W n ( nun ) + ( nun )T Cn ( nun ), √ where W n = (X n )T εn / n. [sent-499, score-0.489]

77 Differentiating with respect to un, d[−2Wⁿ(√n un) + (√n un)ᵀ Cⁿ (√n un)] / dun = 2√n (Cⁿ(√n un) − Wⁿ). (10) (11) [sent-503, score-0.424]

78 Let ûn(1), Wⁿ(1) and ûn(2), Wⁿ(2) denote the first q and last p − q entries of ûn and Wⁿ respectively. [sent-504, score-0.241]

79 Then by Lemma 1, (9), (11) and the uniqueness of Lasso solutions, if there exists ûn(1) for which the following holds: C11ⁿ(√n ûn(1)) − Wⁿ(1) = −(λn/(2√n)) sign(β(1)ⁿ), |ûn(1)| < |β(1)ⁿ|, and −(λn/(2√n)) 1 ≤ C21ⁿ(√n ûn(1)) − Wⁿ(2) ≤ (λn/(2√n)) 1, [sent-510, score-0.202]

80 then sign(β̂n) = sign(βn) and β̂n(2) = ûn(2) = 0. [sent-511, score-0.076]

81 Whereas 1 − P(An ∩ Bn) ≤ P(Anᶜ) + P(Bnᶜ) ≤ Σ_{i=1}^{q} P(|ziⁿ| ≥ √n(|βiⁿ| − (λn/(2n)) biⁿ)) + Σ_{i=1}^{p−q} P(|ζiⁿ| ≥ (λn/(2√n)) ηi). [sent-517, score-0.086]

82 Since λn/n → 0 and λn/n^((1+c)/2) → ∞ with 0 ≤ c < 1, and p, q and βn are all fixed, therefore Σ_{i=1}^{q} P(|ziⁿ| ≥ √n(|βiⁿ| − (λn/(2n)) biⁿ)) ≤ (1 + o(1)) Σ_{i=1}^{q} (1 − Φ((1 + o(1)) n^(1/2) |βi|)) = o(e^(−n^c)), and Σ_{i=1}^{p−q} P(|ζiⁿ| ≥ (λn/(2√n)) ηi) = Σ_{i=1}^{p−q} (1 − Φ((λn/(2√n)) ηi)) = o(e^(−n^c)). [sent-531, score-0.086]

83 Therefore by Lemma 1 and (11) from the proof of Proposition 1, we have C11ⁿ(√n ûn(1)) − Wⁿ(1) = −(λn/(2√n)) sign(β̂(1)ⁿ) = −(λn/(2√n)) sign(β(1)ⁿ) (15) and |C21ⁿ(√n ûn(1)) − Wⁿ(2)| ≤ (λn/(2√n)) 1 (16), which hold over F1n. [sent-537, score-0.174]

84 Rewriting (16) by replacing ûn(1) using (15), we get F1n ⊂ F2n := {(λn/(2√n)) Lⁿ ≤ C21ⁿ(C11ⁿ)⁻¹ Wⁿ(1) − Wⁿ(2) ≤ (λn/(2√n)) Uⁿ}, where Lⁿ = −1 + C21ⁿ(C11ⁿ)⁻¹ sign(β(1)ⁿ) and Uⁿ = 1 + C21ⁿ(C11ⁿ)⁻¹ sign(β(1)ⁿ). [sent-538, score-0.076]

85 As in the proof of Theorem 1, we have 1 − P(An ∩ Bn) ≤ P(Anᶜ) + P(Bnᶜ) ≤ Σ_{i=1}^{q} P(|ziⁿ| ≥ √n(|βiⁿ| − (λn/(2n)) biⁿ)) + Σ_{i=1}^{p−q} P(|ζiⁿ| ≥ (λn/(2√n)) ηi). [sent-549, score-0.086]

86 Therefore ziⁿ = (hiᵃ)ᵀ ε with ‖hiᵃ‖₂² ≤ 1 for ∀i = 1, ... [sent-564, score-0.105]

87 Since I − Xⁿ(1)((Xⁿ(1))ᵀ Xⁿ(1))⁻¹ (Xⁿ(1))ᵀ has eigenvalues between 0 and 1, we have ζiⁿ = (hiᵇ)ᵀ ε with ‖hiᵇ‖₂ ≤ M1 for ∀i = 1, ... [sent-571, score-0.069]

88 (18) Also notice that |(λn/n) biⁿ| = (λn/n) |(C11ⁿ)⁻¹ sign(β(1)ⁿ)|_i ≤ (λn/(n M2)) ‖sign(β(1)ⁿ)‖₂ = λn √q / (n M2). (19) Proof of Theorem 3. [sent-574, score-0.086]

89 Therefore, for λn/√n = o(n^((c2−c1)/2)), using (19), we get Σ_{i=1}^{q} P(|ziⁿ| > √n(|βiⁿ| − (λn/(2n)) biⁿ)) = q·O(n^(−k c2)) = o(pn/nᵏ). [sent-580, score-0.11]

90 Using the tail probability bound (14) on Gaussian random variables, for λn ∝ n^((1+c4)/2), by (19) we immediately have Σ_{i=1}^{q} P(|ziⁿ| > √n(|βiⁿ| − (λn/(2n)) biⁿ)) = q · O(1 − Φ((1 + o(1)) M3 M2 n^(c2/2))) = o(e^(−n^c3)). [sent-591, score-0.109]

91 Therefore the inversion of C11ⁿ, the q×q matrix with 1’s on the diagonal and rn in every off-diagonal entry, [sent-612, score-0.084]

92 can be obtained by applying the formula and taking the reciprocal of the eigenvalues, which gives (C11ⁿ)⁻¹, a q×q matrix with c on the diagonal and d in every off-diagonal entry. [sent-627, score-0.118]

93 Now since C21ⁿ = rn × 1_{(p−q)×q}, we get C21ⁿ(C11ⁿ)⁻¹ = rn (c + (q − 1)d) 1_{(p−q)×q} = (rn / (1 + (q − 1) rn)) 1_{(p−q)×q}. [sent-643, score-0.126]

94 For ∀k ∈ I2, let kl be the largest index in {i : i < k} ∩ I1 and kh the smallest index in {i : i > k} ∩ I1. [sent-672, score-0.219]

95 Then by the Markov property, we have x_{jk} ⊥ x_{jg} | (x_{j,kl}, x_{j,kh}) for j = 1, ... [sent-673, score-0.072]

96 Therefore by the regression interpretation as in (2), to check Strong Irrepresentable Condition for x_{jk} we only need to consider x_{j,kl} and x_{j,kh}, since the rest of the entries are zero by the conditional independence. [sent-676, score-0.085]

97 To further simplify, we assume ρ ≥ 0 (otherwise ρ can be modified to be positive by flipping the signs of predictors 1, 3, 5, . [sent-677, score-0.086]

98 Now regressing x_{jk} on (x_{j,kl}, x_{j,kh}) we get Cov((x_{j,kl}, x_{j,kh}))⁻¹ Cov(x_{jk}, (x_{j,kl}, x_{j,kh})) = [1, ρ^(kh−kl); ρ^(kh−kl), 1]⁻¹ (ρ^(k−kl), ρ^(kh−k))ᵀ, whose entries are (ρ^(k−kl) − ρ^(kh−kl) ρ^(kh−k)) / (1 − ρ^(2(kh−kl))) and (ρ^(kh−k) − ρ^(kh−kl) ρ^(k−kl)) / (1 − ρ^(2(kh−kl))). [sent-681, score-0.216]

99 (a) Since the correlations are all zero, the condition of Corollary 2 holds for ∀q. [sent-686, score-0.112]

100 A note on the lasso and related procedures in model selection. [sent-744, score-0.597]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('irrepresentable', 0.694), ('lasso', 0.566), ('kh', 0.158), ('sign', 0.133), ('covariates', 0.103), ('asso', 0.087), ('nun', 0.087), ('bn', 0.086), ('strong', 0.084), ('un', 0.076), ('consistency', 0.075), ('onsistency', 0.074), ('hao', 0.074), ('xn', 0.07), ('condition', 0.064), ('zn', 0.064), ('kl', 0.061), ('election', 0.053), ('odel', 0.05), ('hb', 0.048), ('knight', 0.047), ('designs', 0.047), ('predictors', 0.043), ('signs', 0.043), ('rn', 0.042), ('wishart', 0.042), ('ha', 0.041), ('corollary', 0.041), ('jkh', 0.039), ('osborne', 0.039), ('simulation', 0.039), ('cn', 0.035), ('selection', 0.034), ('consistent', 0.034), ('jkl', 0.033), ('zhao', 0.033), ('xin', 0.033), ('regularity', 0.033), ('cinj', 0.032), ('model', 0.031), ('design', 0.03), ('efron', 0.03), ('meinshausen', 0.03), ('cm', 0.029), ('holds', 0.028), ('regularization', 0.027), ('grow', 0.027), ('weak', 0.026), ('proposition', 0.026), ('fu', 0.026), ('fails', 0.024), ('inconsistency', 0.024), ('pnk', 0.024), ('zou', 0.024), ('conditions', 0.024), ('tail', 0.023), ('corollaries', 0.022), ('thresholding', 0.022), ('eigenvalues', 0.021), ('pn', 0.021), ('donoho', 0.02), ('cq', 0.02), ('buhlmann', 0.02), ('correlations', 0.02), ('percentage', 0.019), ('theorem', 0.019), ('moments', 0.019), ('correlation', 0.019), ('quantitatively', 0.018), ('qn', 0.018), ('ols', 0.018), ('noise', 0.018), ('yu', 0.017), ('nonzero', 0.017), ('tibshirani', 0.017), ('implied', 0.017), ('shrinks', 0.016), ('selects', 0.016), ('irrelevant', 0.016), ('simulations', 0.016), ('consistencies', 0.016), ('leng', 0.016), ('necessariness', 0.016), ('peng', 0.016), ('presnell', 0.016), ('yn', 0.015), ('amount', 0.015), ('selecting', 0.015), ('completes', 0.015), ('correlated', 0.014), ('relate', 0.014), ('veri', 0.014), ('xit', 0.013), ('knot', 0.013), ('noises', 0.013), ('coherence', 0.013), ('hoerl', 0.013), ('reciprocal', 0.013), ('statistica', 0.013), ('entries', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

2 0.05609281 83 jmlr-2006-Sparse Boosting

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

3 0.043631442 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

4 0.037778188 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V . Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρN ) (with ρ < 1) up to a threshold that depends on the approximation error V − A V , where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V ), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

5 0.03558813 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

6 0.033957552 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

7 0.028624007 45 jmlr-2006-Learning Coordinate Covariances via Gradients

8 0.026414035 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

9 0.02586606 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

10 0.02584346 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

11 0.024492901 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

12 0.024211541 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

13 0.020984963 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

14 0.020485112 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

15 0.020309808 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

16 0.019658547 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

17 0.019539569 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

18 0.019495238 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

19 0.018044978 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

20 0.017896924 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.102), (1, -0.01), (2, -0.049), (3, -0.094), (4, -0.04), (5, 0.054), (6, -0.045), (7, 0.025), (8, 0.012), (9, 0.068), (10, 0.005), (11, -0.02), (12, 0.046), (13, -0.006), (14, 0.046), (15, 0.03), (16, -0.045), (17, -0.038), (18, 0.023), (19, 0.044), (20, 0.134), (21, 0.055), (22, -0.119), (23, -0.102), (24, -0.095), (25, -0.246), (26, 0.174), (27, -0.046), (28, -0.128), (29, 0.026), (30, 0.078), (31, 0.147), (32, 0.244), (33, 0.006), (34, 0.025), (35, -0.13), (36, -0.419), (37, -0.177), (38, 0.025), (39, 0.058), (40, -0.304), (41, 0.165), (42, 0.109), (43, 0.22), (44, 0.202), (45, 0.037), (46, -0.152), (47, 0.164), (48, 0.104), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97269619 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

2 0.35750332 83 jmlr-2006-Sparse Boosting

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

3 0.19114514 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

4 0.16431943 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

Author: Ron Begleiter, Ran El-Yaniv

Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds

5 0.16360418 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

6 0.15017478 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

7 0.14173707 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

8 0.13559322 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

9 0.119067 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models

10 0.11575985 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

11 0.11035997 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

12 0.10132325 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

13 0.097061291 45 jmlr-2006-Learning Coordinate Covariances via Gradients

14 0.095564596 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

15 0.09469036 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs

16 0.091039076 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

17 0.090535186 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

18 0.087619253 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies

19 0.084213741 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

20 0.084205911 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.02), (14, 0.021), (19, 0.305), (36, 0.077), (45, 0.039), (50, 0.09), (63, 0.04), (76, 0.024), (78, 0.031), (79, 0.016), (81, 0.069), (84, 0.028), (89, 0.011), (90, 0.034), (91, 0.014), (96, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74249864 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

2 0.69457603 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

3 0.45031834 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

4 0.43980414 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

5 0.43439198 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

6 0.43212059 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

7 0.4312411 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

8 0.42800608 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

9 0.42724198 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

10 0.42686433 48 jmlr-2006-Learning Minimum Volume Sets

11 0.42536485 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

12 0.4228586 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

13 0.42238835 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

14 0.41986862 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

15 0.41417333 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

16 0.41409326 83 jmlr-2006-Sparse Boosting

17 0.40990928 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

18 0.40821391 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

19 0.40492684 53 jmlr-2006-Learning a Hidden Hypergraph

20 0.40189692 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities