jmlr jmlr2006 jmlr2006-83 knowledge-graph by maker-knowledge-mining

83 jmlr-2006-Sparse Boosting


Source: pdf

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Department of Statistics, University of California, Berkeley, CA 94720-3860, USA. Editors: Yoram Singer and Larry Wasserman. Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. [sent-6, score-0.536]

2 SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through small-step gradient descent. [sent-7, score-0.146]

3 We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. [sent-9, score-0.383]

4 For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. [sent-10, score-0.209]

5 Introduction Since its inception in a practical form in Freund and Schapire (1996), boosting has obtained and maintained its outstanding performance in numerous empirical studies both in the machine learning and statistics literatures. [sent-13, score-0.479]

6 The gradient descent view of boosting as articulated in Breiman (1998, 1999), Friedman et al. [sent-14, score-0.479]

7 (2001) provides a springboard for the understanding of boosting to leap forward and at the same time serves as the base for new variants of boosting to be generated. [sent-16, score-1.064]

8 special case calculation implies that the resistance to the over-fitting behavior of boosting could be due to the fact that the complexity of boosting increases at an extremely slow pace. [sent-22, score-0.958]

9 For high-dimensional linear regression (or classification) problems with many ineffective predictor variables, the Lasso estimate can be very poor in terms of prediction accuracy and as a variable selection method, see Meinshausen (2005). [sent-28, score-0.231]

10 Our new SparseL2 Boost algorithm achieves a higher degree of sparsity while still being computationally feasible, in contrast to all subset selection in linear regression whose computational complexity would generally be exponential in the number of predictor variables. [sent-30, score-0.204]

11 For the special case of orthogonal linear models, we prove here an equivalence of SparseL2 Boost to Breiman’s (1995) nonnegative garrote estimator. [sent-31, score-0.282]

12 In particular, we demonstrate its use in the context of nonparametric second-order interaction modeling with a base procedure (weak learner) using thin plate splines, improving upon Friedman’s (1991) MARS. [sent-36, score-0.208]

13 Since our SparseL2 Boost is based on the final prediction error criterion, it opens up the possibility of bypassing the computationally intensive cross-validation by stopping early based on the model selection score. [sent-37, score-0.147]

14 The gMDL model selection criterion (Hansen and Yu, 2001) uses a data-driven penalty to the L2 -loss and as a consequence bridges between the two well-known AIC and BIC criteria. [sent-38, score-0.158]

15 The boosting methodology in general builds on a user-determined base procedure or weak learner and uses it repeatedly on modified data which are typically outputs from the previous iterations. [sent-52, score-0.586]

16 For L2 Boosting, based on the squared error loss, one simply fits the base procedure to the original data to start with, then uses the residuals from the previous iteration as the new response vector and refits the base procedure, and so on. [sent-54, score-0.365]

17 The regularization of the empirical risk minimization comes in implicitly by the choice of a base procedure and by algorithmic constraints such as early stopping or penalty barriers. [sent-57, score-0.191]

18 Base Procedures Which Do Variable Selection: To be more precise, a base procedure is in our setting a function estimator based on the data {(X_i, U_i); i = 1, . . . , n}. [sent-59, score-0.163]

19 We denote the base procedure function estimator by ĝ(·) = ĝ_(X,U)(·), (1) where X = (X_1, . . . , X_n). [sent-69, score-0.163]

20 In fact, almost all of the successful boosting algorithms in practice involve base procedures which do variable selection: examples include decision trees (see Freund and Schapire, 1996; Breiman, 1998; Friedman et al. [sent-78, score-0.587]

21 It will be useful to represent the base procedure estimator (at the observed predictors X_i) as a hat-operator, mapping the (pseudo-) response to the fitted values: H : U → (ĝ_(X,U)(X_1), . . . , ĝ_(X,U)(X_n)). [sent-80, score-0.233]

22 If the base procedure selects from a set of predictor variables, we denote the selected predictor variable index by Ŝ ⊂ {1, . . . , p}. [sent-87, score-0.352]

23 To emphasize this, we write for the hat operator of a base procedure H_Ŝ : U → (ĝ_(X^(Ŝ),U)(X_1), . . . , ĝ_(X^(Ŝ),U)(X_n)). [sent-91, score-0.222]

24 Componentwise linear least squares in linear model (see Mallat and Zhang, 1993; Bühlmann, 2006): We select only single variables at a time from Γ = {1, 2, . . . , p}. [sent-99, score-0.144]

25 The selector Ŝ chooses the predictor variable which reduces the residual sum of squares most when using least squares fitting: Ŝ = argmin_{1≤j≤p} ∑_{i=1}^n (U_i − γ̂_j X_i^(j))^2, where γ̂_j = ∑_{i=1}^n U_i X_i^(j) / ∑_{i=1}^n (X_i^(j))^2 (j = 1, . . . , p). [sent-103, score-0.46]

26 The base procedure is then ĝ_(X,U)(x) = γ̂_Ŝ x^(Ŝ), and its hat operator is given by the matrix H_Ŝ = X^(Ŝ)(X^(Ŝ))^T, where X^(j) = (X_1^(j), . . . , X_n^(j))^T. [sent-107, score-0.222]

27 L2 Boosting with this base procedure yields a linear model with model selection and parameter estimates which are shrunken towards zero. [sent-111, score-0.245]
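
The componentwise selector and fit above can be sketched in a few lines of code. The following is a minimal illustration, assuming the predictors are stored columnwise in a NumPy array; the function and variable names are ours, not the paper's.

```python
import numpy as np

def componentwise_ls(X, U):
    """Componentwise linear least squares base procedure (sketch).

    For each predictor j, fit U by gamma_j * X[:, j] via least squares and
    select the component giving the smallest residual sum of squares.
    Returns the selected index, its coefficient and the fitted values.
    """
    best_j, best_gamma, best_rss = 0, 0.0, np.inf
    for j in range(X.shape[1]):
        xj = X[:, j]
        gamma_j = (xj @ U) / (xj @ xj)          # least squares coefficient
        rss = np.sum((U - gamma_j * xj) ** 2)   # residual sum of squares
        if rss < best_rss:
            best_j, best_gamma, best_rss = j, gamma_j, rss
    return best_j, best_gamma, best_gamma * X[:, best_j]
```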

28 Componentwise smoothing spline (see Bühlmann and Yu, 2003): Similarly to a componentwise linear least squares fit, we select only one single variable at a time from Γ = {1, 2, . . . , p}. [sent-115, score-0.433]

29 The selector Ŝ chooses the predictor variable which reduces the residual sum of squares most when using a smoothing spline fit. [sent-119, score-0.407]

30 That is, for a given smoothing spline operator with fixed degrees of freedom d. [sent-120, score-0.16]

31 (which is the trace of the corresponding hat matrix) Ŝ = argmin_{1≤j≤p} ∑_{i=1}^n (U_i − ĝ_j(X_i^(j)))^2, where ĝ_j(·) is the fit from the smoothing spline to U versus X^(j) with d. [sent-122, score-0.201]

32 The base procedure is ĝ_(X,U)(x) = ĝ_Ŝ(x^(Ŝ)), and its hat operator is then given by a matrix H_Ŝ. [sent-128, score-0.222]

33 Boosting with this base procedure yields an additive model fit based on selected variables (see Bühlmann and Yu, 2003). [sent-129, score-0.259]

34 Pairwise thin plate splines: Generalizing the componentwise smoothing spline, we select pairs of variables from Γ = {(j, k); 1 ≤ j < k ≤ p}. [sent-130, score-0.281]

35 The selector Ŝ chooses the two predictor variables which reduce the residual sum of squares most when using thin plate splines with two arguments: Ŝ = argmin_{1≤j<k}. [sent-131, score-0.433]

36 The base procedure is ĝ_(X,U)(x) = ĝ_Ŝ(x^(Ŝ)), where x^(Ŝ) denotes the 2-dimensional vector corresponding to the selected pair in Ŝ, and the hat operator is then given by a matrix H_Ŝ. [sent-133, score-0.257]

37 Boosting with this base procedure yields a nonparametric fit with second order interactions based on selected pairs of variables; an illustration is given in section 3. [sent-134, score-0.167]

38 The parameter values for the two terminal nodes in the stump are then given by ordinary least squares, which implies a linear hat matrix H^(j, c_{j,k}). [sent-151, score-0.219]

39 L2 Boosting: Before introducing our new SparseL2 Boost algorithm, we first describe its less sparse counterpart L2 Boosting, a boosting procedure based on the squared error loss which amounts to repeated fitting of residuals with the base procedure ĝ_(X,U)(·). [sent-154, score-0.738]

40 The number of boosting iterations may be estimated by cross-validation. [sent-172, score-0.498]
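
To make the repeated residual fitting concrete, here is a minimal sketch of the L2 Boosting loop with componentwise linear least squares; the step-size `nu`, the iteration count `m_stop`, and the initialization at the mean response are illustrative choices, not necessarily the paper's exact conventions.

```python
import numpy as np

def l2_boosting(X, Y, nu=0.1, m_stop=100):
    """L2 Boosting sketch: repeatedly fit the current residuals with
    componentwise linear least squares and add nu times the fit."""
    F = np.full(len(Y), Y.mean())                        # simple starting function
    selected = []
    for _ in range(m_stop):
        U = Y - F                                        # residuals as the new response
        gammas = (X.T @ U) / np.sum(X ** 2, axis=0)      # per-component LS coefficients
        rss = np.sum((U[:, None] - X * gammas) ** 2, axis=0)
        j = int(np.argmin(rss))                          # component with smallest RSS
        F = F + nu * gammas[j] * X[:, j]                 # small-step update
        selected.append(j)
    return F, selected
```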

41 SparseL2 Boost: As described above, L2 Boosting proceeds in a greedy way: if in Step 2 the base procedure is fitted by least squares and when using ν = 1, L2 Boosting pursues the best reduction of the residual sum of squares in every iteration. [sent-176, score-0.389]

42 Using the notation as in (2), the L2 Boosting operator in iteration m is easily shown to be (see Bühlmann and Yu, 2003) B_m = I − (I − νH_Ŝ_m) · · · (I − νH_Ŝ_1), (4) where Ŝ_m denotes the selector in iteration m. [sent-181, score-0.264]

43 and decision tree base procedures), the boosting operator has a corresponding matrix form when using in (4) the hat matrices for H_Ŝ. [sent-185, score-0.599]

44 The degrees of freedom for boosting are then defined as trace(B_m) = trace(I − (I − νH_Ŝ_m) · · · (I − νH_Ŝ_1)). [sent-186, score-0.544]

45 This is a standard definition for degrees of freedom (see Green and Silverman, 1994) and it has been used in the context of boosting in Bühlmann (2006). [sent-187, score-0.631]
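
The trace of the boosting operator can be accumulated alongside the iterations. A sketch follows, again for componentwise linear least squares, where `selected` is the list of chosen predictor indices from a boosting run; this is our illustration of the definition, not code from the paper.

```python
import numpy as np

def boosting_degrees_of_freedom(X, selected, nu=0.1):
    """Degrees of freedom trace(B_m), B_m = I - (I - nu*H_Sm)...(I - nu*H_S1),
    where H_j = x_j x_j^T / (x_j^T x_j) is the hat matrix of a one-predictor fit."""
    n = X.shape[0]
    prod = np.eye(n)                          # product (I - nu*H_Sm) ... (I - nu*H_S1)
    for j in selected:                        # selected[0] = S_1, selected[1] = S_2, ...
        xj = X[:, j]
        H_j = np.outer(xj, xj) / (xj @ xj)
        prod = (np.eye(n) - nu * H_j) @ prod  # newest factor multiplies on the left
    return np.trace(np.eye(n) - prod)         # trace(B_m)
```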

46 The SparseL2 Boost Algorithm: For SparseL2 Boost, the penalized residual sum of squares in (5) becomes the criterion to move from iteration m − 1 to iteration m. [sent-191, score-0.305]

47 Fit the residuals Ũ_i = Y_i − F̂_{m−1}(X_i) with the base procedure using the selected S̃_m, which yields a function estimate f̂_m(·) = ĝ_{S̃_m;(X,Ũ)}(·), where ĝ_{S;(X,U)}(·) corresponds to the hat operator H_S from the base procedure. [sent-202, score-0.392]

48 While Ŝ_m in (3) minimizes the residual sum of squares, the selected S̃_m in SparseL2 Boost minimizes a model selection criterion over all possible selectors. [sent-211, score-0.201]

49 Since the selector S̃_m depends not only on the current residuals Ũ but also explicitly on all previous boosting iterations through S̃_1, S̃_2, . [sent-212, score-0.62]
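
The key difference to L2 Boosting is that each SparseL2 Boost iteration scores every candidate selector by a penalized criterion that also accounts for the complexity accumulated so far. The sketch below illustrates one such step; `prod` is the running product of the (I − νH) factors, initialized to the identity, and `criterion(rss, df)` is a placeholder for the paper's FPE/gMDL-type score (the AIC-like default here is only a stand-in).

```python
import numpy as np

def sparse_boost_step(X, Y, F, prod, nu=0.1, criterion=None):
    """One SparseL2 Boost-style step (sketch): choose the component that
    minimizes a penalized score combining the new RSS and the new
    trace-complexity of the boosting operator, not the RSS alone."""
    n = X.shape[0]
    U = Y - F
    best = None
    for j in range(X.shape[1]):
        xj = X[:, j]
        gamma_j = (xj @ U) / (xj @ xj)
        H_j = np.outer(xj, xj) / (xj @ xj)
        new_prod = (np.eye(n) - nu * H_j) @ prod        # tentative operator product
        df = np.trace(np.eye(n) - new_prod)             # tentative degrees of freedom
        rss = np.sum((Y - (F + nu * gamma_j * xj)) ** 2)
        score = criterion(rss, df) if criterion else np.log(rss) + 2.0 * df / n
        if best is None or score < best[0]:
            best = (score, j, gamma_j, new_prod)
    _, j, gamma_j, new_prod = best
    return j, F + nu * gamma_j * X[:, j], new_prod
```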

50 Connections to the Nonnegative Garrote Estimator: SparseL2 Boost based on C_γ as in (7) enjoys a surprising equivalence to the nonnegative garrote estimator (Breiman, 1995) in an orthogonal linear model. [sent-218, score-0.338]

51 The nonnegative garrote estimator has been proposed by Breiman (1995) for a linear regression model to improve over subset selection. [sent-245, score-0.367]

52 It shrinks each ordinary least squares (OLS) estimated coefficient by a nonnegative amount whose sum is subject to an upper bound constraint (the garrote). [sent-246, score-0.204]

53 For a given response vector Y and a design matrix X (see (9)), the nonnegative garrote estimator takes the form β̂_{Nngar,j} = c_j β̂_{OLS,j} such that ∑_{i=1}^n (Y_i − (Xβ̂_{Nngar})_i)^2 is minimized, subject to c_j ≥ 0, ∑_{j=1}^p c_j ≤ s, (10) for some s > 0. [sent-247, score-0.334]

54 In the orthonormal case from (8), since the ordinary least squares estimator is simply β̂_{OLS,j} = (X^T Y)_j = Z_j, the nonnegative garrote minimization problem becomes finding the c_j such that ∑_{j=1}^n (Z_j − c_j Z_j)^2 is minimized, subject to c_j ≥ 0, ∑_{j=1}^n c_j ≤ s. [sent-248, score-0.441]

55 We show in Figure 1 the nonnegative garrote threshold function in comparison to hard- and soft-thresholding, the former corresponding to subset variable selection and the latter to the Lasso (Tibshirani, 1996). [sent-256, score-0.287]
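
For reference, the three threshold functions compared in Figure 1 can be written down directly in the orthonormal case. The sketch below uses the standard textbook forms; in particular, how the threshold parameter enters the garrote factor is our assumption and need not match the paper's equation (12) exactly.

```python
import numpy as np

def hard_threshold(z, lam):
    """Hard-thresholding: keep z unchanged if |z| > lam, otherwise set it to zero."""
    return np.where(np.abs(z) > lam, z, 0.0)

def soft_threshold(z, lam):
    """Soft-thresholding (the Lasso solution in the orthonormal case)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def garrote_threshold(z, lam):
    """Nonnegative garrote shrinkage: multiply z by the factor (1 - lam / z^2)_+,
    which shrinks large coefficients only slightly and zeroes out small ones."""
    with np.errstate(divide="ignore", invalid="ignore"):
        c = np.maximum(1.0 - lam / z ** 2, 0.0)
    return np.nan_to_num(c) * z
```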

56 The following result shows the equivalence of the nonnegative garrote estimator and SparseL2 Boost with componentwise linear least squares (using m iterations), yielding coefficient estimates β̂^(m)_{SparseBoost,j}. [sent-261, score-0.607]

57 For SparseL2 Boost with componentwise linear least squares, based on Cγn as in (7) and using a step-size ν, as described in section 2. [sent-263, score-0.171]

58 For the orthogonal case, Theorem 1 yields the interesting interpretation of SparseL2 Boost as the nonnegative garrote estimator. [sent-267, score-0.282]

59 We also describe here for the orthogonal case the equivalence of L2 Boosting with componentwise linear least squares (yielding coefficient estimates β̂^(m)_{Boost,j}) to soft-thresholding. [sent-268, score-0.149]

60 For L2 Boosting with componentwise linear least squares and using a step-size ν, as described in section 2. [sent-275, score-0.285]

61 2, there exists a boosting iteration m, typically depending on λ_n, ν and the data, such that β̂^(m)_{Boost,j} = β̂_{soft,j} in (13) with a threshold of the form λ_n(1 + e_j(ν)), where max_j |e_j(ν)| ≤ ν/(1 − ν) → 0 (ν → 0). [sent-276, score-0.514]

62 Moreover, the gMDL criterion has an explicit analytical expression which depends only on the residual sum of squares and the model dimension or complexity. [sent-291, score-0.24]

63 C_gMDL(RSS, k) = log(S) + (k/n) log(F), with S = RSS/(n − k) and F = (∑_{i=1}^n Y_i^2 − RSS)/(kS). (14) Here, RSS denotes again the residual sum of squares as in formula (6) (first argument of the function C(·, ·)). [sent-294, score-0.168]

64 That is, we propose m̂ = argmin_{1≤m≤M} C_gMDL(RSS_m, trace(B_m)), (15) where M is a large number, RSS_m the residual sum of squares after m boosting iterations and B_m is the boosting operator described in (4). [sent-300, score-1.179]
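
A sketch of this stopping rule is given below. The gMDL score is coded as log(S) + (k/n) log(F) with S = RSS/(n − k) and F = (∑ Y_i^2 − RSS)/(kS), matching the description of a data-driven bridge between AIC and BIC; treat the exact constants as an assumption rather than a verbatim copy of formula (14).

```python
import numpy as np

def gmdl_score(rss, df, Y):
    """gMDL-type score (sketch): a data-driven penalty on the L2-loss."""
    n = len(Y)
    S = rss / (n - df)
    F = (np.sum(Y ** 2) - rss) / (df * S)
    return np.log(S) + (df / n) * np.log(F)

def gmdl_stopping_iteration(rss_path, df_path, Y):
    """Pick m minimizing the score, as in (15); rss_path[m-1] and df_path[m-1]
    hold RSS_m and trace(B_m) after m boosting iterations."""
    scores = [gmdl_score(r, d, Y) for r, d in zip(rss_path, df_path)]
    return int(np.argmin(scores)) + 1          # 1-based stopping iteration
```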

65 1), and we call such an automatically stopped boosting method gMDL-L2 Boosting. [sent-303, score-0.479]

66 The step-size in both boosting methods is fixed at ν = 0. [sent-311, score-0.479]

67 Finally, both boosting methods are essentially insensitive when increasing the Σ , dim. [sent-329, score-0.479]

68 Table 1: Mean squared error (MSE), E[(f̂(X) − f(X))^2] (f(x) = E[Y|X = x]), in model (16) for gMDL-SparseL2 Boost and gMDL early stopped L2 Boosting using the estimated stopping iteration m̂. [sent-378, score-0.179]

69 Moreover, for this model, SparseL2 Boost is a good model selector as long as the dimensionality is not very large, that is for p ∈ {50, 100}, while L2 Boosting is much worse selecting too many false predictors (that is too many false positives). [sent-475, score-0.142]

70 The gMDL-sel-L2 Boost method performs between the better and the worse of the two boosting algorithms, but closer to the better performer in each situation (the latter is only known for simulated data sets). [sent-532, score-0.479]

71 For model (19), there is essentially no degraded performance when doing a data-driven selection between the two boosting algorithms (in comparison to the best performer). [sent-533, score-0.549]

72 We used 10-fold cross-validation to estimate the out-of-sample squared prediction error and the average number of selected predictor variables. [sent-539, score-0.217]

73 Table 4: Boosting with componentwise linear least squares for ozone data with first order interactions (n = 330, p = 45). [sent-551, score-0.325]

74 Squared prediction error and average number of selected predictor variables using 10-fold cross-validation. [sent-552, score-0.16]

75 Table 5: Boosting with componentwise linear least squares for ozone data with first order interactions (n = 330, p = 45). [sent-559, score-0.325]

76 gMDL-score, n−1 × residual sum of squares (RSS) and number of selected terms (ℓ0 -norm). [sent-560, score-0.203]

77 We use L2 Boosting and SparseL2 Boost with componentwise linear least squares. [sent-572, score-0.171]

78 When using L2 - or SparseL2 Boost on Y ∈ {0, 1} directly, with an intercept term, we would obtain a shrunken boosting estimate of the intercept, introducing a bias rendering p̂(·) to be systematically too small. [sent-578, score-0.589]

79 Table 6: Boosting with componentwise linear least squares for tumor classification data (n = 46, p = 7129). [sent-590, score-0.321]

80 Table 8: Boosting with componentwise linear least squares for tumor classification (n = 49, p = 7129). [sent-611, score-0.321]

81 BIC score, n−1 × residual sum of squares (RSS) and number of selected terms (ℓ0 -norm). [sent-612, score-0.203]

82 SparseL2 Boost and L2 Boosting with a pairwise thin plate spline, which selects the best pair of predictor variables yielding lowest residual sum of squares (when having the same degrees of freedom d. [sent-620, score-0.413]

83 But the estimation of the boosting iterations by gMDL did not do as well as in section 3. [sent-626, score-0.498]

84 Thus, when SNR is high, the log(F) is high too, leading to too small models in both SparseL2 Boost and L2 Boosting: that is, this large penalty forces both SparseL2 Boost and L2 Boosting to stop too early in comparison to the oracle stopping iteration which minimizes MSE. [sent-630, score-0.143]

85 However, both boosting methods nevertheless are quite a bit better than MARS. [sent-631, score-0.479]

86 Thus, for lower signal to noise ratios, stopping the boosting iterations with the gMDL criterion works very well, and our SparseL2 Boost algorithm is much better than MARS. [sent-636, score-0.617]

87 Moreover, it is computationally feasible in high dimensions, for example having linear complexity in the number of predictor variables p when using componentwise linear least squares or componentwise smoothing splines (see section 2. [sent-663, score-0.615]

88 The idea of sparse boosting could also be transferred to boosting algorithms with other loss functions, leading to sparser variants of AdaBoost and LogitBoost. [sent-666, score-1.034]

89 In our framework, the boosting approach automatically comes with a reasonable notion for statistical complexity or degrees of freedom, namely the trace of the boosting operator when it can be expressed in hat matrix form. [sent-672, score-1.164]

90 This trace complexity is well defined for many popular base procedures (weak learners) including componentwise linear least squares and decision trees, see also section 2. [sent-673, score-0.452]

91 We represent the componentwise linear least squares base procedure as a hat operator H_Ŝ with H^(j) = x^(j)(x^(j))^T, where x^(j) = (x_1^(j), . . . , x_n^(j))^T. [sent-682, score-0.507]

92 The L2 Boosting operator in iteration m is then given by the matrix B_m = I − (I − νH^(1))^{m_1}(I − νH^(2))^{m_2} · · · (I − νH^(n))^{m_n}, where m_i equals the number of times that the ith predictor variable has been selected during the m boosting iterations; and hence m = ∑_{i=1}^n m_i. [sent-687, score-0.918]
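
In this orthonormal setting the operator is diagonal in the Z-coordinates, so the coefficient of a predictor selected m_i times is shrunken geometrically towards its OLS value. A short sketch of that consequence of the product form (our own illustration):

```python
import numpy as np

def orthonormal_boost_coefficients(Z, counts, nu=0.1):
    """In the orthonormal case, after predictor i has been selected m_i times,
    its boosting coefficient is (1 - (1 - nu)**m_i) * Z_i, i.e. the OLS value
    Z_i shrunken geometrically towards zero by the repeated small steps."""
    counts = np.asarray(counts)
    return (1.0 - (1.0 - nu) ** counts) * Z
```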

93 Therefore, the residual sum of squares in the mth boosting iteration is: RSS_m = ||Y − B_m Y||^2 = ||X^T Y − X^T B_m Y||^2 = ||Z − D_m Z||^2 = ||(I − D_m)Z||^2, where Z = X^T Y. [sent-693, score-0.682]

94 Thus, every stopping of boosting with an iteration number m corresponds to a tolerance δ^2 such that RSS_k − RSS_{k+1} > δ^2, k = 1, 2, . [sent-696, score-0.571]

95 Since L2 Boosting changes only one of the summands in RSS_m in the boosting iteration m + 1, the criterion in (22) implies that for all i ∈ {1, . [sent-700, score-0.556]

96 Since m_i = 0 happens only if |Z_i| ≤ δ/√(1 − (1 − ν)^2), we can write the estimator as β̂^(m)_{Boost,i} ≈ Z_i − λ if Z_i ≥ λ, ≈ 0 if |Z_i| < λ, ≈ Z_i + λ if Z_i ≤ −λ, where λ = δ/√(1 − (1 − ν)^2). [sent-710, score-0.171]

97 This also implies uniqueness of the iteration m such that (28) holds or of the mi such that (29) holds. [sent-738, score-0.15]

98 The right-hand side is the nonnegative garrote estimator as in (12) with threshold γ_n/(2 − ν). [sent-742, score-0.143]

99 Additive logistic regression: a statistical view of boosting (with discussion). [sent-828, score-0.479]

100 On the Bayes-risk consistency of regularized boosting methods (with discussion). [sent-862, score-0.479]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('boost', 0.63), ('boosting', 0.479), ('gmdl', 0.194), ('garrote', 0.181), ('componentwise', 0.171), ('rss', 0.133), ('zi', 0.129), ('mi', 0.115), ('rssm', 0.114), ('squares', 0.114), ('predictor', 0.105), ('oosting', 0.089), ('hlmann', 0.087), ('uhlmann', 0.087), ('base', 0.086), ('hat', 0.081), ('selector', 0.073), ('parse', 0.068), ('yu', 0.068), ('nonnegative', 0.066), ('breiman', 0.062), ('lasso', 0.06), ('trace', 0.059), ('stopping', 0.057), ('fm', 0.057), ('squared', 0.057), ('estimator', 0.056), ('residual', 0.054), ('aic', 0.053), ('bic', 0.052), ('sparser', 0.051), ('mse', 0.049), ('hansen', 0.049), ('residuals', 0.049), ('nngar', 0.048), ('criterion', 0.042), ('ozone', 0.04), ('selection', 0.04), ('predictors', 0.039), ('spline', 0.038), ('fpe', 0.038), ('msbm', 0.038), ('shrunken', 0.038), ('efron', 0.036), ('tumor', 0.036), ('cv', 0.036), ('intercept', 0.036), ('selected', 0.035), ('friedman', 0.035), ('orthogonal', 0.035), ('iteration', 0.035), ('regression', 0.034), ('operator', 0.034), ('xt', 0.034), ('freedom', 0.033), ('ineffective', 0.032), ('degrees', 0.032), ('splines', 0.031), ('response', 0.031), ('model', 0.03), ('vi', 0.029), ('plate', 0.029), ('fslr', 0.029), ('mars', 0.029), ('oost', 0.029), ('sparseboo', 0.029), ('thin', 0.027), ('penalty', 0.027), ('mallat', 0.027), ('nonparametric', 0.025), ('yi', 0.025), ('sparse', 0.025), ('penalized', 0.025), ('tsch', 0.025), ('genes', 0.025), ('sparsity', 0.025), ('gs', 0.025), ('green', 0.025), ('sel', 0.024), ('ordinary', 0.024), ('oracle', 0.024), ('ei', 0.023), ('smoothing', 0.023), ('procedures', 0.022), ('ols', 0.022), ('procedure', 0.021), ('breast', 0.021), ('dm', 0.021), ('forward', 0.02), ('interaction', 0.02), ('prediction', 0.02), ('silverman', 0.02), ('bagging', 0.02), ('snr', 0.02), ('signal', 0.02), ('yielding', 0.019), ('iterations', 0.019), ('caption', 0.019), ('cgmdl', 0.019), ('datadriven', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 83 jmlr-2006-Sparse Boosting

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

2 0.13582039 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai

Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate

3 0.05609281 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

4 0.046229638 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

Author: Shai Shalev-Shwartz, Yoram Singer

Abstract: We discuss the problem of learning to rank labels from a real valued feedback associated with each label. We cast the feedback as a preferences graph where the nodes of the graph are the labels and edges express preferences over labels. We tackle the learning problem by defining a loss function for comparing a predicted graph with a feedback graph. This loss is materialized by decomposing the feedback graph into bipartite sub-graphs. We then adopt the maximum-margin framework which leads to a quadratic optimization problem with linear constraints. While the size of the problem grows quadratically with the number of the nodes in the feedback graph, we derive a problem of a significantly smaller size and prove that it attains the same minimum. We then describe an efficient algorithm, called SOPOPO, for solving the reduced problem by employing a soft projection onto the polyhedron defined by a reduced set of constraints. We also describe and analyze a wrapper procedure for batch learning when multiple graphs are provided for training. We conclude with a set of experiments which show significant improvements in run time over a state of the art interior-point algorithm.

5 0.041754279 88 jmlr-2006-Streamwise Feature Selection

Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar

Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate

6 0.038321815 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity     (Special Topic on Machine Learning and Optimization)

7 0.037429769 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

8 0.035862602 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

9 0.0345609 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

10 0.034257066 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

11 0.034123205 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

12 0.033620667 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

13 0.029879188 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

14 0.029782202 75 jmlr-2006-Policy Gradient in Continuous Time

15 0.028303508 12 jmlr-2006-Active Learning with Feedback on Features and Instances

16 0.028301232 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

17 0.028107043 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

18 0.027631838 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

19 0.027438914 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

20 0.027074656 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.153), (1, -0.017), (2, -0.005), (3, -0.092), (4, -0.062), (5, 0.109), (6, -0.069), (7, 0.048), (8, 0.058), (9, -0.031), (10, -0.17), (11, 0.142), (12, -0.012), (13, 0.128), (14, 0.103), (15, 0.06), (16, -0.043), (17, 0.1), (18, 0.232), (19, -0.025), (20, 0.245), (21, 0.116), (22, -0.227), (23, -0.017), (24, -0.027), (25, -0.154), (26, 0.075), (27, -0.204), (28, -0.174), (29, 0.132), (30, 0.012), (31, 0.104), (32, 0.153), (33, 0.001), (34, 0.032), (35, -0.039), (36, -0.023), (37, -0.116), (38, 0.022), (39, 0.099), (40, 0.016), (41, 0.011), (42, -0.043), (43, 0.079), (44, 0.089), (45, -0.042), (46, 0.032), (47, -0.238), (48, -0.149), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9752298 83 jmlr-2006-Sparse Boosting

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

2 0.48578668 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai

Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate

3 0.38661644 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

4 0.29213846 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity     (Special Topic on Machine Learning and Optimization)

Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste

Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design

5 0.24771217 88 jmlr-2006-Streamwise Feature Selection

Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar

Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate

6 0.24687222 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

7 0.24494138 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

8 0.19664091 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

9 0.184486 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

10 0.18265872 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

11 0.16969851 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

12 0.15228897 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

13 0.14533079 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis

14 0.13079022 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

15 0.12534776 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

16 0.12446517 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals

17 0.12248079 45 jmlr-2006-Learning Coordinate Covariances via Gradients

18 0.12095882 49 jmlr-2006-Learning Parts-Based Representations of Data

19 0.11976916 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

20 0.11763787 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.026), (11, 0.01), (14, 0.395), (35, 0.02), (36, 0.075), (45, 0.031), (50, 0.057), (63, 0.029), (76, 0.027), (78, 0.029), (81, 0.026), (84, 0.025), (90, 0.041), (91, 0.034), (96, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77416897 83 jmlr-2006-Sparse Boosting

Author: Peter Bühlmann, Bin Yu

Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression

2 0.52052909 25 jmlr-2006-Distance Patterns in Structural Similarity

Author: Thomas Kämpke

Abstract: Similarity of edge labeled graphs is considered in the sense of minimum squared distance between corresponding values. Vertex correspondences are established by isomorphisms if both graphs are of equal size and by subisomorphisms if one graph has fewer vertices than the other. Best fit isomorphisms and subisomorphisms amount to solutions of quadratic assignment problems and are computed exactly as well as approximately by minimum cost flow, linear assignment relaxations and related graph algorithms. Keywords: assignment problem, best approximation, branch and bound, inexact graph matching, model data base

3 0.35247943 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

Author: Peter J. Bickel, Ya'acov Ritov, Alon Zakai

Abstract: We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude by some results on the statistical optimality of the L2 boosting. Keywords: classification, Gauss-Southwell algorithm, AdaBoost, cross-validation, non-parametric convergence rate

4 0.30855998 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

5 0.30352038 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

6 0.29982239 70 jmlr-2006-Online Passive-Aggressive Algorithms

7 0.29782346 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

8 0.29499727 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

9 0.29478204 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

10 0.29199976 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

11 0.29053712 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

12 0.28865579 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

13 0.2874347 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

14 0.28737009 66 jmlr-2006-On Model Selection Consistency of Lasso

15 0.28630358 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

16 0.28630355 53 jmlr-2006-Learning a Hidden Hypergraph

17 0.28626028 48 jmlr-2006-Learning Minimum Volume Sets

18 0.28548518 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

19 0.2848807 44 jmlr-2006-Large Scale Transductive SVMs

20 0.28477991 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)