jmlr jmlr2007 jmlr2007-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulting sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, CA 94720-3860, USA. Editor: Saharon Rosset. Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. [sent-5, score-0.112]
2 When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. [sent-7, score-0.157]
3 BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). [sent-10, score-0.311]
4 It consists of both a forward step and a backward step. [sent-11, score-0.188]
5 The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. [sent-12, score-0.265]
6 In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. [sent-13, score-0.389]
7 Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. [sent-15, score-0.128]
8 Since the (Generalized) BLasso relies only on differences not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems. [sent-16, score-0.124]
9 Keywords: backward step, boosting, convexity, Lasso, regularization path 1. [sent-17, score-0.191]
10 Introduction Many statistical machine learning algorithms minimize either an empirical loss function or a penalized empirical loss, in a regression or a classification setting where covariate or predictor variables are used to predict a response variable and i. [sent-18, score-0.106]
11 AdaBoost minimizes an exponential loss function of the margin, while SVM minimizes a penalized hinge loss function of the margin. [sent-24, score-0.123]
12 It minimizes the L2 loss function with an L1 penalty on the parameters in a linear regression model. [sent-46, score-0.108]
13 The L1 penalty leads to sparse solutions, that is, there are few predictors or basis functions with nonzero weights (among all possible choices). [sent-48, score-0.105]
14 BLasso generates approximately the Lasso path in general situations, for both regression and classification, for an L1 penalized convex loss function. [sent-71, score-0.17]
15 This step uses the same minimization rule as the forward step to define each fitting stage but with an extra rule to force the model complexity or L1 penalty to decrease. [sent-76, score-0.189]
16 By combining backward and forward steps, BLasso is able to go back and forth to approximate the Lasso path correctly. [sent-77, score-0.233]
17 This way BLasso can deal with different loss functions and large classes of base learners like trees, wavelets and splines by fitting a base learner at each step and aggregating the base learners as the algorithm progresses. [sent-81, score-0.263]
18 Moreover, the solution path of BLasso can be shown to converge to that of the Lasso, which uses explicit global L1 regularization, for cases with a finite number of base learners. [sent-82, score-0.125]
19 In contrast, e-Boosting or FSF can be seen as local regularization in the sense that at any iteration, FSF with a fixed small stepsize only searches over those models which are one small step away from the current one in all possible directions corresponding to the base learners or predictors (cf. [sent-83, score-0.325]
20 First, by introducing the backward step we modify e-Boosting or FSF to follow the Lasso path and consequently generate models that are sparser with equivalent or slightly better prediction performance in our simulations (with true sparse models and more predictors than the sample size). [sent-87, score-0.254]
21 For such problems, BLasso operates in a similar fashion to FSF or e-Boosting but, unlike FSF, BLasso can be shown to converge to the Lasso path quite generally when the stepsize goes to zero. [sent-96, score-0.213]
22 In Section 2.1, the gradient view of Boosting is provided and FSF (e-Boosting) is reviewed as a coordinatewise descent method with a fixed stepsize on the L2 loss. [sent-99, score-0.217]
23 Section 3 introduces BLasso, which is a coordinate descent algorithm with a fixed stepsize applied to the Lasso minimization problem. [sent-102, score-0.227]
24 Section 4 discusses the backward step and gives the intuition behind BLasso and explains why FSF is unable to give the Lasso path. [sent-103, score-0.123]
25 BLasso is shown as a learning algorithm that gives sparse models and good prediction and as a simple plug-in method for approximating the regularization path for different convex loss functions and penalties. [sent-106, score-0.181]
26 Moreover, we compare different choices of the stepsize and give evidence for the regularization effect of using moderate stepsizes. [sent-107, score-0.2]
27 More recently, boosting has been interpreted as a gradient descent algorithm on an empirical loss function. [sent-112, score-0.115]
28 FSF or e-Boosting can be viewed as a gradient descent with a fixed small stepsize at each stage and it produces solutions that are often close to the Lasso solutions (path). [sent-113, score-0.252]
29 Despite the fact that the empirical loss function is often convex in β, exact minimization is usually a formidable task for a moderately rich function family of base learners, and with such function families exact minimization leads to overfitted models. [sent-130, score-0.183]
30 Because the family of base learners is usually large, Boosting can be viewed as finding approximate solutions by applying functional gradient descent. [sent-131, score-0.122]
31 Instead of optimizing the stepsize as in (2), FSF updates β̂_t by a fixed stepsize ε as in Friedman (2001). [sent-142, score-0.314]
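As a concrete reference point, here is a minimal numpy sketch of FSF (e-Boosting) for the squared loss on a fixed design matrix; the names (X, y, eps, n_steps) and the standardized-columns assumption are ours for illustration, not a statement of the authors' implementation.

```python
import numpy as np

def fsf_path(X, y, eps=0.01, n_steps=500):
    """Forward Stagewise Fitting (e-Boosting) sketch for the squared loss.

    Each stage takes a fixed-size step +/- eps on the single coordinate that
    most reduces the residual sum of squares. With standardized columns this
    is the coordinate with the largest absolute inner product with the residual.
    """
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        resid = y - X @ beta
        corr = X.T @ resid                   # inner products with the residual
        j = int(np.argmax(np.abs(corr)))     # steepest coordinate direction
        beta[j] += eps * np.sign(corr[j])    # fixed stepsize, no line search
        path.append(beta.copy())
    return np.array(path)
```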
32 Very efficient algorithms have been proposed to give the entire regularization path for the squared loss function (the homotopy method by Osborne et al., 2000, and the LARS algorithm by Efron et al., 2004). [sent-184, score-0.159]
33 However, it remains open how to give the entire regularization path of the Lasso problem for a general convex loss function. [sent-188, score-0.148]
34 FSF exists as a compromise since, like Boosting, it is a nonparametric learning algorithm that works with different loss functions and large numbers of base learners (predictors), but it performs only local regularization and does not converge to the Lasso path in general. [sent-189, score-0.23]
35 In contrast to FSF, BLasso converges to the Lasso path for general convex loss functions when the stepsize goes to 0. [sent-194, score-0.272]
36 This algorithm has two related input parameters, a stepsize ε and a tolerance level ξ. [sent-198, score-0.178]
37 We will discuss forward and backward steps in depth in the next section. [sent-201, score-0.187]
38 Given data Z_i, i = 1, ..., n, a small stepsize constant ε > 0 and a small tolerance parameter ξ > 0, take an initial forward step (ĵ, s_ĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; s·1_j), β̂_0 = s_ĵ·1_ĵ. Then calculate the initial regularization parameter λ_0 = (1/ε)(∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; β̂_0)). [sent-217, score-0.31]
39 Find the backward direction within the active set I_A: (ĵ, s_ĵ) = arg min_{j∈I_A} ∑_{i=1}^n L(Z_i; β̂_t + s_j·1_j), where s_j = −ε·sign(β̂_t,j). (5) Take the step if it leads to a decrease of moderate size ξ in the Lasso loss; otherwise force a forward step (as in (3), (4) of FSF) and relax λ if necessary: if Γ(β̂_t + s_ĵ·1_ĵ; λ_t) − Γ(β̂_t; λ_t) ≤ −ξ, then β̂_{t+1} = β̂_t + s_ĵ·1_ĵ and λ_{t+1} = λ_t. [sent-222, score-0.126]
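For illustration, a minimal numpy sketch of BLasso specialized to the squared loss follows. The helper names and the exact way the tolerance ξ enters the λ relaxation are a simplified reading of the steps described above, not the authors' reference implementation.

```python
import numpy as np

def blasso_path_l2(X, y, eps=0.01, xi=1e-6, max_steps=2000):
    """Sketch of BLasso for the L1-penalized squared loss (Lasso)."""
    n, p = X.shape
    loss = lambda b: np.sum((y - X @ b) ** 2)                # sum_i L(Z_i; b)
    gamma = lambda b, lam: loss(b) + lam * np.abs(b).sum()   # Lasso loss

    # Initial forward step: best single-coordinate move of size +/- eps.
    unit = np.eye(p)
    cand0 = [(loss(s * eps * unit[j]), j, s * eps) for j in range(p) for s in (+1, -1)]
    _, j0, s0 = min(cand0)
    beta = np.zeros(p)
    beta[j0] = s0
    lam = (loss(np.zeros(p)) - loss(beta)) / eps             # initial lambda

    betas, lams = [beta.copy()], [lam]
    for _ in range(max_steps):
        # Backward candidate: shrink the active coordinate whose shrinkage
        # hurts the empirical loss the least.
        best_back, best_back_loss = None, np.inf
        for j in np.flatnonzero(beta):
            b = beta.copy()
            b[j] -= eps * np.sign(beta[j])
            if loss(b) < best_back_loss:
                best_back, best_back_loss = b, loss(b)
        if best_back is not None and gamma(best_back, lam) - gamma(beta, lam) < -xi:
            beta = best_back                                 # backward step, keep lambda
        else:
            # Forward step: best +/- eps coordinate move for the empirical loss.
            fwd = [(loss(beta + s * eps * unit[j]), j, s * eps)
                   for j in range(p) for s in (+1, -1)]
            new_loss, jf, sf = min(fwd)
            # Relax lambda when the forward step no longer descends the Lasso loss.
            lam = min(lam, (loss(beta) - new_loss - xi) / eps)
            beta = beta + sf * unit[jf]
        beta[np.abs(beta) < 0.5 * eps] = 0.0                 # snap grid round-off to zero
        betas.append(beta.copy())
        lams.append(lam)
        if lam <= 0:
            break
    return np.array(betas), np.array(lams)
```

The returned λ sequence is non-increasing, so the pair (betas, lams) traces an approximate regularization path from the most regularized model onward.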
40 The stepsize ε controls the fineness of the grid that BLasso runs on. [sent-229, score-0.166]
41 The tolerance ξ controls how large a descent needs to be for a backward step to be taken. [sent-230, score-0.162]
42 For a finite number of base learners and ξ = o(ε), if ∑ L(Z_i; β) is strongly convex with bounded second derivatives in β, then as ε → 0, the BLasso path converges to the Lasso path uniformly. [sent-233, score-0.243]
43 This is because if all the optimal coefficient paths are monotone, then BLasso will never take a backward step, so it will be equivalent to e-Boosting. [sent-236, score-0.13]
44 Theorem 1 does not cover nonparametric learning problems with an infinite number of base learners either. [sent-246, score-0.107]
45 In fact, for problems with a large or infinite number of base learners, the minimization in (6) is usually done approximately by functional gradient descent, and a tolerance ξ > 0 needs to be chosen to avoid oscillation between forward and backward steps caused by slow descent. [sent-247, score-0.324]
46 Without the backward step, FSF picks up more irrelevant variables compared to the Lasso path in some cases (cf. [sent-254, score-0.158]
47 As seen below, this backward step naturally arises in BLasso because of our coordinate descent view of the minimization of the Lasso loss. [sent-257, score-0.193]
48 Therefore the first step of BLasso is a forward step: since every candidate first step increases the L1 penalty by the same amount ε, minimizing the Lasso loss is equivalent to minimizing the L2 loss. [sent-265, score-0.175]
49 In general, to minimize the Lasso loss, one needs to go “back and forth” to trade off the penalty with the empirical loss for different regularization parameters. [sent-269, score-0.13]
50 To be precise, for a given β̂, a backward step is such that: Δβ = s·1_j, subject to β̂_j ≠ 0, sign(s) = −sign(β̂_j) and |s| = ε. [sent-270, score-0.123]
51 While forward steps try to reduce the Lasso loss through minimizing the empirical loss, the backward steps try to reduce the Lasso loss through minimizing the Lasso penalty. [sent-272, score-0.275]
52 In summary, by allowing the backward steps, we are able to work with the Lasso loss directly and take backward steps to correct earlier forward steps that might have picked up irrelevant variables. [sent-273, score-0.343]
53 It is straightforward to see that in least squares (LS) problems both forward and backward steps in BLasso are based only on the correlations between fitted residuals and the covariates (predictors). [sent-279, score-0.234]
54 It follows that BLasso in this case reduces to finding the best direction in both forward and backward steps by examining the inner-products, and then deciding whether to go forward or backward based on the regularization parameter. [sent-280, score-0.387]
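To make the least-squares special case concrete, the sketch below (assuming standardized columns so the ‖X_j‖² terms are common across coordinates) shows that both direction choices need only the inner products between the current residual and the covariates; the function name and return convention are ours.

```python
import numpy as np

def step_directions_ls(X, y, beta, eps):
    """Candidate forward and backward directions for BLasso under squared loss."""
    r = y - X @ beta
    c = X.T @ r                                   # inner products with the residual
    j_fwd = int(np.argmax(np.abs(c)))             # forward: largest |correlation|
    s_fwd = eps * np.sign(c[j_fwd])
    active = np.flatnonzero(beta)
    j_bwd = None
    if active.size:
        # A backward move s_j = -eps*sign(beta_j) changes the squared loss by
        # 2*eps*sign(beta_j)*c_j + eps^2*||X_j||^2, so with standardized columns
        # the best backward coordinate minimizes sign(beta_j)*c_j.
        j_bwd = int(active[np.argmin(np.sign(beta[active]) * c[active])])
    return (j_fwd, s_fwd), j_bwd
```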
55 In nonparametric situations, the number of base learners is large; therefore the aforementioned strategy becomes inefficient. [sent-284, score-0.107]
56 For the backward step, only inner-products between base learners that have entered the model need to be calculated. [sent-286, score-0.188]
57 This makes the backward steps’ computational complexity proportional to the number of base learners that are already chosen instead of the number of all possible base learners. [sent-288, score-0.224]
58 For nonparametric learning problems with a large or an infinite number of base learners, we believe BLasso is an attractive strategy for approximating the path of the Lasso, as it shares the same computational strategy as Boosting which has proven itself successful in applications. [sent-294, score-0.123]
59 For the Lasso problem, BLasso does a fixed stepsize coordinate descent to minimize the penalized loss. [sent-298, score-0.252]
60 Since the penalty is the special L1 norm and (7) holds, a step’s impact on the penalty has a fixed size ε with either a positive or a negative sign, and the coordinate descent takes the form of “backward” and “forward” steps. [sent-299, score-0.177]
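To spell out why each coordinate step moves the penalty by exactly ±ε (assuming the backward move does not overshoot zero, i.e. ε ≤ |β_j|, which holds automatically on the ε-grid), the Lasso loss difference decomposes as:

```latex
\|\beta + s\mathbf{1}_j\|_1 - \|\beta\|_1 =
\begin{cases}
  +\varepsilon, & \text{forward: } \operatorname{sign}(s) = \operatorname{sign}(\beta_j) \text{ or } \beta_j = 0,\\[2pt]
  -\varepsilon, & \text{backward: } \operatorname{sign}(s) = -\operatorname{sign}(\beta_j),\ |s| = \varepsilon \le |\beta_j|,
\end{cases}
\qquad
\Gamma(\beta + s\mathbf{1}_j;\lambda) - \Gamma(\beta;\lambda)
  = \Big[\textstyle\sum_i L(Z_i;\beta + s\mathbf{1}_j) - \sum_i L(Z_i;\beta)\Big] \pm \varepsilon\lambda .
```

So at a fixed λ, minimizing the Lasso loss over a single ±ε coordinate move only requires comparing empirical-loss differences shifted by ±ελ.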
61 This reduces the minimization of the penalized loss function to unregularized minimizations of the loss function as in (6) and (5). [sent-300, score-0.145]
62 However, the mechanism remains the same—minimize the penalized loss function for each λ, relax the regularization by reducing λ through a “forward” step when the minimum of the loss function for the current λ is reached. [sent-305, score-0.175]
63 Given data Z_i, i = 1, ..., n, a fixed small stepsize ε > 0 and a small tolerance parameter ξ ≥ 0, take an initial forward step (ĵ, s_ĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; s·1_j), β̂_0 = s_ĵ·1_ĵ. [sent-316, score-0.264]
64 Find the steepest coordinate descent direction on the penalized loss: (ĵ, s_ĵ) = arg min_{j, s=±ε} Γ(β̂_t + s·1_j; λ_t). [sent-320, score-0.108]
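A minimal sketch of the Generalized BLasso loop, written against arbitrary convex loss and penalty callables, is given below; the function signature, the break conditions, and the way ξ enters the λ update are our simplifications for illustration, not a verbatim transcription of the algorithm.

```python
import numpy as np

def generalized_blasso(loss, penalty, p, eps=0.01, xi=1e-6, max_steps=1000):
    """Fixed-stepsize coordinate descent on Gamma(b; lam) = loss(b) + lam*penalty(b),
    using only differences of loss and penalty (no derivatives)."""

    def candidates(b):
        for j in range(p):
            for s in (+eps, -eps):
                nb = b.copy()
                nb[j] += s
                yield nb

    zero = np.zeros(p)
    beta = min(candidates(zero), key=loss)                   # initial forward step
    # Assumes the penalty strictly increases at the first step (true for L1, Lq, ...).
    lam = (loss(zero) - loss(beta)) / (penalty(beta) - penalty(zero))

    path = [(lam, beta.copy())]
    for _ in range(max_steps):
        gamma = lambda b: loss(b) + lam * penalty(b)
        best = min(candidates(beta), key=gamma)              # steepest coordinate move
        if gamma(best) - gamma(beta) < -xi:
            beta = best                                      # descend at current lambda
        else:
            fwd = min(candidates(beta), key=loss)            # forward step on raw loss
            dpen = penalty(fwd) - penalty(beta)
            if dpen <= 0:
                break                                        # cannot relax lambda further
            lam = min(lam, (loss(beta) - loss(fwd) - xi) / dpen)
            beta = fwd
        path.append((lam, beta.copy()))
        if lam <= 0:
            break
    return path
```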
65 BLasso’s path is the same as the Lasso path, which shrinks the added irrelevant variable back to zero, while FSF’s path departs drastically from Lasso’s due to the added strongly correlated covariate and does not move it back to zero. [sent-326, score-0.17]
66 Moreover, we find that when the stepsize increases, there is a regularization effect in terms of both prediction and sparsity, for both BLasso and FSF. [sent-330, score-0.199]
67 2711 Z HAO AND Y U The last experiment is to illustrate BLasso as an off-the-shelf method for computing the regularization path for general convex loss functions and general convex penalties. [sent-331, score-0.173]
68 The middle panel of Figure 1 shows the coefficient path plot for BLasso applied to the modified diabetes data. [sent-364, score-0.104]
69 (The stepsize ε = 0.5 was chosen small so that the coefficient paths are smooth); 840 of the steps taken were backward steps. [sent-368, score-0.13]
70 BLasso’s backward steps concentrate mainly around the steps where FSF and BLasso tend to differ. [sent-370, score-0.142]
71 Left Panel: Lasso solution paths (produced using a simplex search method on the penalized empirical loss function for each λ) as a function of t = ‖β‖_1. [sent-377, score-0.106]
72 This overwhelming pattern of significant negative difference suggests that, for this simulation, BLasso is better than Lasso and FSF in terms of both prediction and sparsity unless the stepsize is very small as in the last two columns. [sent-718, score-0.188]
73 Moreover, for MSE the stepsize ε = 1/10 seems to bring the best improvement of BLasso over Lasso, and the improvement is pretty robust against the choice of stepsize. [sent-719, score-0.157]
74 Hence these improvements reflect the gains from the backward steps alone, since FSF also takes forward steps. [sent-721, score-0.187]
75 It considers a linear (L2) regression problem with an Lγ penalty for γ ≥ 1 (to maintain the convexity of the penalty function). [sent-737, score-0.137]
76 To demonstrate the Generalized BLasso algorithm for classification using a nondifferentiable loss function with an L1 penalty function, we look at binary classification with the hinge loss. [sent-744, score-0.122]
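As an illustration of such plug-ins, the callables below (hinge loss for binary classification with labels in {−1, +1}, and an Lγ bridge penalty) can be passed to the generalized_blasso sketch given earlier; the names and defaults are ours, not from the paper.

```python
import numpy as np

def make_hinge_loss(X, y):
    """Empirical hinge loss sum_i max(0, 1 - y_i * x_i'beta), with y_i in {-1, +1}."""
    return lambda beta: np.sum(np.maximum(0.0, 1.0 - y * (X @ beta)))

def make_lgamma_penalty(gamma_exp):
    """Bridge penalty sum_j |beta_j|^gamma; gamma_exp >= 1 keeps it convex."""
    return lambda beta: np.sum(np.abs(beta) ** gamma_exp)

# e.g.: path = generalized_blasso(make_hinge_loss(X, y), make_lgamma_penalty(1.0),
#                                 p=X.shape[1], eps=0.05)
```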
77 The behavior relative to Lasso is due to the discrete stepsize of BLasso, while the behavior relative to FSF is partially explained by its convergence to the Lasso path as the stepsize goes to 0. [sent-794, score-0.37]
78 The Generalized BLasso algorithm is also effective as an off-the-shelf method for general convex penalized loss minimization problems. [sent-801, score-0.122]
79 Depending on the loss function, base learners, and minimization method used in each step, the actual computational complexity varies. [sent-803, score-0.139]
80 As shown in the simulations, choosing a smaller stepsize gives a smoother solution path, but it does not guarantee better prediction. [sent-804, score-0.213]
81 For the diabetes data, using a moderate stepsize of ε = 0.5 already gives a close approximation to the Lasso solutions. [sent-809, score-0.193]
82 Moreover, even when the stepsize is as large as ε = 10 and ε = 50, the solutions are still good approximations. [sent-811, score-0.179]
83 BLasso has only one stepsize parameter (with the exception of the numerical tolerance ξ which is implementation specific but not necessarily a user parameter). [sent-812, score-0.186]
84 This parameter controls both how closely BLasso approximates the minimization coefficients for each λ and how closely two adjacent values of λ on the regularization path are placed. [sent-813, score-0.117]
85 As can be seen from Figure 5, a smaller stepsize leads to a closer approximation to the solutions and also finer grids for λ. [sent-814, score-0.179]
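A small usage sketch of the blasso_path_l2 function from earlier, run with several stepsizes on synthetic data, illustrates this trade-off; the data-generating numbers are arbitrary and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                 # sparse true model
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for eps in (0.5, 0.1, 0.02):
    betas, lams = blasso_path_l2(X, y, eps=eps, max_steps=5000)
    # Smaller eps: more steps, a finer lambda grid, a smoother coefficient path.
    print(f"eps={eps:4.2f}  steps={len(betas):5d}  "
          f"lambda grid from {lams.max():.1f} down to {lams.min():.2f}")
```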
86 The algorithm assumes twice-differentiability of both the loss function and the penalty function and involves calculation of the Hessian matrix, which could be computationally heavy when the number p of covariates is not small. [sent-820, score-0.135]
87 BLasso’s stepsize is defined in the original parameter space, which makes the solutions evenly spread in the space of β rather than in λ. [sent-825, score-0.179]
88 On the other hand, when the model is close to empty and the penalty function is very small, λ is very large, but the algorithm still uses the same small steps; thus computation is spent generating solutions that are too close to each other. [sent-827, score-0.105]
89 Therefore, the backward step’s computational complexity is limited because it only involves base learners that have already been included in previous steps. [sent-838, score-0.188]
90 There is, however, a difference in the BLasso algorithm between the case with a small number of base learners and that with a large or an infinite number of base learners. [sent-839, score-0.122]
91 For the finite case, BLasso avoids oscillation by requiring a backward step to be strictly descending and by relaxing λ whenever no descending step is available. [sent-840, score-0.163]
92 In the nonparametric learning case, a different kind of oscillation can occur when BLasso keeps going back and forth in different directions while improving the penalized loss function only by a diminishing amount; therefore a positive tolerance ξ is mandatory. [sent-842, score-0.13]
93 Since BLasso has both forward and backward steps, we believe that an adaptive online learning algorithm can be devised based on BLasso so that it goes back and forth to track the best regularization parameter and the corresponding model. [sent-845, score-0.21]
94 By combining both forward and backward steps, the BLasso algorithm is constructed to minimize an L1 penalized convex loss function. [sent-847, score-0.27]
95 The backward steps introduced in this paper are critical for producing the Lasso path. [sent-851, score-0.122]
96 Without them, the FSF algorithm in general does not produce Lasso solutions, especially when the base learners are strongly correlated as in cases where the number of base learners is larger than the number of observations. [sent-852, score-0.193]
97 We generalized BLasso as a simple, easy-to-implement, plug-in method for approximating the regularization path for other convex penalties. [sent-855, score-0.135]
98 Since a backward step is only taken when Γ(β̂_{t+1}; λ_t) < Γ(β̂_t; λ_t) − ξ and λ_{t+1} = λ_t, we only need to consider forward steps. [sent-875, score-0.188]
99 Theorem 1 claims that “the BLasso path converges to the Lasso path uniformly” for ∑ L(Z; β) that is strongly convex with bounded second derivatives in β. [sent-883, score-0.157]
100 The sparsity and bias of the lasso selection in high dimensional linear regression. [sent-1117, score-0.337]
wordName wordTfidf (topN-words)
[('blasso', 0.772), ('fsf', 0.448), ('lasso', 0.315), ('stepsize', 0.157), ('backward', 0.102), ('mse', 0.07), ('forward', 0.065), ('penalty', 0.063), ('zi', 0.058), ('path', 0.056), ('asso', 0.055), ('tagewise', 0.055), ('hao', 0.051), ('learners', 0.05), ('penalized', 0.044), ('covariates', 0.038), ('descent', 0.037), ('base', 0.036), ('loss', 0.034), ('regularization', 0.033), ('stepsizes', 0.032), ('tj', 0.031), ('osborne', 0.031), ('efron', 0.03), ('boosting', 0.03), ('stagewise', 0.029), ('predictors', 0.028), ('homotopy', 0.028), ('paths', 0.028), ('diabetes', 0.026), ('rosset', 0.025), ('convex', 0.025), ('sparser', 0.024), ('fitting', 0.024), ('solutions', 0.022), ('sparsity', 0.022), ('panel', 0.022), ('tolerance', 0.021), ('nonparametric', 0.021), ('bridge', 0.021), ('step', 0.021), ('steps', 0.02), ('minimization', 0.019), ('donoho', 0.019), ('lars', 0.018), ('meinshausen', 0.017), ('covariate', 0.017), ('sign', 0.017), ('penalties', 0.016), ('simulation', 0.016), ('zhu', 0.015), ('gradient', 0.014), ('nondifferentiable', 0.014), ('unregularized', 0.014), ('zhao', 0.014), ('sparse', 0.014), ('coordinate', 0.014), ('squares', 0.014), ('svm', 0.013), ('arg', 0.013), ('adaboost', 0.012), ('strongly', 0.012), ('coef', 0.012), ('hinge', 0.011), ('generalized', 0.011), ('hastie', 0.011), ('yu', 0.011), ('regression', 0.011), ('buhlmann', 0.01), ('oscillation', 0.01), ('tted', 0.01), ('approximating', 0.01), ('cients', 0.01), ('recovery', 0.01), ('forth', 0.01), ('moderate', 0.01), ('added', 0.01), ('ia', 0.009), ('prediction', 0.009), ('correlated', 0.009), ('annal', 0.009), ('attractiveness', 0.009), ('candes', 0.009), ('coordinatewise', 0.009), ('descend', 0.009), ('eboosting', 0.009), ('gedeon', 0.009), ('guilherme', 0.009), ('mechanic', 0.009), ('relax', 0.009), ('controls', 0.009), ('tuning', 0.009), ('residuals', 0.009), ('tting', 0.009), ('friedman', 0.009), ('derivatives', 0.008), ('shrinkage', 0.008), ('numerical', 0.008), ('freund', 0.008), ('squared', 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 77 jmlr-2007-Stagewise Lasso
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulting sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.
2 0.048945773 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with l1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale l1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, l1 regularization, regularization path, interior-point methods.
3 0.033035025 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
4 0.031833362 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
5 0.023743881 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák
Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is proportional to 1/√n and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of m ≪ n time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem
6 0.023525359 90 jmlr-2007-Value Regularization and Fenchel Duality
7 0.02277224 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.022245388 9 jmlr-2007-AdaBoost is Consistent
9 0.018510865 52 jmlr-2007-Margin Trees for High-dimensional Classification
10 0.018445456 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
11 0.015209091 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
12 0.014540706 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
13 0.014453442 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
14 0.012616349 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
15 0.012519067 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
16 0.01251539 23 jmlr-2007-Concave Learners for Rankboost
17 0.012514556 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
18 0.012399517 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
19 0.012150259 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
20 0.01187234 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
topicId topicWeight
[(0, 0.083), (1, -0.002), (2, -0.008), (3, 0.012), (4, 0.056), (5, 0.002), (6, -0.08), (7, 0.059), (8, 0.028), (9, 0.019), (10, 0.025), (11, -0.068), (12, 0.008), (13, -0.008), (14, 0.019), (15, 0.049), (16, 0.06), (17, -0.063), (18, -0.049), (19, -0.123), (20, -0.063), (21, -0.13), (22, -0.023), (23, 0.011), (24, 0.038), (25, 0.067), (26, -0.018), (27, 0.201), (28, 0.127), (29, 0.198), (30, -0.173), (31, 0.23), (32, -0.093), (33, -0.315), (34, -0.34), (35, -0.2), (36, 0.353), (37, -0.107), (38, -0.066), (39, -0.039), (40, -0.162), (41, -0.263), (42, -0.171), (43, 0.147), (44, -0.097), (45, 0.172), (46, 0.084), (47, 0.079), (48, -0.096), (49, 0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.97550499 77 jmlr-2007-Stagewise Lasso
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulting sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.
2 0.25662893 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with l1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale l1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, l1 regularization, regularization path, interior-point methods.
3 0.15765563 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
4 0.12322135 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
5 0.12061498 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
Author: Masashi Sugiyama, Matthias Krauledat, Klaus-Robert Müller
Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when the outside of the training region is extrapolated. The situation where the training input points and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called the covariate shift. Under the covariate shift, standard model selection techniques such as cross validation do not work as desired since its unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV), for which we prove its unbiasedness even under the covariate shift. The IWCV procedure is the only one that can be applied for unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of our proposed method is illustrated by simulations, and furthermore demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions. Keywords: covariate shift, cross validation, importance sampling, extrapolation, brain-computer interface
6 0.1081976 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
7 0.090289064 23 jmlr-2007-Concave Learners for Rankboost
8 0.07875599 90 jmlr-2007-Value Regularization and Fenchel Duality
9 0.078716099 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
10 0.077995285 44 jmlr-2007-Large Margin Semi-supervised Learning
11 0.07318297 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
12 0.071466595 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
13 0.069503374 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
14 0.068031475 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
15 0.06473954 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
16 0.060471363 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
17 0.058303189 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
18 0.054628979 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
19 0.054456696 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
20 0.054225653 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
topicId topicWeight
[(4, 0.011), (8, 0.012), (10, 0.024), (11, 0.377), (12, 0.022), (15, 0.01), (24, 0.011), (28, 0.068), (40, 0.067), (45, 0.013), (48, 0.025), (60, 0.032), (80, 0.02), (85, 0.066), (98, 0.093), (99, 0.018)]
simIndex simValue paperId paperTitle
Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh
Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling
same-paper 2 0.69085681 77 jmlr-2007-Stagewise Lasso
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulting sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.
3 0.35961071 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with l1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale l1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, l1 regularization, regularization path, interior-point methods.
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1+ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
5 0.35005605 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
6 0.34811836 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
7 0.3469764 37 jmlr-2007-Gini Support Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
8 0.34656721 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
10 0.34420452 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
12 0.34305024 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
13 0.34276128 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
14 0.34187245 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
15 0.34133565 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
16 0.34127021 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
17 0.34047711 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
18 0.33985782 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
19 0.33926994 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
20 0.33760607 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study