jmlr jmlr2010 jmlr2010-77 knowledge-graph by maker-knowledge-mining
Title: Model-based Boosting 2.0

Source: pdf

Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner

Editor: Mikio Braun
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data.

Keywords: component-wise functional gradient descent, splines, decision trees

1. Overview

The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models.

The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features, ξ(Y | X = x) = f(x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f(x) = ∑_{j=1}^p f_j(x). Each model component f_j(x) might take only a subset of the features into account. Special cases are linear models f(x) = x⊤β, additive models f(x) = ∑_{j=1}^p f_j(x^(j)), where f_j is a function of the jth feature x^(j) only (smooth functions or stumps, for example), or a more complex function where f(x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately, to gain a better understanding of the model at hand.

The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but quantiles or expectiles may also be of interest. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., a limited number of boosting iterations). The model is then fitted using

(f̂_1, …, f̂_p) = argmin_{(f_1, …, f_p)} ∑_{i=1}^n w_i ρ(y_i, ∑_{j=1}^p f_j(x_i)).

Here (y_i, x_i), i = 1, …, n, are n training samples with responses y_i and potentially high-dimensional feature vectors x_i, and the w_i are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals, defined by the negative gradient of the loss function evaluated at the current fit, by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003).
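To make the generic procedure concrete, the following minimal sketch (an illustration only, not mboost's actual implementation) performs component-wise gradient boosting with simple least-squares base-learners. All names are invented for this example, and the loss enters only through its negative gradient, mirroring the role of the 'families' discussed below:

cwboost <- function(X, y, ngradient, mstop = 100, nu = 0.1, offset = mean(y)) {
  ## X: numeric matrix of features (no constant-zero columns assumed), y: response
  n <- nrow(X); p <- ncol(X)
  f <- rep(offset, n)                       # start from a constant offset
  coefs <- numeric(p)
  for (m in seq_len(mstop)) {
    u <- ngradient(y, f)                    # negative gradient at the current fit
    rss <- numeric(p); beta <- numeric(p)
    for (j in seq_len(p)) {                 # fit every base-learner to u ...
      beta[j] <- sum(X[, j] * u) / sum(X[, j]^2)
      rss[j]  <- sum((u - X[, j] * beta[j])^2)
    }
    jstar <- which.min(rss)                 # ... but update only the best-fitting one
    coefs[jstar] <- coefs[jstar] + nu * beta[jstar]
    f <- f + nu * X[, jstar] * beta[jstar]
  }
  list(offset = offset, coefficients = coefs)
}
ngrad_L2     <- function(y, f) y - f        # squared-error loss
ngrad_median <- function(y, f) sign(y - f)  # absolute-error loss (median regression)

With ngrad_L2 this reproduces component-wise L2 boosting; keeping mstop small is the algorithmic constraint mentioned above and is what leads to only a few components entering the fit.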
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f_j defines the final model. A more thorough introduction to boosting with applications in statistics, based on version 1.0 of mboost, is given by Bühlmann and Hothorn (2007).

As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses; that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,¹ quantile¹ and expectile¹ regression, censored regression (including Cox, Weibull,¹ log-logistic¹ or lognormal¹ models) as well as Poisson and negative binomial regression¹ for count data can be performed. Because the structure of the regression function f(x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009).

¹ Model family is new in version 2.0 or was added after the release of mboost 1.0.

2. Design and Implementation

The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f_j, and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations: one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code base is easier to maintain and more robust, and regression tests have been set up in a more unified way. Specifically, the new code base results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available.

Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory-efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009), and by parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of the mboost 2.0 features is available from the mboost vignette.

3. User Interface by Example

We illustrate the main components of the user interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features.
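As a starting point, the data can be loaded and inspected; the call below assumes that the bodyfat data frame ships with the package itself, which is an assumption of this sketch rather than a statement from the paper:

R> library("mboost")                        ## attach the package
R> data("bodyfat", package = "mboost")      ## assumed location of the n = 71 observations
R> dim(bodyfat)                             ## DXA response plus anthropometric covariates
R> names(bodyfat)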
Here, we extend the linear model by i) intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference), and iv) a robust median regression approach instead of the L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile), along with its negative gradient function, is defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:³

R> library(
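The code block is cut off at this point in the extracted text. As a rough sketch of what such a specification might look like (the covariates hipcirc, waistcirc and anthro3b appear in Figure 1 and the bodyfat data frame was loaded above, but the response name, the exact terms, arguments and any remaining variables of the original example are assumptions rather than the paper's verbatim code):

R> ## base-learners: linear effects, centered smooth deviations, tree interaction
R> fm <- DEXfat ~ bols(hipcirc) + bbs(hipcirc, center = TRUE, df = 1) +
+        bols(waistcirc) + bbs(waistcirc, center = TRUE, df = 1) +
+        bols(anthro3b) + bbs(anthro3b, center = TRUE, df = 1) +
+        btree(hipcirc, waistcirc)          ## hip/waist interaction
R> model <- mboost(fm, data = bodyfat,
+        family = QuantReg(0.5)) ### median regression

The fitted components can then be inspected with the usual extractor and plotting methods, for example coef() for the linear parts and plot() for the smooth and interaction terms.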
³ The complete R code for reproducing this example is given in the accompanying file mboost-MLOSS.

Figure 1: Out-of-bag empirical risk (A), indicating that 500 iterations are appropriate. Fitted model components for variable anthro3b, consisting of a linear (B) and a smooth term (C). The right panel shows the interaction model component between hip and waist circumferences.
Once the model has been fitted, it is important to assess the appropriate number of boosting iterations via the out-of-sample empirical risk.
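In mboost this is supported by built-in resampling infrastructure; a brief sketch follows, where cvrisk() and mstop() are the relevant package functions and the default bootstrap-based folds are assumed:

R> cvm <- cvrisk(model)      ## out-of-bag empirical risk per boosting iteration
R> plot(cvm)                 ## risk curves as in Figure 1(A)
R> mstop(cvm)                ## iteration with the smallest estimated risk
R> model[mstop(cvm)]         ## restrict the model to the selected number of iterations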
The out-of-bag risk estimates and the fitted model components are shown in Figure 1. They indicate that a model based on three components, including a smooth function of anthro3b and a bivariate function of hip and waist circumference, provides the best characterization of the median body fat composition (given the model specification offered to the boosting algorithm). A hip circumference larger than 110 cm leads to increased body fat, but only if the waist circumference is larger than 90 cm.
The sources of the mboost package are distributed at the Comprehensive R Archive Network under GPL-2, along with binaries for all major platforms as well as documentation and regression tests.
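Installation therefore follows the usual CRAN workflow:

R> install.packages("mboost")   ## install from CRAN
R> library("mboost")            ## attach the package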
References

P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science, 22(4):477–505, 2007.

N. Fenske, T. Kneib, and T. Hothorn. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Technical report, 2009.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28(2):337–407, 2000.

A. L. Garcia et al. Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths. Obesity Research, 13(3):626–634, 2005.

T. Kneib, T. Hothorn, and G. Tutz. Variable selection and model choice in geoadditive regression models. Biometrics, 65(2):626–634, 2009.
topicId topicWeight
[(3, 0.012), (32, 0.016), (36, 0.011), (37, 0.023), (75, 0.798), (85, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.99592024 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees 1. Overview The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features ξ(Y |X = x) = f (x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f (x) = ∑ p f j (x). Each model component f j (x) might take only j=1 c 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner. H OTHORN , B ÜHLMANN , K NEIB , S CHMID AND H OFNER a subset of the features into account. Special cases are linear models f (x) = x⊤ β, additive models f (x) = ∑ p f j (x( j) ), where f j is a function of the jth feature x( j) only (smooth functions or j=1 stumps, for example) or a more complex function where f (x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., limited number of boosting iterations). The model is then fitted using n p ( fˆ1 , . . . , fˆp ) = argmin ∑ wi ρ yi , ∑ f j (x) . ( f1 ,..., f p ) i=1 j=1 Here (yi , xi ), i = 1, . . . , n, are n training samples with responses yi and potentially high-dimensional feature vectors xi , and wi are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). 
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007). As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses, that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,1 quantile1 and expectile1 regression, censored regression (including Cox, Weibull1 , log-logistic1 or lognormal1 models) as well as Poisson and negative binomial regression1 for count data can be performed. Because the structure of the regression function f (x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009). 2. Design and Implementation The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f j , and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust and regression tests have been set-up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available. 1. Model family is new in version 2.0 or was added after the release of mboost 1.0. 2110 M ODEL - BASED B OOSTING 2.0 Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.2 3. User Interface by Example We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. 
3. User Interface by Example

We illustrate the main components of the user interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features.

Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference), and iv) a robust median regression approach instead of the L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function is defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:
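The code chunk that closes this paragraph is truncated in the extracted text (it breaks off after R> library(). What follows is a minimal sketch of what the fit could look like, not a verbatim reproduction of the paper’s code: the data location, the variable names DEXfat, hipcirc and waistcirc, and the exact base-learner arguments are assumptions for illustration.

R> library("mboost")
R> data("bodyfat", package = "TH.data")   ## assumed location; the data also shipped with earlier mboost versions
R> ## linear terms, centered smooth deviations from linearity, and a
R> ## tree-based interaction of hip and waist circumference
R> fm <- DEXfat ~ bols(hipcirc) + bbs(hipcirc, center = TRUE, df = 1) +
+                 bols(waistcirc) + bbs(waistcirc, center = TRUE, df = 1) +
+                 btree(hipcirc, waistcirc)
R> ## median regression via the check loss for the 0.5 quantile
R> model <- mboost(fm, data = bodyfat, family = QuantReg(tau = 0.5))
R> ## early stopping: choose the number of boosting iterations by resampling
R> cvm <- cvrisk(model)                    ## parallelized via multicore where available
R> model[mstop(cvm)]                       ## set mstop to the resampled optimum

Early stopping at the iteration selected by cvrisk() is what induces the intrinsic variable selection mentioned above: base-learners that are never chosen before that iteration simply drop out of the final model.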
2 0.99405676 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuous-time Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
3 0.99071354 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
4 0.99024636 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale
5 0.98239195 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
6 0.98055255 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.91451365 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
8 0.91229755 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
9 0.90648454 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
10 0.90109122 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.88988662 84 jmlr-2010-On Spectral Learning
12 0.88643986 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.88569355 40 jmlr-2010-Fast and Scalable Local Kernel Machines
14 0.88457674 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
15 0.88277555 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
16 0.88131839 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
17 0.87922251 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
18 0.87818718 63 jmlr-2010-Learning Instance-Specific Predictive Models
19 0.87724483 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
20 0.8728866 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming