jmlr jmlr2010 jmlr2010-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley
Abstract: The principle of parsimony, also known as "Ockham's razor", has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian!
Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length
Introduction

The problem of learning is often decomposed into the tasks of fitting parameters to some training data, and then selecting the best model using heuristic or principled methods, collectively referred to as model selection methods. Model selection methods range from simple yet powerful cross-validation based methods to the optimization of cost functions penalized for model complexity, derived from performance bounds or Bayesian priors.

This paper is not intended as a general review of the state-of-the-art in model selection, nor a tutorial; instead it is a synthesis of the collection of papers that we have assembled. It also provides a unifying perspective on Bayesian and frequentist methodologies used in various model selection methods. We highlight a new trend in research on model selection that blends these approaches.

The reader is expected to have some basic knowledge of familiar learning machines (linear models, neural networks, tree classifiers, and kernel methods) and elementary notions of learning theory (bias/variance tradeoff, model capacity or complexity, performance bounds). Novice readers are directed to the companion paper (Guyon, 2009), which reviews basic learning machines and common model selection techniques, and provides elements of learning theory.

When we started organizing workshops and competitions around the problem of model selection (of which this collection of papers is the product), both theoreticians and practitioners welcomed us with some scepticism, model selection often being viewed as somewhat "old hat".
For Bayesian theoreticians, the problem of model selection is circumvented by averaging all models over the posterior distribution. For risk minimization theoreticians (called "frequentists" by the Bayesians), the problem is solved by minimizing performance bounds. However, looking more closely, most theoretically grounded methods of solving or circumventing model selection have at least one hyper-parameter left somewhere, which ends up being optimized by cross-validation.

After explaining in Section 2 our notational conventions, we briefly review a range of different Bayesian and frequentist approaches to model selection in Section 3, which we then unify in Section 4 under the framework of multi-level optimization.

Notations and Conventions

In its broadest sense, model selection designates an ensemble of techniques used to select a model that best explains some data or phenomena, or best predicts future data, observations, or the consequences of actions.
There are several formulations of the supervised learning problem:

• Function approximation (induction) methods seek a function f (called model or learning machine) belonging to a model class F, which minimizes a specified risk functional (or maximizes a certain utility).

• Bayesian and ensemble methods make predictions according to model averages that are convex combinations of models f ∈ F, that is, which belong to the convex closure of the model class, F*. Bagging ensemble methods approximate E_D(f(x, D)), where f(x, D) is a function from the model class F, trained with m examples, and E_D(·) is the mathematical expectation over all training sets of size m.

Some of the other aspects of model selection will be discussed in Section 6. The parametrization of f differentiates the problem of model selection from the general machine learning problem. Instead of parameterizing f with one set of parameters, the model selection framework distinguishes between parameters and hyper-parameters. For instance, for a linear model f(x, w) = wᵀx, α = w; for a kernel method f(x, α) = ∑_k α_k K(x, x_k), α = [α_k].

We also refer to the parameters of the prior P(f) in Bayesian/MAP learning and the parameters of the regularizer Ω[f] in risk minimization as hyper-parameters, even if the resulting predictor is not an explicit function of those parameters, because they are used in the process of learning. In what follows, we relate the problem of model selection to that of hyper-parameter selection, taken in its broadest sense and encompassing all the cases mentioned above.

The corresponding empirical estimates of the expected risk R[f], denoted R_tr[f], R_va[f], and R_te[f], will be called respectively training error, validation error, and test error.
The Many Faces of Model Selection

In this section, we track model selection from various angles to finally reduce it to the unified view of multilevel inference. For instance, in the model class of kernel methods f(x) = ∑_k α_k K(x, x_k; θ), why couldn't we treat both α and θ as regular parameters? Consider for example the Gaussian radial basis function kernel K(x, x_k; θ) = exp(−‖x − x_k‖²/θ²). In addition, the capacity of f(x) = ∑_{k=1}^m α_k° K(x, x_k°; θ), as a function of the single parameter θ for fixed values α_k° and x_k°, is very low (to see that, note that very few examples can be learned without error by just varying the kernel width, given fixed vectors x_k° and fixed parameters α_k°). Using a Lagrange multiplier, the problem may be replaced by that of minimizing a regularized risk functional R_reg = R_tr + γ‖w‖², γ > 0, where the training loss function is the so-called "hinge loss".
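As a concrete illustration (ours, not from the paper), the regularized risk R_reg = R_tr + γ‖w‖² with hinge loss can be minimized by plain subgradient descent; the learning rate, iteration count, and toy data below are arbitrary choices for the sketch:

```python
import numpy as np

def hinge_regularized_fit(X, y, gamma=0.1, lr=0.01, epochs=200):
    """Minimize (1/m) * sum(max(0, 1 - y_i * (w . x_i))) + gamma * ||w||^2
    by subgradient descent. Labels y must be in {-1, +1}."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1              # hinge subgradient is -y_i x_i here, else 0
        grad = -(y[active, None] * X[active]).sum(axis=0) / m + 2 * gamma * w
        w -= lr * grad
    return w

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = hinge_regularized_fit(X, y)
print(np.sign(X @ w))  # predictions on the training points
```

The regularization term γ‖w‖² plays exactly the complexity-penalty role discussed above: larger γ shrinks w and lowers the capacity of the selected model.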
Interestingly, each method stems from a different theoretical justification (some are Bayesian, some are frequentist, and some a little bit of both, like PAC-Bayesian bounds; see, for example, Seeger, 2003, for a review), showing a beautiful example of theory convergence (Guyon, 2009).

Bayesian Model Selection

In the Bayesian framework, there is no model selection per se, since learning does not involve searching for an optimum function, but averaging over a posterior distribution. Bayesian model selection decomposes the prior P(α, θ) into a parameter prior P(α|θ) and a "hyper-prior" P(θ).

Model selection is the problem of adjusting the capacity or complexity of the models to the available amount of training data, to avoid either under-fitting or over-fitting. Solving the performance prediction problem would also solve the model selection problem, but model selection is an easier problem. Some algorithms, for example SVMs (Boser et al., 1992) or boosting (Freund and Schapire, 1996), optimize a guaranteed risk rather than the empirical risk R_tr[f], and therefore provide some guarantee of good generalization. Algorithms derived in this way have an embedded model selection mechanism. Common practice among frequentists is to split the available training data into m_tr training examples to adjust parameters and m_va validation examples to adjust hyper-parameters.
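The train/validation split practice can be sketched in a few lines; the toy data, candidate hyper-parameter grid, and ridge regression "learning machine" below are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3x + noise.
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)

# Split into m_tr training and m_va validation examples.
m_tr = 70
X_tr, y_tr = X[:m_tr], y[:m_tr]
X_va, y_va = X[m_tr:], y[m_tr:]

def ridge_fit(X, y, lam):
    """Adjust parameters w on the training split for a fixed hyper-parameter lam."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Adjust the hyper-parameter on the validation split.
best_lam, best_err = None, np.inf
for lam in [1e-3, 1e-1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    err = np.mean((X_va @ w - y_va) ** 2)  # validation error R_va
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
```

The inner call adjusts parameters; the outer loop adjusts the hyper-parameter — the two levels of inference the paper formalizes.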
(ii) Ensemble methods: The method "train" returns either a single model or a linear combination of models, so the formalism can include all ensemble methods.

We propose in the next section a new classification of multi-level inference methods, orthogonal to the classical Bayesian versus frequentist divide, referring to the way in which data are processed rather than the means by which they are processed. We categorize multi-level inference modules, each implementing one level of inference, into filter, wrapper, and embedded methods, borrowing from the conventional classification of feature selection methods (Kohavi and John, 1997; Blum and Langley, 1997; Guyon et al.). Such methods include preprocessing, feature construction, kernel design, architecture design, choice of prior or regularizers, choice of a noise model, and filter methods for feature selection.

Several top-ranking participants in the challenges we organized used PCA, including Neal and Zhang (2006), winners of the NIPS 2003 feature selection challenge, and Lutz (2006), winner of the WCCI 2006 performance prediction challenge. However, with respect to model selection, the selection of preprocessing generally happens in the "outer loop" of selection, hence it is at the highest level.
The 2-norm regularizer Ω[f] = ‖f‖²_H for kernel ridge regression, Support Vector Machines (SVM), and Least-Squares Support Vector Machines (LS-SVM) has been applied with success by many top-ranking participants of the challenges we organized. This regularizer was used in the winning entry of Radford Neal in the NIPS 2003 feature selection challenge (Neal and Zhang, 2006).

While the prior (or the regularizer) embeds our prior or domain knowledge of the model class, the likelihood (or the loss function) embeds our prior knowledge of the noise model on the predicted variable y. A simple but effective means of using a noise model is to generate additional training data by distorting the given training examples.

Because the learning machines in the wrapper setting are "black boxes", we cannot sample directly from the posterior distribution P(f(θ)|D) (or according to exp(−R_D[f(θ)]) or exp(−r[f(θ)])). Wrappers for feature selection use all sorts of techniques, but sequential forward selection or backward elimination methods are the most popular (Guyon et al.).

For frequentist approaches, the most frequently used evaluation function is the cross-validation estimator. These estimators may be poor predictors of the actual learning machine performance, but they are decent model selection indices, provided that the same data splits are used to compute the evaluation function for all models.
In the frequentist framework, the choice of a prior is replaced by the choice of a regularized functional. These are two-part evaluation functions, including the empirical risk (or the negative log-likelihood) and a regularizer (or a prior). One contribution (2008) proposes a multiple kernel learning (MKL) method, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices, a problem which can be cast as a convex program.
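A minimal sketch of the "linear combination of pre-specified kernels" idea. For simplicity we choose the combination weights on a grid over the simplex by validation error, a crude stand-in for the convex program the paper refers to; the RBF widths, data, and ridge parameter are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

def rbf_kernel(A, B, width):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width**2)

tr, va = slice(0, 40), slice(40, 60)
# Pre-specified kernel matrices (widths are arbitrary for the sketch).
kernels = [rbf_kernel(X, X, w) for w in (0.5, 1.0, 2.0)]

best = None
for b0 in np.linspace(0, 1, 6):
    for b1 in np.linspace(0, 1 - b0, 6):
        beta = np.array([b0, b1, 1 - b0 - b1])   # convex combination weights
        K = sum(b * Kd for b, Kd in zip(beta, kernels))
        # Kernel ridge regression on the training block of the combined kernel.
        alpha = np.linalg.solve(K[tr, tr] + 1e-2 * np.eye(40), y[tr])
        err = np.mean((K[va, tr] @ alpha - y[va]) ** 2)
        if best is None or err < best[0]:
            best = (err, beta)
print(best)
```

Any convex combination of positive semi-definite kernel matrices is again a valid kernel, which is what makes the MKL formulation well posed.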
Ensemble Methods

In Section 4, we have made an argument in favor of unifying model selection and ensemble methods, stemming either from a Bayesian or a frequentist perspective, in the common framework of multi-level optimization. We have also given examples of model selection and ensemble methods following filter, wrapper, or embedded strategies. While this categorization has the advantage of erasing the dogmatic origins of algorithms, it blurs some of the important differences between model selection and ensemble methods. Ensemble methods can be thought of as a way of circumventing model selection by voting among models rather than choosing a single model.

Arguably, model selection algorithms will remain important in applications where model simplicity and data understanding prevail, but ever increasing computer power has brought ensemble methods to the forefront of multi-level inference techniques. For that reason, we would like to single out those papers of our collection that have proposed or applied ensemble methods: Lutz (2006) used boosted shallow decision trees for his winning entries in two consecutive challenges. Using random subsets of features seems to be a winning strategy, which was applied by others to ensembles of trees using both boosting and bagging (Tuv et al.).

The base learners are then combined with a weighting scheme based on an information-theoretic criterion, instead of weighting the models with the posterior probability as in Bayesian model averaging. This can be understood by the observation that when the posterior distribution is sharply peaked around the posterior mode, averaging is almost the same as selecting the MAP model. While Wichard (2007) evaluates classifiers independently, the IBM team (2009) uses a forward selection method, adding a new candidate to the ensemble based on the resulting performance of the ensemble.
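The forward-selection strategy for building an ensemble can be sketched as follows; the candidate predictions and number of rounds are invented for the example, and we allow repeated selection of the same base learner, one common variant of this scheme:

```python
import numpy as np

def forward_ensemble_selection(preds, y, rounds=5):
    """Greedy forward selection of base learners.
    preds: list of prediction vectors, one per candidate base learner.
    At each round, add the candidate that most improves the ensemble's
    squared error on y; candidates may be added more than once."""
    chosen = []
    for _ in range(rounds):
        best_j, best_err = None, np.inf
        for j, p in enumerate(preds):
            trial = np.mean([preds[i] for i in chosen] + [p], axis=0)
            err = np.mean((trial - y) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen, best_err

# Three hypothetical base learners with different error profiles.
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
preds = [np.array([0.9, 0.2, 0.6, 0.8, 0.1]),
         np.array([0.6, 0.1, 0.9, 0.7, 0.4]),
         np.array([0.4, 0.5, 0.5, 0.9, 0.0])]
chosen, err = forward_ensemble_selection(preds, y)
print(chosen, err)
```

Note the contrast with Wichard's approach mentioned above: each candidate is judged by the performance of the whole ensemble after its addition, not in isolation.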
PAC-Bayes Approaches

Unifying Bayesian and frequentist model selection procedures under the umbrella of multi-level inference may shed new light on correspondences between methods and have a practical impact on the design of toolboxes incorporating model selection algorithms. But there are yet more synergies to be exploited between the Bayesian and the frequentist frameworks.

The PAC learning framework (Probably Approximately Correct), introduced by Valiant (1984) and later recognized to closely resemble the approach of the Russian school popularized in the US by Vapnik (1979), has become the beacon of frequentist learning-theoretic approaches. The PAC framework is rooted in the frequentist philosophy of defining probability in terms of frequencies of occurrence of events, and of bounding differences between mathematical expectations and frequencies of events, which vanish with increasingly large sample sizes (law of large numbers).

This is an important paradigm shift, which bridges the gap between the frequentist structural risk minimization approach to model selection (Vapnik, 1998) and the Bayesian prior approach.

Open Problems

• Domain knowledge: From the earliest embodiments of Ockham's razor using the number of free parameters, to modern techniques of regularization and bi-level optimization, model selection has come a long way.
The Particle Swarm Optimization (PSO) model selection method can find the best models in the toolbox and reproduce the results of the challenges. Much remains to be done to incorporate filter and wrapper model selection algorithms in machine learning toolboxes.

When no target variable is available as a "teaching signal", one can still define regularized risk functionals and multi-level optimization problems (Smola et al.).

• Semi-supervised learning: Very little has been done for model selection in semi-supervised learning problems, in which only some training instances come with target values. One contribution (2005) introduced the co-validation method, which uses the disagreement of various models in their predictions over the unlabeled data as a model selection tool. Deciding whether all or part of the unlabeled data should be used for training (data selection) may also be considered a model selection problem.

Model selection algorithms (or ensemble methods) often require many models to be trained. Finally, Reunanen (2007) proposed clever ways of organizing nested cross-validation evaluations in multiple levels of inference model selection, using cross-indexing.
Rather, we think of model selection and ensemble methods as two options to perform multi-level inference, which can be formalized in a unified way.

For instance, the ARD prior is implemented by defining the Gaussian kernel (for positive hyper-parameters κ_j):

K(x_h, x_k) = exp(−∑_{j=1}^n κ_j (x_{h,j} − x_{k,j})²)

instead of the regular Gaussian kernel K(x_h, x_k) = exp(−κ‖x_h − x_k‖²).
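The ARD-weighted kernel is straightforward to compute; this sketch (with made-up data and κ values) also checks that it reduces to the regular Gaussian kernel when all relevance weights κ_j are equal:

```python
import numpy as np

def ard_gaussian_kernel(X, kappa):
    """K(x_h, x_k) = exp(-sum_j kappa_j * (x_{h,j} - x_{k,j})^2),
    with one positive relevance weight kappa_j per input dimension."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2   # shape (m, m, n)
    return np.exp(-(diff2 * kappa).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Unequal weights: dimension 2 is effectively pruned (kappa ~ 0), as ARD intends.
K_ard = ard_gaussian_kernel(X, np.array([1.0, 0.5, 1e-6]))

# With equal weights kappa_j = kappa, this is the regular Gaussian kernel.
kappa = 0.7
K_regular = np.exp(-kappa * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
assert np.allclose(ard_gaussian_kernel(X, np.full(3, kappa)), K_regular)
print(K_ard.shape)
```

Driving a weight κ_j toward zero removes dimension j from the distance computation, which is how the ARD prior performs soft feature selection.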
Base learner: in an ensemble method, one of the individual learning machines that are part of the ensemble. Bagging is a parallel ensemble method (all base learners are built independently from training data subsets). The ensemble predictions are made by averaging the predictions of the base learners. The ensemble approximates E_D(f(x, D)), where f(x, D) is a function from the model class F, trained with m examples, and E_D(·) is the mathematical expectation over all training sets of size m.

Empirical risk: an estimator of the expected risk that is the average of the loss over a finite number of examples drawn according to P(x, y): R_emp = (1/m) ∑_{k=1}^m L(f(x_k), y_k).
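The empirical risk estimator is a one-liner; here with a squared loss on toy data (the loss, model, and data are our own choices for illustration):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp = (1/m) * sum_k loss(f(x_k), y_k) over a finite sample."""
    return np.mean([loss(f(xk), yk) for xk, yk in zip(X, y)])

squared_loss = lambda pred, target: (pred - target) ** 2

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.9])
f = lambda x: x            # the identity as a stand-in "model"
r = empirical_risk(f, X, y, squared_loss)
print(r)  # -> 0.0175
```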
The two most prominent ensemble methods are bagging (Breiman, 1996) and boosting (Freund and Schapire, 1996). Bagging is a parallel ensemble method (all trees are built independently from training data subsets), while boosting is a serial ensemble method (trees complementing each other are progressively added to decrease the residual error).
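A minimal bagging sketch: to keep the example self-contained, each base learner is a least-squares line fitted to one bootstrap subset (rather than a tree), and the ensemble averages their predictions; data and learner count are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(X_tr, y_tr, X_te, n_learners=25):
    """Bagging sketch: fit one least-squares line per bootstrap subset of the
    training data, then average the base learners' predictions (parallel
    ensemble; each learner is built independently)."""
    m = len(X_tr)
    preds = []
    for _ in range(n_learners):
        idx = rng.integers(0, m, size=m)        # bootstrap sample of size m
        A = np.c_[X_tr[idx], np.ones(m)]        # design matrix [x, 1]
        w, *_ = np.linalg.lstsq(A, y_tr[idx], rcond=None)
        preds.append(np.c_[X_te, np.ones(len(X_te))] @ w)
    return np.mean(preds, axis=0)               # ensemble average

X_tr = rng.uniform(-1, 1, size=(50, 1))
y_tr = 2 * X_tr[:, 0] + 0.2 * rng.normal(size=50)
X_te = np.array([[0.0], [0.5]])
p = bagged_predict(X_tr, y_tr, X_te)
print(p)
```

Because the learners are independent given the bootstrap samples, the loop parallelizes trivially — the "parallel ensemble" property contrasted with boosting above.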
If both P(D|f) and P(f) are exponentially distributed (P(y|x, f) = exp(−L(f(x), y)) and P(f) = exp(−Ω[f])), then MAP learning is equivalent to the minimization of a regularized risk functional. In the simplest case, P(x) and the noise model are not subject to training (the values of x are fixed and the noise model is known).

Over-fitting: fitting the training data well (i.e., obtaining a large model likelihood or low empirical risk values computed from the training data), but generalizing poorly on new test examples.
Most modern model selection strategies claim some affiliation with Ockham's razor, but the number of free parameters is replaced by a measure of capacity or complexity of the model class, C[F]. One particularly successful regularizer is the 2-norm regularizer ‖f‖²_H for model functions f(x) = ∑_{k=1}^m α_k K(x, x_k) belonging to a Reproducing Kernel Hilbert Space H (kernel methods). In the general case, ‖f‖²_H = fᵀK⁻¹f = αᵀKα, where f = [f(x_k)]_{k=1}^m is the vector of predictions on the training examples, α = [α_k]_{k=1}^m, and K = [K(x_h, x_k)], h, k = 1, …, m.
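The identity ‖f‖²_H = fᵀK⁻¹f = αᵀKα can be checked numerically; the kernel, its width, and the data here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
alpha = rng.normal(size=6)

# Gram matrix K = [K(x_h, x_k)] for a Gaussian kernel (width 1, arbitrary).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)

f = K @ alpha                            # f = [f(x_k)] since f(x) = sum_k alpha_k K(x, x_k)
norm_via_alpha = alpha @ K @ alpha       # alpha^T K alpha
norm_via_f = f @ np.linalg.solve(K, f)   # f^T K^{-1} f
assert np.isclose(norm_via_alpha, norm_via_f)
print(norm_via_alpha)
```

The two expressions agree because f = Kα and K is symmetric, so fᵀK⁻¹f = αᵀKK⁻¹Kα = αᵀKα.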
Given a model space or hypothesis space F of functions y = f(x), and a loss function L(f(x), y), we want to find the function f* ∈ F that minimizes the expected risk R[f] = ∫ L(f(x), y) dP(x, y). The minimization of the empirical risk is the basis of many machine learning approaches to selecting f*, but minimizing regularized risk functionals is often preferred. In the risk minimization framework, it is not assumed that the model space includes a function or "concept" which generated the data (see concept space and hypothesis space).
Selected references:
Over-fitting in model selection and subsequent selection bias in performance evaluation.
Winning the KDD cup orange challenge with ensemble selection.
A pattern search method for model selection of Support Vector Regression.
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
2 0.19839446 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
Author: Gavin C. Cawley, Nicola L. C. Talbot
Abstract: Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a nonnegligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds. Keywords: model selection, performance evaluation, bias-variance trade-off, selection bias, overfitting
3 0.094194256 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos
Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought.
In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show
5 0.081825919 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
6 0.073896699 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
7 0.071331382 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
8 0.068762101 40 jmlr-2010-Fast and Scalable Local Kernel Machines
9 0.068368465 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
10 0.067196816 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
11 0.066170357 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
12 0.065654188 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
13 0.063465126 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
14 0.0633756 15 jmlr-2010-Approximate Tree Kernels
15 0.063196093 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.061841577 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
17 0.059552468 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
18 0.058211766 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
19 0.057399511 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
20 0.056552567 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
topicId topicWeight
[(0, -0.298), (1, 0.073), (2, -0.012), (3, 0.081), (4, -0.005), (5, 0.15), (6, 0.063), (7, 0.002), (8, -0.076), (9, -0.041), (10, -0.005), (11, -0.116), (12, -0.08), (13, -0.066), (14, -0.09), (15, -0.061), (16, 0.116), (17, -0.054), (18, 0.09), (19, 0.005), (20, -0.054), (21, 0.001), (22, -0.022), (23, 0.117), (24, -0.091), (25, -0.171), (26, -0.203), (27, 0.047), (28, -0.376), (29, -0.101), (30, -0.072), (31, -0.015), (32, 0.025), (33, 0.091), (34, -0.145), (35, -0.126), (36, -0.04), (37, -0.028), (38, 0.048), (39, 0.099), (40, 0.057), (41, 0.099), (42, -0.019), (43, 0.021), (44, 0.012), (45, -0.047), (46, 0.019), (47, -0.004), (48, 0.026), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.93930358 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley
Abstract: The principle of parsimony, also known as “Ockham’s razor”, has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length
2 0.86249512 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
Author: Gavin C. Cawley, Nicola L. C. Talbot
Abstract: Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a nonnegligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds. Keywords: model selection, performance evaluation, bias-variance trade-off, selection bias, overfitting
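The selection-bias mechanism this abstract describes can be demonstrated in a few lines: pick the best of many purely random "models" by its score on a selection set, and the winning score is optimistic, while a fresh test set reveals chance-level performance. The following is a minimal illustrative simulation under assumed settings (sample sizes, number of candidates), not the paper's experimental setup:

```python
import random

random.seed(0)

def accuracy(preds, labels):
    """Fraction of positions where prediction equals label."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Random binary labels: a selection set (used to pick a model) and a
# held-out test set (used only for the final, unbiased evaluation).
n_select, n_test, k = 100, 100, 500
labels_select = [random.randint(0, 1) for _ in range(n_select)]
labels_test = [random.randint(0, 1) for _ in range(n_test)]

# k candidate "models", each a fixed table of coin-flip predictions --
# by construction, none has any real predictive power.
models = [[random.randint(0, 1) for _ in range(n_select + n_test)]
          for _ in range(k)]

# Model selection: keep the candidate that scores best on the selection set.
scores = [accuracy(m[:n_select], labels_select) for m in models]
best = max(range(k), key=lambda i: scores[i])

selected_score = scores[best]  # optimistic: the criterion was over-fitted
unbiased_score = accuracy(models[best][n_select:], labels_test)  # ~ chance
```

With 500 candidates, the maximum of 500 chance-level scores is well above 0.5, while the same model scores near 0.5 on fresh data, which is exactly why performance must be evaluated on data not used by the selection procedure.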
3 0.53347319 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees
1. Overview
The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features, ξ(Y | X = x) = f(x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f(x) = ∑_{j=1}^p f_j(x). Each model component f_j(x) might take only a subset of the features into account. © 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner.
Special cases are linear models f(x) = x⊤β, additive models f(x) = ∑_{j=1}^p f_j(x^(j)), where f_j is a function of the jth feature x^(j) only (smooth functions or stumps, for example), or a more complex function where f(x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., a limited number of boosting iterations). The model is then fitted using

(f̂_1, …, f̂_p) = argmin_{(f_1, …, f_p)} ∑_{i=1}^n w_i ρ(y_i, ∑_{j=1}^p f_j(x_i)).

Here (y_i, x_i), i = 1, …, n, are n training samples with responses y_i and potentially high-dimensional feature vectors x_i, and w_i are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f_j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007).
As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses; that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,¹ quantile¹ and expectile¹ regression, censored regression (including Cox, Weibull¹, log-logistic¹ or lognormal¹ models) as well as Poisson and negative binomial regression¹ for count data can be performed. Because the structure of the regression function f(x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009).
2. Design and Implementation
The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f_j, and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust, and regression tests have been set up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available.
1. Model family is new in version 2.0 or was added after the release of mboost 1.0.
Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.
3. User Interface by Example
We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference) and iv) using a robust median regression approach instead of L2 risk.
For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function is defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:
R> library(
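mboost itself is an R package; as a language-neutral illustration of the component-wise functional gradient descent it implements, the sketch below performs L2 boosting with simple univariate linear base-learners, updating only the best-fitting component in each iteration. This is a simplified reading of the algorithm described above (function name, step size, and base-learner choice are illustrative), not the package's actual code:

```python
def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting with univariate linear base-learners.

    Each iteration fits every feature's least-squares coefficient to the
    current residuals (the negative gradient of the squared loss) and
    applies a shrunken update (step size nu) to the single best component.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    offset = sum(y) / n                      # start from the mean of y
    resid = [yi - offset for yi in y]
    for _ in range(n_iter):
        best_j, best_b, best_loss = None, 0.0, float("inf")
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(v * v for v in xj)
            if sxx == 0:
                continue
            b = sum(v * r for v, r in zip(xj, resid)) / sxx  # univariate LS
            loss = sum((r - b * v) ** 2 for r, v in zip(resid, xj))
            if loss < best_loss:
                best_j, best_b, best_loss = j, b, loss
        if best_j is None:
            break
        coef[best_j] += nu * best_b          # update only the best component
        resid = [r - nu * best_b * X[i][best_j]
                 for i, r in enumerate(resid)]
    return offset, coef
```

Because only one coefficient moves per iteration, early stopping leaves irrelevant features at exactly zero, which is the intrinsic variable selection the body fat example relies on.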
4 0.458361 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
5 0.38545167 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instance-based learning
6 0.36423963 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
7 0.36160991 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
8 0.34903502 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
9 0.33591086 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
10 0.32720158 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
12 0.32111791 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
13 0.31671646 10 jmlr-2010-An Exponential Model for Infinite Rankings
14 0.31438503 15 jmlr-2010-Approximate Tree Kernels
15 0.31295046 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
16 0.30609149 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
17 0.29829034 109 jmlr-2010-Stochastic Composite Likelihood
18 0.29054782 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
19 0.28927416 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
20 0.2795577 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
topicId topicWeight
[(3, 0.016), (4, 0.014), (8, 0.02), (15, 0.013), (21, 0.024), (32, 0.067), (33, 0.015), (36, 0.041), (37, 0.062), (75, 0.148), (81, 0.012), (85, 0.064), (96, 0.02), (97, 0.403)]
simIndex simValue paperId paperTitle
same-paper 1 0.72241825 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley
Abstract: The principle of parsimony, also known as “Ockham’s razor”, has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length
2 0.43295372 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile
Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol
Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising
4 0.43108681 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
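The sampling-based approximation this abstract refers to can be sketched as follows: draw random feature coalitions with the Shapley weights (via random permutations) and average the change in the model's output when one feature is switched between the explained instance and a baseline. The function below is an illustrative sketch under that reading of the method (the names and the baseline-vector formulation are assumptions, not the authors' code):

```python
import random

def shapley_contribution(predict, instance, baseline, j, n_samples=200, rng=None):
    """Monte Carlo estimate of feature j's contribution to predict(instance).

    Each sample draws a coalition S of the other features with the Shapley
    weights (a uniform cut of a random permutation), then compares the model
    output with feature j taken from `instance` versus from `baseline`,
    while features in S come from `instance` and the rest from `baseline`.
    """
    rng = rng or random.Random(0)
    p = len(instance)
    others = [k for k in range(p) if k != j]
    total = 0.0
    for _ in range(n_samples):
        perm = others[:]
        rng.shuffle(perm)
        cut = rng.randint(0, len(others))    # coalition size 0..p-1, uniform
        s = set(perm[:cut])
        x_with = [instance[k] if (k in s or k == j) else baseline[k]
                  for k in range(p)]
        x_without = [instance[k] if k in s else baseline[k]
                     for k in range(p)]
        total += predict(x_with) - predict(x_without)  # marginal effect of j
    return total / n_samples
```

For an additive model the estimate recovers each feature's coefficient exactly, and in general the per-feature contributions sum (in expectation) to the difference between the explained prediction and the baseline prediction, which is what makes the explanations additive and intuitive.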
5 0.42785615 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
Author: Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten
Abstract: WEKA is a popular machine learning workbench with a development life of nearly two decades. This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software’s functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project. Keywords: machine learning software, open source software
6 0.42492265 56 jmlr-2010-Introduction to Causal Inference
7 0.42127582 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
8 0.41954994 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
9 0.41941157 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.41744259 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
12 0.41693014 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
13 0.41638112 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
14 0.41616836 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
15 0.4126496 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.4121567 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
17 0.41078666 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
18 0.41064492 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
19 0.41047883 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
20 0.41024956 109 jmlr-2010-Stochastic Composite Likelihood