hunch_net hunch_net-2005 hunch_net-2005-131 knowledge-graph by maker-knowledge-mining

131 hunch net-2005-11-16-The Everything Ensemble Edge


meta info for this blog

Source: html

Introduction: Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression. For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…}, evaluate the average performance of the method. A series of conclusions can be drawn from the observations. (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance.
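As a concrete illustration of the calibration-and-scoring step above, here is a minimal sketch assuming scikit-learn, a gradient-boosted base learner, and a synthetic dataset — none of which come from the original study; it only mirrors the idea of calibrating margins via sigmoid (Platt) fitting or isotonic regression and then scoring under several of the listed losses.

```python
# A minimal sketch (not the authors' code): train a model, calibrate its margins
# with sigmoid (Platt) fitting or isotonic regression, and score it under several
# of the losses mentioned above. All modeling choices here are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for method in ("sigmoid", "isotonic"):   # Platt scaling vs. isotonic regression
    model = CalibratedClassifierCV(GradientBoostingClassifier(), method=method, cv=3)
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]  # calibrated probability of class 1
    print(method,
          "accuracy=%.3f" % accuracy_score(y_te, (p > 0.5).astype(int)),
          "auc=%.3f" % roc_auc_score(y_te, p),
          "cross-entropy=%.3f" % log_loss(y_te, p),
          "squared-error=%.3f" % brier_score_loss(y_te, p))
```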


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. [sent-1, score-0.553]

2 This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. [sent-2, score-0.227]

3 A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. [sent-3, score-1.232]

4 Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression. [sent-4, score-0.57]

5 For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…}, evaluate the average performance of the method. [sent-5, score-0.149]

6 A series of conclusions can be drawn from the observations. [sent-6, score-0.087]

7 (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance. [sent-7, score-1.084]

8 The metalearning algorithm which simply chooses the best (based upon a small validation set) performs much better. [sent-8, score-0.77]

9 A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better. [sent-9, score-1.274]

10 Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them. [sent-11, score-0.222]

11 The most significant drawback of this method is computational complexity. [sent-12, score-0.201]

12 Techniques for reducing the computational complexity are therefore of significant interest. [sent-13, score-0.201]

13 It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well, at a computational cost which is significantly less than “run all algorithms on all settings and test”. [sent-14, score-0.854]

14 Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength into a single coherent learning algorithm? [sent-17, score-0.662]

15 Why do ensembles give such a significant consistent edge in practice? [sent-18, score-0.386]

16 A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. [sent-19, score-0.317]

17 There are several pieces of theory that explain individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work. [sent-20, score-0.13]
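Sentences 8 and 9 above describe two metalearners: picking the single best model on a validation set, and greedily growing an ensemble by stepwise refinement of validation performance. Below is a minimal sketch of that greedy loop, assuming scikit-learn and an illustrative four-model library; it is not the published ensemble-selection implementation, and the metric and number of greedy steps are arbitrary choices.

```python
# Hedged sketch of two metalearners discussed above:
# (1) pick the single model that scores best on a validation set;
# (2) greedy stepwise ensemble selection: repeatedly add (with replacement) the
#     model whose inclusion most improves validation performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

library = [GradientBoostingClassifier(random_state=0),
           RandomForestClassifier(random_state=0),
           LogisticRegression(max_iter=1000),
           KNeighborsClassifier(n_neighbors=15)]
val_preds = [m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1] for m in library]

# (1) choose-best metalearner: best single model on the validation set
best = max(range(len(library)), key=lambda i: roc_auc_score(y_val, val_preds[i]))
print("best single model:", type(library[best]).__name__)

# (2) greedy stepwise ensemble selection, averaging predictions of chosen models
ensemble = []                       # indices of selected models (repeats allowed)
for _ in range(20):                 # fixed number of greedy steps
    gains = [roc_auc_score(y_val, np.mean([val_preds[j] for j in ensemble + [i]], axis=0))
             for i in range(len(library))]
    ensemble.append(int(np.argmax(gains)))
print("ensemble members:", [type(library[i]).__name__ for i in ensemble])
```

Selecting with replacement effectively weights the models: a model added k times contributes with weight proportional to k in the averaged prediction.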


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('boosted', 0.262), ('trees', 0.22), ('ensembles', 0.212), ('metalearning', 0.175), ('perform', 0.167), ('decision', 0.161), ('performs', 0.157), ('performance', 0.149), ('methods', 0.147), ('representative', 0.141), ('caveats', 0.141), ('validation', 0.13), ('efforts', 0.13), ('ensemble', 0.13), ('rough', 0.127), ('algorithm', 0.127), ('test', 0.119), ('using', 0.116), ('computational', 0.108), ('upon', 0.1), ('set', 0.099), ('neural', 0.098), ('datasets', 0.097), ('networks', 0.095), ('refinement', 0.094), ('calibrated', 0.094), ('calibrating', 0.094), ('crew', 0.094), ('margins', 0.094), ('significant', 0.093), ('size', 0.091), ('empirical', 0.091), ('designers', 0.087), ('conclusions', 0.087), ('stepwise', 0.087), ('cost', 0.086), ('single', 0.083), ('whenever', 0.082), ('caruana', 0.082), ('invent', 0.082), ('neighbors', 0.082), ('give', 0.081), ('best', 0.081), ('run', 0.079), ('strength', 0.079), ('scheme', 0.079), ('curve', 0.075), ('roc', 0.075), ('failed', 0.075), ('combine', 0.075)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 131 hunch net-2005-11-16-The Everything Ensemble Edge


2 0.22399612 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

3 0.21811172 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning

Introduction: About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct, empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of Mnist which are artificially hardened so as to make them solvable only by deep learning algorithms. This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance. Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing wha

4 0.16539803 26 hunch net-2005-02-21-Problem: Cross Validation

Introduction: The essential problem here is the large gap between experimental observation and theoretical understanding. Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K . For each fold, a classifier is trained on the other folds and then test on the fold. Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate. Past Work (I’ll add more as I remember/learn.) Devroye , Rogers, and Wagner analyzed cross validation and found algorithm specific bounds. Not all of this is online, but here is one paper . Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K . Avrim Blum, Adam Kalai , and myself analyzed cross validation and found tha

5 0.15325503 347 hunch net-2009-03-26-Machine Learning is too easy

Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there

6 0.14067999 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

7 0.13688639 201 hunch net-2006-08-07-The Call of the Deep

8 0.1318287 362 hunch net-2009-06-26-Netflix nearly done

9 0.13136616 41 hunch net-2005-03-15-The State of Tight Bounds

10 0.12458495 3 hunch net-2005-01-24-The Humanloop Spectrum of Machine Learning

11 0.12267027 16 hunch net-2005-02-09-Intuitions from applied learning

12 0.11967744 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

13 0.11852448 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

14 0.11610167 28 hunch net-2005-02-25-Problem: Online Learning

15 0.11408382 235 hunch net-2007-03-03-All Models of Learning have Flaws

16 0.11350004 104 hunch net-2005-08-22-Do you believe in induction?

17 0.11153508 14 hunch net-2005-02-07-The State of the Reduction

18 0.11037438 177 hunch net-2006-05-05-An ICML reject

19 0.10879858 332 hunch net-2008-12-23-Use of Learning Theory

20 0.10793605 152 hunch net-2006-01-30-Should the Input Representation be a Vector?


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.262), (1, 0.142), (2, 0.052), (3, -0.039), (4, 0.049), (5, -0.01), (6, -0.059), (7, 0.071), (8, 0.037), (9, -0.034), (10, -0.122), (11, 0.118), (12, 0.038), (13, -0.112), (14, 0.056), (15, 0.118), (16, 0.108), (17, 0.013), (18, -0.153), (19, 0.112), (20, 0.009), (21, -0.031), (22, -0.033), (23, 0.009), (24, -0.033), (25, -0.015), (26, -0.021), (27, -0.047), (28, 0.033), (29, 0.013), (30, 0.007), (31, 0.021), (32, 0.012), (33, -0.013), (34, 0.101), (35, 0.134), (36, 0.025), (37, -0.005), (38, 0.114), (39, 0.099), (40, 0.064), (41, -0.091), (42, -0.067), (43, -0.023), (44, -0.04), (45, 0.055), (46, 0.001), (47, 0.011), (48, -0.06), (49, 0.024)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97829318 131 hunch net-2005-11-16-The Everything Ensemble Edge


2 0.76786923 19 hunch net-2005-02-14-Clever Methods of Overfitting


3 0.71865666 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

Introduction: Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion. The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance. AROC is the area under the ROC curve. It is a single number summary of performance. The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.” The ROC vs. accuracy discussion is o

4 0.68658727 26 hunch net-2005-02-21-Problem: Cross Validation


5 0.63542932 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

Introduction: There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes at amongst others: Neural Networks. Neural networks use randomization to assign initial weights. Boltzmann Machines/ Deep Belief Networks . Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies. Experts algorithms. Randomized weighted majority use random bits as a part of the prediction process to achieve better theoretical guarantees. A basic question is: “Should there

6 0.62817889 32 hunch net-2005-02-27-Antilearning: When proximity goes bad

7 0.60233074 56 hunch net-2005-04-14-Families of Learning Theory Statements

8 0.59276599 43 hunch net-2005-03-18-Binomial Weighting

9 0.56976157 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

10 0.56554699 201 hunch net-2006-08-07-The Call of the Deep

11 0.55928385 163 hunch net-2006-03-12-Online learning or online preservation of learning?

12 0.55655783 177 hunch net-2006-05-05-An ICML reject

13 0.55506289 407 hunch net-2010-08-23-Boosted Decision Trees for Deep Learning

14 0.55390555 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

15 0.55190527 138 hunch net-2005-12-09-Some NIPS papers

16 0.53250027 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

17 0.5318976 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

18 0.52933478 148 hunch net-2006-01-13-Benchmarks for RL

19 0.5288437 362 hunch net-2009-06-26-Netflix nearly done

20 0.52789557 67 hunch net-2005-05-06-Don’t mix the solution into the problem


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.024), (2, 0.043), (14, 0.033), (16, 0.011), (27, 0.248), (38, 0.155), (46, 0.018), (48, 0.022), (53, 0.125), (55, 0.077), (61, 0.012), (64, 0.036), (77, 0.033), (92, 0.011), (94, 0.045), (95, 0.034)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96495461 131 hunch net-2005-11-16-The Everything Ensemble Edge


2 0.94055462 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

Introduction: A type of prediction problem is specified by the type of samples produced by a data source (Example: X x {0,1}, X x [0,1], X x {1,2,3,4,5}, etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero. For this post, we can think of a learning reduction as: A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification). A mapping Q from predictors for type T’ to predictors for type T. The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form: Theorem: For all base predictors b, for all distributions D over examples of type T: E_{(x,y)~D} L_T(y, Q(b,x)) <= f(E_{(x’,y’)~R(D)} L_{T’}(y’, b(x’))) Here L_T is the loss for the type T problem and L_{T’} is the loss for the type T’ problem. Also, R(D) is the distribution ov

3 0.93431073 353 hunch net-2009-05-08-Computability in Artificial Intelligence

Introduction: Normally I do not blog, but John kindly invited me to do so. Since computability issues play a major role in Artificial Intelligence and Machine Learning, I would like to take the opportunity to comment on that and raise some questions. The general attitude is that AI is about finding efficient smart algorithms. For large parts of machine learning, the same attitude is not too dangerous. If you want to concentrate on conceptual problems, simply become a statistician. There is no analogous escape for modern research on AI (as opposed to GOFAI rooted in logic). Let me show by analogy why limiting research to computational questions is bad for any field. Except in computer science, computational aspects play little role in the development of fundamental theories: Consider e.g. set theory with axiom of choice, foundations of logic, exact/full minimax for zero-sum games, quantum (field) theory, string theory, … Indeed, at least in physics, every new fundamental theory seems to

4 0.93089855 19 hunch net-2005-02-14-Clever Methods of Overfitting


5 0.92894709 233 hunch net-2007-02-16-The Forgetting

Introduction: How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are: year 2006 2005 2002 1997 1987 1967 count 4 10 5 1 0 0 This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and a

6 0.92132342 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

7 0.9150095 14 hunch net-2005-02-07-The State of the Reduction

8 0.91413528 12 hunch net-2005-02-03-Learning Theory, by assumption

9 0.91277856 26 hunch net-2005-02-21-Problem: Cross Validation

10 0.90199918 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

11 0.89575154 259 hunch net-2007-08-19-Choice of Metrics

12 0.89471072 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

13 0.8942129 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

14 0.89092034 82 hunch net-2005-06-17-Reopening RL->Classification

15 0.88773167 194 hunch net-2006-07-11-New Models

16 0.88761324 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

17 0.88585448 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

18 0.88564736 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.8853876 370 hunch net-2009-09-18-Necessary and Sufficient Research

20 0.8841117 201 hunch net-2006-08-07-The Call of the Deep