hunch_net hunch_net-2005 hunch_net-2005-19 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. (Table columns: Name, Method, Explanation, Remedy.) Traditional overfitting: Train a complex predictor on too-few examples. Remedy: Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter tweak …
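The introduction's first table row (traditional overfitting, with held-out pristine examples as a remedy) is easy to reproduce. Below is a minimal sketch under assumptions of my own (synthetic Gaussian data, scikit-learn decision trees; none of this comes from the post): an unrestricted tree trained on too few noisy examples memorizes them, and only the held-out examples reveal how much its performance was over-represented.

```python
# A minimal sketch of traditional overfitting (illustrative; not from the post).
# A flexible predictor memorizes a tiny noisy training set; pristine held-out
# examples expose the gap.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 10000, 25            # too few examples, many features
X = rng.normal(size=(n_train + n_test, d))
y = (X[:, 0] + 0.25 * rng.normal(size=n_train + n_test) > 0).astype(int)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

complex_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # unbounded depth
simple_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

for name, m in [("complex tree", complex_tree), ("depth-1 stump", simple_tree)]:
    print("%-13s train %.2f  test %.2f" % (name, m.score(X_tr, y_tr), m.score(X_te, y_te)))
# The unbounded tree fits its 30 training examples perfectly but scores well
# below that on the held-out examples; the stump usually shows a much smaller gap.
```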
sentIndex sentText sentNum sentScore
1 For this post, I will define overfitting more generally as over-representing the performance of systems. [sent-2, score-0.675]
2 There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. [sent-3, score-1.318]
3 (Table columns: Name, Method, Explanation, Remedy.) Traditional overfitting: Train a complex predictor on too-few examples. [sent-6, score-0.359]
4 Parameter tweak overfitting: Use a learning algorithm with many parameters. [sent-12, score-0.463]
5 For example, choosing the features so as to optimize test set performance can achieve this (see the test-set tuning sketch after this list). [sent-14, score-0.599]
6 Brittle measure: Use a measure of performance which is especially brittle to overfitting. [sent-16, score-0.769]
7 One common example is pretending that cross validation performance is drawn from an i.i.d. distribution (see the cross-validation interval sketch after this list). [sent-21, score-0.611]
8 Choice of measure: Choose the best of accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc. (see the metric-choice sketch after this list). [sent-29, score-0.531]
9 For example, use the performance measure directly motivated by the problem. [sent-35, score-0.455]
10 Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction. [sent-40, score-0.696]
11 One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice. [sent-42, score-0.73]
12 Data set selection: Choose to report results on some subset of datasets where your algorithm performs well (see the dataset-selection sketch after this list). [sent-44, score-0.59]
13 The reason we test on natural datasets is that we believe there is some structure captured by the past problems that helps on future problems. [sent-45, score-0.57]
14 Good Contest performance can’t be faked this way. [sent-49, score-0.316]
15 Reprobleming: Alter the problem so that your performance improves. [sent-50, score-0.316]
16 For example, take a time series dataset and use cross validation (see the time-series sketch after this list). [sent-51, score-0.408]
17 Old datasets: Create an algorithm for the purpose of improving performance on old datasets. [sent-56, score-0.833]
18 After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future (see the feedback-design sketch after this list). [sent-57, score-0.652]
19 Some conferences have canonical datasets that have been used for a decade… Remedy: prefer simplicity in algorithm design. [sent-58, score-0.512]
20 Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it. [sent-60, score-0.58]
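Test-set tuning sketch (for the parameter tweak and feature-choice sentences above). This is my own construction, not the post's: the labels are pure noise, so no method has a real edge, yet picking the feature subset that maximizes test-set accuracy still reports an impressive number, while selecting on a separate validation set and touching the test set once gives roughly chance.

```python
# Sketch of parameter/feature tweak overfitting (my construction; not from the
# post). The labels are pure noise, so no method has a real edge; tuning the
# feature subset against the test set still produces an impressive number.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 2, size=300)                 # labels carry no signal
X_tr, y_tr = X[:100], y[:100]
X_va, y_va = X[100:200], y[100:200]
X_te, y_te = X[200:], y[200:]

def fit_score(cols, X_eval, y_eval):
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return model.score(X_eval[:, cols], y_eval)

subsets = [rng.choice(40, size=10, replace=False) for _ in range(200)]

# Tweaking against the test set: report the best test accuracy encountered.
tweaked = max(fit_score(c, X_te, y_te) for c in subsets)

# Honest protocol: select on the validation set, then touch the test set once.
chosen = max(subsets, key=lambda c: fit_score(c, X_va, y_va))
honest = fit_score(chosen, X_te, y_te)

print("tuned on the test set: %.2f" % tweaked)   # typically well above 0.5
print("tuned on validation  : %.2f" % honest)    # typically close to 0.5
```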
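Cross-validation interval sketch (for the brittle measure sentence above). This is a hypothetical computation of my own, not from the post, of exactly the misuse being described: a 95% interval that treats the n pooled cross-validation outcomes as i.i.d. draws, even though every fold's classifier shares most of its training data with the others and the variability of the learned classifier is ignored.

```python
# Sketch of the "brittle measure" misuse (my illustration; not from the post):
# a confidence interval that pretends the n pooled cross-validation outcomes
# are i.i.d. draws. They are not, so the stated confidence is not justified by
# the computation below.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 10))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)
correct = (pred == y).astype(float)

p = correct.mean()
naive_half = 1.96 * np.sqrt(p * (1 - p) / n)     # binomial width for n i.i.d. outcomes
print("CV accuracy %.3f, 'i.i.d.' 95%% interval +/- %.3f" % (p, naive_half))
print("per-fold accuracies:", [round(correct[te].mean(), 2) for _, te in cv.split(X)])
# The interval is the brittle part: each fold's classifier shares most of its
# training data with every other fold's, and the variability of the learned
# classifier never enters the formula, so the number overstates confidence
# about future performance.
```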
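Metric-choice sketch (for the choice-of-measure sentences above). My own illustration on synthetic imbalanced data: the same uninformative predictor looks strong under accuracy and useless under AUC or F1, so reporting only the most flattering measure over-represents it.

```python
# Sketch of measure shopping (my construction; not from the post). The scores
# are uninformative and the classes are imbalanced, so the same predictor can
# be made to look strong or useless depending on the reported measure.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

rng = np.random.default_rng(3)
y_true = (rng.random(1000) < 0.05).astype(int)   # 5% positives
scores = rng.random(1000)                        # scores carry no information
y_pred = (scores > 0.99).astype(int)             # predicts 0 almost always

print("accuracy:", accuracy_score(y_true, y_pred))   # about 0.94, sounds impressive
print("AUC     :", roc_auc_score(y_true, scores))    # about 0.5, i.e. no skill
print("F1      :", f1_score(y_true, y_pred))         # near 0, i.e. no skill
# Reporting only the most flattering of these over-represents the system; the
# remedy is to fix the problem-motivated measure before looking at results.
```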
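Dataset-selection sketch (for the data set selection sentence above). A small simulation of my own, not from the post: a method with no real edge over a baseline is run on many datasets, and reporting only the datasets where it happened to win produces an apparent gain.

```python
# Sketch of data set selection (my simulation; not from the post). The method's
# true mean gain over the baseline is zero; reporting only the datasets where
# it happened to win makes it look consistently better.
import numpy as np

rng = np.random.default_rng(4)
n_datasets = 20
# Accuracy difference (new method minus baseline) per dataset, centered at zero.
diff = rng.normal(loc=0.0, scale=0.02, size=n_datasets)

wins = diff[diff > 0]                              # report only the wins
print("selected datasets: mean gain %.3f over %d datasets" % (wins.mean(), len(wins)))
print("all datasets     : mean gain %.3f over %d datasets" % (diff.mean(), n_datasets))
# The first line looks like a solid improvement; the second is the honest number.
```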
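Time-series sketch (for the reprobleming example above). My own construction, not the post's: the model, the time-index feature, and the scikit-learn splitters are all assumptions. Scoring a time-series task with shuffled K-fold cross validation lets the model interpolate between past and future points, which a real forecasting task never allows; a forward-in-time split tells a very different story.

```python
# Sketch of reprobleming a time series (my construction; not from the post).
# The only feature is the time index; shuffled K-fold lets the model
# interpolate between neighboring time points.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
t = np.arange(500, dtype=float)
y = np.sin(t / 30.0) + 0.1 * rng.normal(size=500)   # slowly drifting series
X = t.reshape(-1, 1)

model = KNeighborsRegressor(n_neighbors=3)
shuffled = cross_val_score(model, X, y, scoring="r2",
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))
forward = cross_val_score(model, X, y, scoring="r2", cv=TimeSeriesSplit(n_splits=5))

print("shuffled K-fold R^2 : %.2f" % shuffled.mean())   # looks excellent
print("forward-in-time R^2 : %.2f" % forward.mean())    # far worse
```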
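Feedback-design sketch (for the old-datasets sentences above). A simulation of my own, not from the post: the labels are pure noise, yet greedily keeping every design change that improves the score on one fixed, long-released test set pushes that score up, while data nobody has tuned against stays at chance.

```python
# Sketch of old-dataset overfitting by feedback design (my simulation; not from
# the post). Labels are pure noise, yet hill-climbing against the fixed
# released test set inflates its score; fresh data show no improvement.
import numpy as np

rng = np.random.default_rng(6)
n, d = 1000, 200
X_old = rng.choice([-1, 1], size=(n, d))      # the long-released benchmark
y_old = rng.choice([-1, 1], size=n)           # labels carry no signal
X_new = rng.choice([-1, 1], size=(n, d))      # genuinely fresh data
y_new = rng.choice([-1, 1], size=n)

def acc(w, X, y):
    return (np.sign(X @ w + 0.5) == y).mean() # +0.5 breaks ties

w = np.zeros(d, dtype=int)                    # accumulated "design decisions"
best = acc(w, X_old, y_old)
for _ in range(3000):                         # years of tuning to the benchmark
    cand = w.copy()
    cand[rng.integers(d)] = rng.choice([-1, 0, 1])
    score = acc(cand, X_old, y_old)
    if score > best:                          # keep only what helps on the old set
        w, best = cand, score

print("accuracy on the old benchmark: %.2f" % best)                   # climbs above 0.5
print("accuracy on fresh data       : %.2f" % acc(w, X_new, y_new))   # stays near 0.5
```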
wordName wordTfidf (topN-words)
[('overfitting', 0.359), ('performance', 0.316), ('datasets', 0.308), ('test', 0.195), ('brittle', 0.175), ('overrepresenting', 0.15), ('measure', 0.139), ('cross', 0.127), ('percent', 0.123), ('reject', 0.122), ('prefer', 0.116), ('use', 0.113), ('old', 0.105), ('algorithm', 0.104), ('dataset', 0.102), ('canonical', 0.1), ('human', 0.096), ('validation', 0.092), ('training', 0.09), ('selection', 0.09), ('set', 0.088), ('papers', 0.086), ('confidence', 0.08), ('standard', 0.078), ('method', 0.078), ('methods', 0.078), ('examples', 0.077), ('example', 0.076), ('rate', 0.075), ('multiclass', 0.074), ('improvement', 0.073), ('choose', 0.072), ('account', 0.071), ('statistics', 0.071), ('captured', 0.067), ('discount', 0.067), ('indicating', 0.067), ('misuse', 0.067), ('participated', 0.067), ('reprobleming', 0.067), ('take', 0.066), ('using', 0.065), ('difficult', 0.064), ('ambiguous', 0.062), ('asymmetric', 0.062), ('bonus', 0.062), ('conjunction', 0.062), ('detect', 0.062), ('doubts', 0.062), ('mutual', 0.062)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 19 hunch net-2005-02-14-Clever Methods of Overfitting
2 0.22399612 131 hunch net-2005-11-16-The Everything Ensemble Edge
Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v
3 0.18508501 26 hunch net-2005-02-21-Problem: Cross Validation
Introduction: The essential problem here is the large gap between experimental observation and theoretical understanding. Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K . For each fold, a classifier is trained on the other folds and then test on the fold. Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate. Past Work (I’ll add more as I remember/learn.) Devroye , Rogers, and Wagner analyzed cross validation and found algorithm specific bounds. Not all of this is online, but here is one paper . Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K . Avrim Blum, Adam Kalai , and myself analyzed cross validation and found tha
4 0.15580584 177 hunch net-2006-05-05-An ICML reject
Introduction: Hal , Daniel , and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers. Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: the good performance on a base classifiers implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know. Speed. Searn has no problem handling much larger datasets than other algorithms we tested against. Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to
5 0.15199937 332 hunch net-2008-12-23-Use of Learning Theory
Introduction: I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication. Naturally, I disagree and believe that learning theory has much more substantial applications. Even in core learning algorithm design, I’ve found learning theory to be useful, although it’s application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice: It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example
6 0.14785241 284 hunch net-2008-01-18-Datasets
7 0.1468581 56 hunch net-2005-04-14-Families of Learning Theory Statements
8 0.14397426 362 hunch net-2009-06-26-Netflix nearly done
9 0.14375114 454 hunch net-2012-01-30-ICML Posters and Scope
10 0.14300765 371 hunch net-2009-09-21-Netflix finishes (and starts)
11 0.14286964 251 hunch net-2007-06-24-Interesting Papers at ICML 2007
12 0.14188342 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
13 0.14008588 41 hunch net-2005-03-15-The State of Tight Bounds
14 0.13825096 129 hunch net-2005-11-07-Prediction Competitions
15 0.13394485 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms
16 0.13030234 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
17 0.12988554 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
18 0.12865019 14 hunch net-2005-02-07-The State of the Reduction
19 0.12607527 12 hunch net-2005-02-03-Learning Theory, by assumption
20 0.1201584 163 hunch net-2006-03-12-Online learning or online preservation of learning?
topicId topicWeight
[(0, 0.28), (1, 0.111), (2, 0.076), (3, -0.011), (4, 0.048), (5, -0.029), (6, -0.023), (7, 0.038), (8, 0.0), (9, -0.022), (10, -0.139), (11, 0.261), (12, 0.107), (13, -0.086), (14, 0.02), (15, -0.016), (16, 0.049), (17, -0.031), (18, -0.122), (19, 0.072), (20, 0.029), (21, -0.078), (22, -0.037), (23, -0.057), (24, -0.021), (25, -0.002), (26, 0.013), (27, -0.103), (28, -0.058), (29, -0.074), (30, -0.022), (31, 0.09), (32, 0.11), (33, -0.076), (34, 0.133), (35, 0.119), (36, -0.034), (37, -0.02), (38, 0.059), (39, 0.058), (40, 0.142), (41, 0.047), (42, 0.006), (43, 0.038), (44, 0.069), (45, -0.01), (46, -0.085), (47, 0.02), (48, -0.051), (49, 0.048)]
simIndex simValue blogId blogTitle
same-blog 1 0.98487848 19 hunch net-2005-02-14-Clever Methods of Overfitting
2 0.769081 131 hunch net-2005-11-16-The Everything Ensemble Edge
Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v
3 0.72403532 284 hunch net-2008-01-18-Datasets
Introduction: David Pennock notes the impressive set of datasets at datawrangling .
4 0.72272569 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
Introduction: Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion. The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance. AROC is the area under the ROC curve. It is a single number summary of performance. The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.” The ROC vs. accuracy discussion is o
5 0.70389724 56 hunch net-2005-04-14-Families of Learning Theory Statements
Introduction: The diagram above shows a very broad viewpoint of learning theory. arrow Typical statement Examples Past->Past Some prediction algorithm A does almost as well as any of a set of algorithms. Weighted Majority Past->Future Assuming independent samples, past performance predicts future performance. PAC analysis, ERM analysis Future->Future Future prediction performance on subproblems implies future prediction performance using algorithm A . ECOC, Probing A basic question is: Are there other varieties of statements of this type? Avrim noted that there are also “arrows between arrows”: generic methods for transforming between Past->Past statements and Past->Future statements. Are there others?
6 0.70095545 26 hunch net-2005-02-21-Problem: Cross Validation
7 0.62469137 177 hunch net-2006-05-05-An ICML reject
8 0.60404241 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms
9 0.58961362 43 hunch net-2005-03-18-Binomial Weighting
10 0.56674284 362 hunch net-2009-06-26-Netflix nearly done
11 0.56043792 163 hunch net-2006-03-12-Online learning or online preservation of learning?
12 0.55238241 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge
13 0.52909636 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
14 0.51882708 160 hunch net-2006-03-02-Why do people count for learning?
15 0.51462859 129 hunch net-2005-11-07-Prediction Competitions
16 0.51315463 104 hunch net-2005-08-22-Do you believe in induction?
17 0.51037496 133 hunch net-2005-11-28-A question of quantification
18 0.50838566 87 hunch net-2005-06-29-Not EM for clustering at COLT
19 0.48674178 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
20 0.48459539 67 hunch net-2005-05-06-Don’t mix the solution into the problem
topicId topicWeight
[(3, 0.035), (4, 0.027), (16, 0.227), (27, 0.201), (37, 0.013), (38, 0.112), (51, 0.024), (53, 0.111), (55, 0.076), (94, 0.054), (95, 0.041)]
simIndex simValue blogId blogTitle
1 0.95364988 69 hunch net-2005-05-11-Visa Casualties
Introduction: For the Chicago 2005 machine learning summer school we are organizing, at least 5 international students can not come due to visa issues. There seem to be two aspects to visa issues: Inefficiency . The system rejected the student simply by being incapable of even starting to evaluate their visa in less than 1 month of time. Politics . Border controls became much tighter after the September 11 attack. Losing a big chunk of downtown of the largest city in a country will do that. What I (and the students) learned is that (1) is a much larger problem than (2). Only 1 prospective student seems to have achieved an explicit visa rejection. Fixing problem (1) should be a no-brainer, because the lag time almost surely indicates overload, and overload on border controls should worry even people concerned with (2). The obvious fixes to overload are “spend more money” and “make the system more efficient”. With respect to (2), (which is a more minor issue by the numbers) it i
2 0.94838619 176 hunch net-2006-05-01-A conversation between Theo and Pat
Introduction: Pat (the practitioner) I need to do multiclass classification and I only have a decision tree. Theo (the thoeretician) Use an error correcting output code . Pat Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them. Theo Oh? Is your problem a decision list? Pat No, I don’t think so. Theo Hmm. Are the classes well separated by axis aligned splits? Pat Err, maybe. I’m not sure. Theo Well, if they are, under the IID assumption I can tell you how many samples you need. Pat IID? The data is definitely not IID. Theo Oh dear. Pat Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other. Theo Ok. What do you know about your problem? Pat Not much. My friend just gave me the dataset. Theo Then, no one can help you. Pat (What a fuzzy thinker. Theo keeps jumping t
3 0.91426599 414 hunch net-2010-10-17-Partha Niyogi has died
Introduction: from brain cancer. I asked Misha who worked with him to write about it. Partha Niyogi, Louis Block Professor in Computer Science and Statistics at the University of Chicago passed away on October 1, 2010, aged 43. I first met Partha Niyogi almost exactly ten years ago when I was a graduate student in math and he had just started as a faculty in Computer Science and Statistics at the University of Chicago. Strangely, we first talked at length due to a somewhat convoluted mathematical argument in a paper on pattern recognition. I asked him some questions about the paper, and, even though the topic was new to him, he had put serious thought into it and we started regular meetings. We made significant progress and developed a line of research stemming initially just from trying to understand that one paper and to simplify one derivation. I think this was typical of Partha, showing both his intellectual curiosity and his intuition for the serendipitous; having a sense and focus fo
4 0.89230502 197 hunch net-2006-07-17-A Winner
Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm . Some characteristics of the algorithm are: Gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: The system has a probabilistic interpretation (which may aid design). There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).
same-blog 5 0.88128716 19 hunch net-2005-02-14-Clever Methods of Overfitting
6 0.7772823 447 hunch net-2011-10-10-ML Symposium and ICML details
7 0.76575798 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
8 0.74186528 131 hunch net-2005-11-16-The Everything Ensemble Edge
9 0.74030519 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
10 0.72942924 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
11 0.72814286 12 hunch net-2005-02-03-Learning Theory, by assumption
12 0.72593957 233 hunch net-2007-02-16-The Forgetting
13 0.72159356 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning
14 0.72135556 194 hunch net-2006-07-11-New Models
15 0.71847278 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class
16 0.71825397 353 hunch net-2009-05-08-Computability in Artificial Intelligence
17 0.71746188 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
18 0.71687269 380 hunch net-2009-11-29-AI Safety
19 0.71597254 235 hunch net-2007-03-03-All Models of Learning have Flaws
20 0.71591061 343 hunch net-2009-02-18-Decision by Vetocracy