hunch_net hunch_net-2005 hunch_net-2005-26 knowledge-graph by maker-knowledge-mining

26 hunch net-2005-02-21-Problem: Cross Validation


meta info for this blog

Source: html

Introduction: The essential problem here is the large gap between experimental observation and theoretical understanding. Method: K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then tested on the fold. Problem: Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate. Past Work: (I’ll add more as I remember/learn.) Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this is online, but here is one paper. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K. Avrim Blum, Adam Kalai, and I analyzed cross validation and found that you can do at least as well as a test set of size m/K with no additional assumptions using the randomized classifier which draws uniformly from the set of size K.
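
As a concrete illustration of the method just described, here is a minimal sketch of K-fold cross validation in Python. The function train is a hypothetical placeholder for whatever learning algorithm is being evaluated (anything mapping a list of (x, y) pairs to a predictor), and the data is assumed to be a list of independent (x, y) samples; nothing here is specific to the analyses discussed in the post.

import random

def k_fold_cross_validation(data, K, train):
    # data  : list of (x, y) pairs, assumed to be independent samples
    # K     : number of folds
    # train : hypothetical placeholder; maps a list of (x, y) pairs to a
    #         predictor f, where f(x) returns a predicted label
    data = list(data)
    random.shuffle(data)                      # random partition into K folds
    folds = [data[i::K] for i in range(K)]    # each fold holds roughly m/K examples
    classifiers, fold_errors = [], []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j in range(K) if j != i for ex in folds[j]]
        f = train(training)                   # train on the other K-1 folds
        err = sum(f(x) != y for x, y in held_out) / len(held_out)
        classifiers.append(f)                 # keep the K fold classifiers
        fold_errors.append(err)               # held-out error on the fold
    return classifiers, sum(fold_errors) / K  # fold classifiers and CV error

The open problem is then what to output, and what to guarantee, given the K classifiers and fold errors returned by a procedure like this one.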


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The essential problem here is the large gap between experimental observation and theoretical understanding. [sent-1, score-0.07]

2 Method: K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. [sent-2, score-1.228]

3 For each fold, a classifier is trained on the other folds and then tested on the fold. [sent-3, score-0.701]

4 Derive a classifier from the K classifiers with a small bound on the true error rate (restated formally after this list). [sent-5, score-0.342]

5 Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. [sent-7, score-1.187]

6 Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K. [sent-9, score-2.323]

7 Avrim Blum, Adam Kalai, and I analyzed cross validation and found that you can do at least as well as a test set of size m/K with no additional assumptions using the randomized classifier which draws uniformly from the set of size K (see the sketch after this list). [sent-10, score-2.686]

8 Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there was no unbiased estimator of variance. [sent-11, score-1.235]

9 Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions (see the sketch after this list). [sent-12, score-0.591]

10 Some Extreme Cases to Sharpen Intuition: Suppose on every fold the learned classifier is the same. [sent-13, score-0.386]

11 Then, the cross-validation error should behave something like a test set of size m. [sent-14, score-0.646]

12 This is radically superior to a test set of size m/K (see the comparison after this list). [sent-15, score-0.525]

13 Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5. [sent-18, score-0.203]

14 …0.5, all leave-one-out errors will be 0 and otherwise 1 (like a single coin flip; see the simulation after this list). [sent-21, score-0.071]

15 I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. [sent-23, score-0.689]

16 Analyzing the dependency structure of cross validation is quite difficult. [sent-24, score-0.941]

17 Impact: On any individual problem, solving this might have only a small impact due to slightly improved judgement of success. [sent-25, score-0.172]

18 But, because cross validation is used extensively, the overall impact of a good solution might be very significant. [sent-26, score-0.977]
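
Stated slightly more formally (a paraphrase of the problem above; the confidence parameter delta is left implicit in the post), the goal is:

\[
\text{Given } S=\{(x_i,y_i)\}_{i=1}^{m} \text{ drawn i.i.d. from } D \text{ and the fold classifiers } f_1,\dots,f_K, \text{ output } \hat{f} \text{ and } \hat{e} \text{ such that}
\]
\[
\Pr_{S}\big[\, e_D(\hat{f}) \le \hat{e} \,\big] \ge 1-\delta, \qquad \text{where } e_D(f)=\Pr_{(x,y)\sim D}\big[f(x)\ne y\big],
\]
with \(\hat{e}\) as small as possible.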
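
The randomized classifier in sentence 7 admits an essentially one-line sketch: draw one of the K fold classifiers uniformly at random on each prediction. The code below is this obvious construction, not code from the paper, and it assumes the fold classifiers are available as callables.

import random

def randomized_cv_classifier(classifiers):
    # classifiers: the K classifiers, one trained per fold
    def predict(x):
        f = random.choice(classifiers)   # uniform draw over the K fold classifiers
        return f(x)                      # predict with the drawn classifier
    return predict

The test-set-style guarantee attaches to this stochastic predictor as a whole, not to any single fold classifier.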
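
For the derandomization noted in sentence 9, one natural way to use unlabeled data is sketched below for binary labels in {0, 1}: pick the single fold classifier whose predictions on the unlabeled points are closest to the average prediction of the uniform mixture over the K classifiers. This is an illustrative reading of the idea, not necessarily Kääriäinen's exact construction.

def derandomize(classifiers, unlabeled_xs):
    # Average prediction of the uniform mixture on each unlabeled point.
    avg = [sum(f(x) for f in classifiers) / len(classifiers) for x in unlabeled_xs]

    def distance_to_mixture(f):
        # Total deviation of f from the mixture's behaviour on the unlabeled data.
        return sum(abs(f(x) - a) for x, a in zip(unlabeled_xs, avg))

    # Return the deterministic fold classifier that best mimics the stochastic one.
    return min(classifiers, key=distance_to_mixture)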
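
To make “radically superior” (sentence 12) concrete: the width of a standard Hoeffding-style test set confidence interval on n samples scales as sqrt(ln(2/delta) / (2n)), so shrinking the test set from m to m/K inflates the interval by roughly a factor of sqrt(K). A quick check, with m = 1000, K = 10, and delta = 0.05 chosen arbitrarily:

from math import log, sqrt

def hoeffding_width(n, delta=0.05):
    # Two-sided Hoeffding deviation for an empirical error rate on n samples.
    return sqrt(log(2 / delta) / (2 * n))

m, K = 1000, 10                      # arbitrary illustrative values
print(hoeffding_width(m))            # interval half-width for a test set of size m
print(hoeffding_width(m // K))       # about sqrt(K) times wider for size m/K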
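
Sentences 13 and 14 sketch a degenerate leave-one-out case, but the summary omits the sentences specifying the learning rule. The simulation below fills that gap with an assumed parity-style rule (predict the parity of the training labels, ignoring x); under this assumption the stated behaviour appears exactly: when the labels are fair coin flips, every leave-one-out error equals the parity of all m labels, so the cross-validation estimate is a single coin flip rather than an average of m nearly independent ones.

import random

def parity_learner(labels):
    # Assumed illustrative rule, not necessarily the post's exact example:
    # predict the parity of the training labels, ignoring the input x.
    p = sum(labels) % 2
    return lambda x: p

def leave_one_out_errors(labels):
    errors = []
    for i in range(len(labels)):
        training = labels[:i] + labels[i + 1:]
        prediction = parity_learner(training)(None)
        errors.append(int(prediction != labels[i]))
    return errors

random.seed(0)
labels = [random.randint(0, 1) for _ in range(100)]   # y = 1 with probability 0.5
print(leave_one_out_errors(labels))                   # all 0s or all 1s, set by parity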


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cross', 0.439), ('validation', 0.425), ('folds', 0.26), ('analyzed', 0.238), ('size', 0.222), ('classifier', 0.213), ('fold', 0.173), ('test', 0.161), ('set', 0.142), ('suppose', 0.134), ('additional', 0.12), ('impact', 0.113), ('found', 0.085), ('assumptions', 0.082), ('dana', 0.077), ('dependency', 0.077), ('draws', 0.077), ('inen', 0.077), ('randomizes', 0.077), ('rogers', 0.077), ('ron', 0.077), ('coin', 0.071), ('estimator', 0.071), ('matti', 0.071), ('bound', 0.07), ('problem', 0.07), ('bengio', 0.067), ('blum', 0.067), ('learns', 0.067), ('picks', 0.067), ('ri', 0.067), ('trained', 0.067), ('yoshua', 0.067), ('probability', 0.066), ('behave', 0.062), ('noted', 0.062), ('pervasive', 0.062), ('unbiased', 0.062), ('without', 0.06), ('randomized', 0.059), ('stability', 0.059), ('judgement', 0.059), ('kalai', 0.059), ('kearns', 0.059), ('safely', 0.059), ('uniformly', 0.059), ('error', 0.059), ('avrim', 0.058), ('intuition', 0.058), ('suspect', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 26 hunch net-2005-02-21-Problem: Cross Validation


2 0.30457109 86 hunch net-2005-06-28-The cross validation problem: cash reward

Introduction: I just presented the cross validation problem at COLT. The problem now has a cash prize (up to $500) associated with it—see the presentation for details. The write-up for COLT.

3 0.18508501 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

4 0.17669176 41 hunch net-2005-03-15-The State of Tight Bounds

Introduction: What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now: There are several families of bounds, based on how information is used. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation (also here and here), train and test bounds, and cross-validation (but see the big open problem). These tec

5 0.16539803 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression. For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. (Calibrated) boosted decision trees appear to perform best in general, although support v

6 0.14922437 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

7 0.13908684 85 hunch net-2005-06-28-A COLT paper

8 0.11931419 14 hunch net-2005-02-07-The State of the Reduction

9 0.11396057 43 hunch net-2005-03-18-Binomial Weighting

10 0.10951069 44 hunch net-2005-03-21-Research Styles in Machine Learning

11 0.10232482 444 hunch net-2011-09-07-KDD and MUCMD 2011

12 0.098014772 2 hunch net-2005-01-24-Holy grails of machine learning?

13 0.097367957 239 hunch net-2007-04-18-$50K Spock Challenge

14 0.09509249 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

15 0.091639824 259 hunch net-2007-08-19-Choice of Metrics

16 0.087316073 45 hunch net-2005-03-22-Active learning

17 0.086345941 35 hunch net-2005-03-04-The Big O and Constants in Learning

18 0.08605393 274 hunch net-2007-11-28-Computational Consequences of Classification

19 0.082847178 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

20 0.08004085 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.171), (1, 0.108), (2, 0.047), (3, -0.062), (4, 0.042), (5, -0.05), (6, 0.083), (7, 0.012), (8, -0.052), (9, -0.05), (10, -0.041), (11, 0.228), (12, 0.037), (13, -0.045), (14, 0.092), (15, -0.089), (16, 0.014), (17, 0.046), (18, -0.165), (19, 0.066), (20, -0.015), (21, 0.042), (22, -0.025), (23, 0.018), (24, 0.043), (25, 0.025), (26, 0.029), (27, 0.052), (28, -0.014), (29, -0.023), (30, -0.009), (31, -0.002), (32, 0.006), (33, -0.015), (34, 0.027), (35, 0.241), (36, -0.016), (37, 0.036), (38, 0.003), (39, -0.001), (40, 0.096), (41, -0.057), (42, -0.078), (43, -0.018), (44, -0.002), (45, 0.024), (46, 0.021), (47, 0.009), (48, -0.011), (49, -0.024)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98147827 26 hunch net-2005-02-21-Problem: Cross Validation


2 0.65762371 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai, Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X, is extraordinarily ch

3 0.64255214 19 hunch net-2005-02-14-Clever Methods of Overfitting


4 0.62032646 41 hunch net-2005-03-15-The State of Tight Bounds


5 0.61934322 131 hunch net-2005-11-16-The Everything Ensemble Edge


6 0.59042943 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

7 0.57273203 86 hunch net-2005-06-28-The cross validation problem: cash reward

8 0.55955452 43 hunch net-2005-03-18-Binomial Weighting

9 0.53906542 85 hunch net-2005-06-28-A COLT paper

10 0.49857104 32 hunch net-2005-02-27-Antilearning: When proximity goes bad

11 0.4938935 163 hunch net-2006-03-12-Online learning or online preservation of learning?

12 0.48085663 170 hunch net-2006-04-06-Bounds greater than 1

13 0.47835824 138 hunch net-2005-12-09-Some NIPS papers

14 0.46872884 2 hunch net-2005-01-24-Holy grails of machine learning?

15 0.46679318 56 hunch net-2005-04-14-Families of Learning Theory Statements

16 0.46232793 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

17 0.45261791 67 hunch net-2005-05-06-Don’t mix the solution into the problem

18 0.45019639 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

19 0.44632623 244 hunch net-2007-05-09-The Missing Bound

20 0.44177976 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.015), (27, 0.223), (38, 0.129), (46, 0.309), (49, 0.021), (51, 0.031), (53, 0.051), (55, 0.052), (94, 0.019), (95, 0.045)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.86525494 397 hunch net-2010-05-02-What’s the difference between gambling and rewarding good prediction?

Introduction: After a major financial crisis, there is much discussion about how finance has become a casino: gambling with others’ money, keeping the winnings, and walking away when the money is lost. When thinking about financial reform, all the many losers in the above scenario are apt to take the view that this activity should be completely, or nearly completely, curtailed. But a more thoughtful view is that sometimes there is a real sense in which there are right and wrong decisions, and we as a society would really prefer that the people most likely to make right decisions are making them. A crucial question then is: “What is the difference between gambling and rewarding good prediction?” We discussed this before the financial crisis. The cheat-sheet sketch is that the online-learning-against-an-adversary problem, algorithm, and theorems provide a good mathematical model for thinking about this question. What I would like to do here is map this onto various types of financial t

same-blog 2 0.85412687 26 hunch net-2005-02-21-Problem: Cross Validation


3 0.75324595 177 hunch net-2006-05-05-An ICML reject

Introduction: Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers. Theoretical underpinning: Searn is a reduction which comes with a reduction guarantee: good performance of the base classifier implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know. Speed: Searn has no problem handling much larger datasets than other algorithms we tested against. Simplicity: Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to

4 0.74897283 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

Introduction: Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made. Online learning (as meant here) is a subfield of learning theory which analyzes the online learning model. In the online learning model, there is a set of hypotheses or “experts”. On any instance x, each expert makes a prediction y. A master algorithm A uses these predictions to form its own prediction y_A and then learns the correct prediction y*. This process repeats. The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove: L(A) les

5 0.70887786 11 hunch net-2005-02-02-Paper Deadlines

Introduction: It’s conference season, and the smell of budding papers is in the air. IJCAI 2005, January 21; COLT 2005, February 2; KDD 2005, February 18; ICML 2005, March 8; UAI 2005, March 16; AAAI 2005, March 18.

6 0.64738232 131 hunch net-2005-11-16-The Everything Ensemble Edge

7 0.63982898 353 hunch net-2009-05-08-Computability in Artificial Intelligence

8 0.63649571 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

9 0.63409853 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

10 0.63317746 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

11 0.62416995 14 hunch net-2005-02-07-The State of the Reduction

12 0.62223411 233 hunch net-2007-02-16-The Forgetting

13 0.62027735 12 hunch net-2005-02-03-Learning Theory, by assumption

14 0.6183939 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

15 0.61360639 259 hunch net-2007-08-19-Choice of Metrics

16 0.61328465 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

17 0.61264277 19 hunch net-2005-02-14-Clever Methods of Overfitting

18 0.60975701 235 hunch net-2007-03-03-All Models of Learning have Flaws

19 0.60962051 194 hunch net-2006-07-11-New Models

20 0.60777748 170 hunch net-2006-04-06-Bounds greater than 1