hunch_net hunch_net-2007 hunch_net-2007-247 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound, which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than to a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai, Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X, is extraordinarily challenging, as judged by the lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than those that exist in classification PAC-learning. Under the assumption that the level sets of the correct regressed value are halfspaces (together with a Lipschitz condition on the regressed value), this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)
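For orientation only, here is a schematic of the kind of PAC-Bayes bound being referenced (one common Langford/Seeger-style form, not the exact statement from either paper; constants and logarithmic factors vary across versions). With a prior P over base classifiers, any posterior Q, and m i.i.d. samples, with probability at least 1 - delta,

$$ \mathrm{kl}\big(\hat{e}_Q \,\|\, e_Q\big) \;\le\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m}, $$

where kl(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)) is the binary relative entropy, e_Q = E_{h~Q}[e(h)] is the true error rate of the Q-randomized (Gibbs) classifier, and ê_Q is its empirical error rate. The guarantee is on the randomized classifier, which is exactly the disadvantage noted above; an Occam’s-Hammer-style result instead yields a bound that holds for a single classifier drawn from Q.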
sentIndex sentText sentNum sentScore
1 When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight . [sent-3, score-0.738]
2 A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. [sent-4, score-1.419]
3 This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. [sent-5, score-1.183]
4 The ability to safely use a single classifier is very nice. [sent-6, score-0.555]
5 This technique applies generically to any base bound, so it has other applications covered in the paper. [sent-7, score-0.638]
6 Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily challenging as judged by lack of progress over a long period of time. [sent-10, score-0.894]
7 This paper is about regression PAC-learning, and the results appear much more encouraging than exist in classification PAC-learning. [sent-11, score-0.443]
8 Under the assumption that: The level sets of the correct regressed value are halfspaces. [sent-12, score-0.447]
9 this paper proves that a good regressor can be PAC-learned using a boosting algorithm. [sent-14, score-0.443]
10 (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.) [sent-15, score-0.321]
wordName wordTfidf (topN-words)
[('lipschitz', 0.318), ('uphill', 0.318), ('classifier', 0.26), ('bound', 0.258), ('tight', 0.19), ('single', 0.186), ('sets', 0.179), ('applies', 0.179), ('base', 0.17), ('trees', 0.164), ('paper', 0.134), ('halfspaces', 0.131), ('condition', 0.123), ('generically', 0.123), ('level', 0.118), ('encouraging', 0.118), ('classification', 0.115), ('judged', 0.113), ('occam', 0.113), ('learnable', 0.113), ('proves', 0.113), ('regressor', 0.113), ('randomized', 0.109), ('kalai', 0.109), ('safely', 0.109), ('period', 0.106), ('disadvantage', 0.103), ('challenging', 0.103), ('holds', 0.1), ('tempting', 0.1), ('adam', 0.097), ('extraordinarily', 0.097), ('proved', 0.097), ('decision', 0.097), ('covered', 0.091), ('special', 0.09), ('boosting', 0.083), ('classifiers', 0.08), ('empirically', 0.077), ('assumption', 0.076), ('regression', 0.076), ('technique', 0.075), ('drawn', 0.074), ('shows', 0.074), ('prove', 0.074), ('set', 0.074), ('correct', 0.074), ('input', 0.074), ('progress', 0.071), ('lack', 0.069)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
2 0.17266485 41 hunch net-2005-03-15-The State of Tight Bounds
Introduction: What? Bounds are mathematical formulas relating observations to future error rates, assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now There are several families of bounds, based on how information is used. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation (also here and here), train and test bounds, and cross-validation (but see the big open problem). These tec
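To make the testing-bounds family concrete, here is a minimal sketch (not taken from any of the linked posts; the function name and the scipy dependency are my own choices) of a binomial test set bound: an upper confidence bound on the true error rate computed from the number of mistakes on held-out i.i.d. data.

from scipy.stats import binom

def test_set_bound(k_errors, n_test, delta=0.05, tol=1e-8):
    # With probability >= 1 - delta over an i.i.d. test set of size n_test
    # that was not used in training, the true error rate is at most the
    # largest p for which observing <= k_errors mistakes is still plausible.
    lo, hi = k_errors / n_test, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom.cdf(k_errors, n_test, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi

print(test_set_bound(12, 1000))  # roughly 0.02: a 2% upper bound from 12/1000 mistakes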
3 0.1575183 85 hunch net-2005-06-28-A COLT paper
Introduction: I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.
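Progressive validation, referenced at the end, is easy to state in code. A minimal sketch follows (class and function names are my own; any online learner exposing predict/update would do): each example is predicted before the learner trains on it, so the average of those mistakes is an out-of-sample estimate of the error rate that bounds of this kind aim to control.

class MajorityLearner:
    # Toy online learner: always predicts the most common label seen so far.
    def __init__(self):
        self.counts = {0: 0, 1: 0}
    def predict(self, x):
        return 1 if self.counts[1] >= self.counts[0] else 0
    def update(self, x, y):
        self.counts[y] += 1

def progressive_validation_error(learner, stream):
    errors, examples = 0, list(stream)
    for x, y in examples:
        if learner.predict(x) != y:   # predict before training on (x, y)
            errors += 1
        learner.update(x, y)
    return errors / max(len(examples), 1)

stream = [(i, 1 if i % 3 == 0 else 0) for i in range(99)]
print(progressive_validation_error(MajorityLearner(), stream))  # about 0.35 for this stream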
4 0.15707658 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c:X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) ≠ y). The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s:X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2. It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction a squared error regret of r
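A minimal sketch of the simple reduction described above (sklearn’s linear regressor is used purely for illustration, and the function name is my own; any squared-loss regressor works): fit s(x) ≈ D(y=1|x) by squared error on {0,1} labels, then threshold at 0.5.

import numpy as np
from sklearn.linear_model import LinearRegression

def classify_via_regression(X_train, y_train, X_test):
    # Fit a squared-error regressor to binary labels; its prediction
    # approximates D(y=1|x), so thresholding at 0.5 yields a classifier.
    s = LinearRegression().fit(X_train, y_train)
    return (s.predict(X_test) >= 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
preds = classify_via_regression(X[:200], y[:200], X[200:])
print((preds == y[200:]).mean())   # typically around 0.9 on this toy problem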
5 0.15547876 244 hunch net-2007-05-09-The Missing Bound
Introduction: Sham Kakade points out that we are missing a bound. Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^{0.5}) (at least in crude form). A proof is on page 9 here. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning,
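For reference, the two deviation bounds being contrasted have roughly the following shapes, each holding with probability at least 1 - delta for m i.i.d. samples with range of size I (stated loosely; constants differ across versions, and the second is the Bernstein-style form commonly derived from Bennett’s inequality):

$$ \Big|\tfrac{1}{m}\textstyle\sum_{i=1}^m x_i - \mathbb{E}[x]\Big| \;\le\; I\sqrt{\frac{\ln(2/\delta)}{2m}} \qquad \text{(Chernoff/Hoeffding)} $$

$$ \Big|\tfrac{1}{m}\textstyle\sum_{i=1}^m x_i - \mathbb{E}[x]\Big| \;\le\; \sqrt{\frac{2\sigma^2\ln(2/\delta)}{m}} + \frac{2I\ln(2/\delta)}{3m} \qquad \text{(Bennett/Bernstein)} $$

When the variance sigma^2 is much smaller than I^2, the second is far tighter, but it requires the true variance (or an upper bound on it), which is exactly the quantity the post warns against replacing with the observed variance.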
6 0.14983985 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
7 0.14922437 26 hunch net-2005-02-21-Problem: Cross Validation
8 0.13276488 274 hunch net-2007-11-28-Computational Consequences of Classification
9 0.12081455 14 hunch net-2005-02-07-The State of the Reduction
10 0.11442 43 hunch net-2005-03-18-Binomial Weighting
11 0.11353772 170 hunch net-2006-04-06-Bounds greater than 1
12 0.11239773 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science
13 0.10108186 220 hunch net-2006-11-27-Continuizing Solutions
14 0.10080582 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions
15 0.099963218 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009
16 0.09952344 131 hunch net-2005-11-16-The Everything Ensemble Edge
17 0.099206686 133 hunch net-2005-11-28-A question of quantification
18 0.098985419 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
19 0.095534243 406 hunch net-2010-08-22-KDD 2010
20 0.092674427 164 hunch net-2006-03-17-Multitask learning is Black-Boxable
topicId topicWeight
[(0, 0.183), (1, 0.099), (2, 0.114), (3, -0.086), (4, 0.047), (5, -0.038), (6, 0.131), (7, -0.028), (8, -0.085), (9, -0.082), (10, 0.027), (11, 0.124), (12, -0.03), (13, -0.132), (14, 0.014), (15, -0.026), (16, -0.028), (17, 0.098), (18, -0.064), (19, 0.004), (20, 0.082), (21, 0.073), (22, -0.101), (23, 0.027), (24, -0.018), (25, 0.022), (26, 0.009), (27, 0.105), (28, 0.017), (29, -0.006), (30, 0.004), (31, -0.093), (32, -0.094), (33, -0.026), (34, 0.036), (35, 0.077), (36, 0.008), (37, -0.036), (38, 0.04), (39, 0.065), (40, -0.011), (41, -0.103), (42, -0.041), (43, -0.013), (44, 0.005), (45, 0.001), (46, 0.047), (47, -0.012), (48, 0.025), (49, -0.119)]
simIndex simValue blogId blogTitle
same-blog 1 0.98164111 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
2 0.71046799 85 hunch net-2005-06-28-A COLT paper
3 0.67890918 41 hunch net-2005-03-15-The State of Tight Bounds
4 0.65744364 26 hunch net-2005-02-21-Problem: Cross Validation
Introduction: The essential problem here is the large gap between experimental observation and theoretical understanding. Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then tested on the fold. Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate. Past Work (I’ll add more as I remember/learn.) Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this is online, but here is one paper. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K. Avrim Blum, Adam Kalai, and myself analyzed cross validation and found tha
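The procedure itself, in a minimal sketch (plain Python with my own function names; train and error stand in for any learner and loss):

def k_fold_errors(examples, K, train, error):
    folds = [examples[i::K] for i in range(K)]          # K folds of size ~ m/K
    errs = []
    for i in range(K):
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        clf = train(rest)                    # train on the other K - 1 folds
        errs.append(error(clf, folds[i]))    # test on the held-out fold
    return errs   # the open problem: turn these into a tight true-error bound

# Example with a trivial "predict the majority label" learner:
data = [(x, int(x > 0)) for x in range(-50, 50)]
train = lambda S: (lambda x, maj=int(2 * sum(y for _, y in S) >= len(S)): maj)
error = lambda clf, S: sum(clf(x) != y for x, y in S) / len(S)
print(k_fold_errors(data, 5, train, error))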
5 0.65385097 244 hunch net-2007-05-09-The Missing Bound
6 0.65357345 170 hunch net-2006-04-06-Bounds greater than 1
7 0.54906219 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
8 0.54690617 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science
9 0.53533328 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
10 0.52502942 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
11 0.50855857 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
12 0.5053637 138 hunch net-2005-12-09-Some NIPS papers
13 0.47236761 439 hunch net-2011-08-01-Interesting papers at COLT 2011
14 0.4645322 163 hunch net-2006-03-12-Online learning or online preservation of learning?
15 0.45903438 14 hunch net-2005-02-07-The State of the Reduction
16 0.45108283 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
17 0.44846097 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009
18 0.44433281 131 hunch net-2005-11-16-The Everything Ensemble Edge
19 0.43639651 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
20 0.42954537 392 hunch net-2010-03-26-A Variance only Deviation Bound
topicId topicWeight
[(27, 0.853), (55, 0.034)]
simIndex simValue blogId blogTitle
1 0.99980223 274 hunch net-2007-11-28-Computational Consequences of Classification
Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e-error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter space of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints
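The active-learning claim above is usually illustrated with threshold classifiers on [0,1]: with noiseless labels, binary search over the unlabeled pool finds an epsilon-accurate threshold using only about log(1/epsilon) label queries, versus roughly 1/epsilon for passive labeling. A minimal toy sketch of mine, not from the post:

def learn_threshold(query_label, epsilon):
    # Binary search for the decision boundary; each iteration spends one label.
    lo, hi, labels_used = 0.0, 1.0, 0
    while hi - lo > epsilon:
        mid = (lo + hi) / 2
        labels_used += 1
        if query_label(mid) == 1:   # label 1 means "at or above the threshold"
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, labels_used

true_threshold = 0.37
estimate, labels = learn_threshold(lambda x: int(x >= true_threshold), 1e-3)
print(estimate, labels)   # ~0.37 after ~10 labels instead of ~1000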
2 0.9992137 166 hunch net-2006-03-24-NLPers
Introduction: Hal Daume has started the NLPers blog to discuss learning for language problems.
3 0.9992137 246 hunch net-2007-06-13-Not Posting
Introduction: If you have been disappointed by the lack of a post for the last month, consider contributing your own (I’ve been busy+uninspired). Also, keep in mind that there is a community of machine learning blogs (see the sidebar).
4 0.9992137 418 hunch net-2010-12-02-Traffic Prediction Problem
Introduction: Slashdot points out the Traffic Prediction Challenge which looks pretty fun. The temporal aspect seems to be very common in many real-world problems and somewhat understudied.
same-blog 5 0.99918866 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
6 0.99623722 308 hunch net-2008-07-06-To Dual or Not
7 0.99565452 288 hunch net-2008-02-10-Complexity Illness
8 0.99534428 172 hunch net-2006-04-14-JMLR is a success
9 0.99445117 245 hunch net-2007-05-12-Loss Function Semantics
10 0.99313772 400 hunch net-2010-06-13-The Good News on Exploration and Learning
11 0.98801661 45 hunch net-2005-03-22-Active learning
12 0.97729665 9 hunch net-2005-02-01-Watchword: Loss
13 0.97300833 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
14 0.97230822 352 hunch net-2009-05-06-Machine Learning to AI
15 0.96731687 304 hunch net-2008-06-27-Reviewing Horror Stories
16 0.95887548 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
17 0.9464668 483 hunch net-2013-06-10-The Large Scale Learning class notes
18 0.94335711 244 hunch net-2007-05-09-The Missing Bound
19 0.94119173 293 hunch net-2008-03-23-Interactive Machine Learning
20 0.93265337 67 hunch net-2005-05-06-Don’t mix the solution into the problem