hunch_net hunch_net-2007 hunch_net-2007-244 knowledge-graph by maker-knowledge-mining

244 hunch net-2007-05-09-The Missing Bound


meta info for this blog

Source: html

Introduction: Sham Kakade points out that we are missing a bound. Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). A proof is on page 9 here. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. Lacking it, many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning,
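As a rough numeric sketch of the gap between the two bounds (two-sided versions with a ln(2/d) term; exact constants vary by statement, and the Bernstein-style form below is a standard relaxation of Bennett's bound, used here as an illustration):

```python
import math

def hoeffding_width(interval, m, delta):
    # Chernoff/Hoeffding: deviation ~ I * sqrt(ln(2/delta) / (2m)), i.e. O(I/m^0.5).
    return interval * math.sqrt(math.log(2 / delta) / (2 * m))

def bernstein_width(true_variance, interval, m, delta):
    # Bernstein-style relaxation of Bennett's bound; valid only with the
    # TRUE variance (or an upper bound on it), not the observed variance.
    return (math.sqrt(2 * true_variance * math.log(2 / delta) / m)
            + interval * math.log(2 / delta) / (3 * m))

m, delta, I = 1000, 0.05, 1.0
print(hoeffding_width(I, m, delta))        # ~0.043
print(bernstein_width(0.01, I, m, delta))  # ~0.010: far tighter when the variance is small
```

This is exactly why plugging the observed variance in as the true variance is so tempting: the resulting number is much smaller, but the bound no longer holds.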


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Sham Kakade points out that we are missing a bound. [sent-1, score-0.069]

2 Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). [sent-3, score-1.506]

3 If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). [sent-6, score-1.683]

4 This can be a huge improvement when the true variance is small. [sent-7, score-0.609]

5 What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. [sent-8, score-1.81]

6 This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. [sent-9, score-1.961]

7 Most of the time, they get away with it, but this is a dangerous move when doing machine learning. [sent-10, score-0.132]

8 In machine learning, we are typically trying to find a predictor with 0 expected loss. [sent-11, score-0.058]

9 An observed loss of 0 (i.e., 0 training error) implies an observed variance of 0. [sent-14, score-0.728]

10 Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss. [sent-15, score-0.693]

11 One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound on the mean (making sure to share the confidence parameter between both applications). [sent-16, score-3.775]
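A sketch of this two-step recipe. The McDiarmid slack constant below is illustrative only: the observed variance has bounded differences of order I^2/m, but the exact constant is not verified here; the point is the structure, especially the delta/2 confidence sharing.

```python
import math

def variance_upper_bound(v_obs, interval, m, delta):
    # McDiarmid step (sketch): changing one sample moves the observed
    # variance by O(interval**2 / m), so McDiarmid gives a deviation of
    # roughly interval**2 * sqrt(ln(1/delta) / (2m)).  The constant is
    # illustrative, not a verified derivation.
    slack = interval ** 2 * math.sqrt(math.log(1 / delta) / (2 * m))
    return v_obs + slack

def bennett_style_width(variance, interval, m, delta):
    # Bernstein's relaxation of Bennett's bound, given a true-variance bound.
    return (math.sqrt(2 * variance * math.log(1 / delta) / m)
            + interval * math.log(1 / delta) / (3 * m))

def safe_mean_bound(u_obs, v_obs, interval, m, delta):
    # Share the confidence parameter: delta/2 for the variance step and
    # delta/2 for the mean step, so both hold together w.p. >= 1 - delta.
    v_hi = variance_upper_bound(v_obs, interval, m, delta / 2)
    return u_obs + bennett_style_width(v_hi, interval, m, delta / 2)
```

Note that even with v_obs = 0 the variance step leaves residual slack of order I^2/m^0.5, which is what makes this composition clumsy compared to the missing bound.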

12 This is a clumsy and relatively inelegant method. [sent-17, score-0.054]

13 If we let the observed mean of a sample S be u(S) and the observed variance be v(S) , there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. [sent-19, score-1.823]

14 It should have the form: Pr_{S ~ D^m}( E_{x~D} x <= f(u(S), v(S), d) ) >= 1 - d. For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. [sent-20, score-0.579]
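Bounds of exactly this form were later established in the literature (commonly called empirical Bernstein bounds). As a hedged sketch, one often-stated version for x in [0, 1], with constants as commonly quoted (treat them as approximate, not authoritative):

```python
import math

def empirical_bernstein_bound(u_obs, v_obs, m, delta):
    """Upper-confidence bound f(u(S), v(S), d) on E[x] for x in [0, 1],
    using only the observed mean u_obs and observed (unbiased) variance
    v_obs.  Constants follow the empirical Bernstein bound as commonly
    stated (e.g., by Maurer and Pontil); treat this as a sketch.
    """
    log_term = math.log(2 / delta)
    return (u_obs
            + math.sqrt(2 * v_obs * log_term / m)
            + 7 * log_term / (3 * (m - 1)))
```

With v_obs = 0 (e.g., a predictor with 0 training error) the variance term vanishes and the bound degrades only as O(log(1/d)/m) rather than O(1/m^0.5), giving the tightness the post asks for without the invalid plug-in.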

15 However, there are many other applications both within and beyond machine learning. [sent-21, score-0.149]

16 (*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. [sent-22, score-1.019]

17 This is never as good as a proper application of the Chernoff/Hoeffding bound. [sent-23, score-0.047]
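A quick numeric check of (*): the worst-case variance on a range of size I is (I/2)^2, and plugging it into a Bernstein-style relaxation of Bennett's bound reproduces the Hoeffding term exactly, plus an extra positive term, so it can never win (one-sided ln(1/d) versions for simplicity):

```python
import math

def hoeffding_width(interval, m, delta):
    return interval * math.sqrt(math.log(1 / delta) / (2 * m))

def bernstein_width(variance, interval, m, delta):
    return (math.sqrt(2 * variance * math.log(1 / delta) / m)
            + interval * math.log(1 / delta) / (3 * m))

# With variance = (I/2)**2, the first Bernstein term equals the Hoeffding
# width exactly, and the second term only adds to it.
I, delta = 1.0, 0.05
for m in (100, 1000, 10000):
    h = hoeffding_width(I, m, delta)
    b = bernstein_width((I / 2) ** 2, I, m, delta)
    assert b > h, (m, h, b)
```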


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('variance', 0.504), ('bennett', 0.491), ('bound', 0.483), ('observed', 0.224), ('range', 0.168), ('bounded', 0.119), ('true', 0.105), ('plugging', 0.1), ('inequality', 0.1), ('apply', 0.071), ('missing', 0.069), ('within', 0.063), ('incorrectly', 0.062), ('overconfident', 0.062), ('plug', 0.062), ('deviations', 0.062), ('invalidating', 0.062), ('magic', 0.062), ('exist', 0.059), ('form', 0.059), ('expected', 0.058), ('chernoff', 0.058), ('clumsy', 0.054), ('safe', 0.054), ('dangerous', 0.052), ('crude', 0.052), ('interval', 0.052), ('wildly', 0.05), ('pr', 0.05), ('applications', 0.049), ('incidentally', 0.047), ('depends', 0.047), ('proper', 0.047), ('kakade', 0.045), ('tighter', 0.045), ('move', 0.043), ('sham', 0.042), ('tight', 0.042), ('construct', 0.04), ('parameter', 0.04), ('moment', 0.039), ('variable', 0.039), ('exponential', 0.038), ('worst', 0.038), ('confidence', 0.037), ('optimizing', 0.037), ('away', 0.037), ('beyond', 0.037), ('share', 0.037), ('computing', 0.036)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 244 hunch net-2007-05-09-The Missing Bound


2 0.18510951 41 hunch net-2005-03-15-The State of Tight Bounds

Introduction: What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement . In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence . The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design . Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now There are several families of bounds, based on how information is used. Testing Bounds . These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound , progressive validation also here and here , train and test bounds , and cross-validation (but see the big open problem ). These tec

3 0.18183425 392 hunch net-2010-03-26-A Variance only Deviation Bound

Introduction: At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable . For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance. The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by

4 0.16690059 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?

Introduction: Steve Smale and I have a debate about goals of learning theory. Steve likes theorems with a dependence on unobservable quantities. For example, if D is a distribution over a space X x [0,1], you can state a theorem about the error rate dependent on the variance, E_{(x,y)~D} (y - E_{y'~D|x}[y'])^2 . I dislike this, because I want to use the theorems to produce code solving learning problems. Since I don’t know (and can’t measure) the variance, a theorem depending on the variance does not help me—I would not know what variance to plug into the learning algorithm. Recast more broadly, this is a debate between “declarative” and “operative” mathematics. A strong example of “declarative” mathematics is “a new kind of science” . Roughly speaking, the goal of this kind of approach seems to be finding a way to explain the observations we make. Examples include “some things are unpredictable”, “a phase transition exists”, etc… “Operative” mathematics helps you make predictions a

5 0.16180462 85 hunch net-2005-06-28-A COLT paper

Introduction: I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.

6 0.15547876 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

7 0.1460541 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

8 0.13013148 163 hunch net-2006-03-12-Online learning or online preservation of learning?

9 0.1181535 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning

10 0.11569314 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.

11 0.095021561 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions

12 0.09353447 170 hunch net-2006-04-06-Bounds greater than 1

13 0.092794403 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

14 0.087616012 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

15 0.082197465 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

16 0.078506649 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

17 0.075574607 289 hunch net-2008-02-17-The Meaning of Confidence

18 0.062598094 259 hunch net-2007-08-19-Choice of Metrics

19 0.062565424 343 hunch net-2009-02-18-Decision by Vetocracy

20 0.060506161 235 hunch net-2007-03-03-All Models of Learning have Flaws


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.12), (1, 0.083), (2, 0.038), (3, -0.035), (4, 0.008), (5, -0.058), (6, 0.085), (7, 0.01), (8, 0.002), (9, -0.044), (10, 0.093), (11, 0.14), (12, 0.049), (13, -0.066), (14, 0.019), (15, -0.078), (16, -0.077), (17, 0.05), (18, -0.105), (19, 0.043), (20, 0.115), (21, 0.113), (22, -0.209), (23, 0.047), (24, -0.035), (25, 0.086), (26, -0.035), (27, 0.037), (28, -0.022), (29, 0.003), (30, -0.066), (31, -0.04), (32, -0.001), (33, 0.014), (34, -0.103), (35, -0.048), (36, -0.067), (37, 0.008), (38, -0.094), (39, 0.021), (40, -0.086), (41, -0.03), (42, -0.066), (43, 0.013), (44, -0.029), (45, -0.053), (46, 0.057), (47, 0.026), (48, -0.02), (49, -0.081)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9871105 244 hunch net-2007-05-09-The Missing Bound


2 0.78029746 85 hunch net-2005-06-28-A COLT paper


3 0.75018805 170 hunch net-2006-04-06-Bounds greater than 1

Introduction: Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1. Background One branch of learning theory focuses on theorems which Assume samples are drawn IID from an unknown distribution D . Fix a set of classifiers Find a high probability bound on the maximum true error rate (with respect to D ) as a function of the empirical error rate on the training set. Many of these bounds become extremely complex and hairy. Current Everyone working on this subject wants “tighter bounds”, however there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space) while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.) One of the da

4 0.73171586 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

Introduction: “Science” has many meanings, but one common meaning is “the scientific method ” which is a principled method for investigating the world using the following steps: Form a hypothesis about the world. Use the hypothesis to make predictions. Run experiments to confirm or disprove the predictions. The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run. Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here : including “reprobleming”, “data set selection”, and probably “overfitting by review”

5 0.72971857 41 hunch net-2005-03-15-The State of Tight Bounds


6 0.72368729 392 hunch net-2010-03-26-A Variance only Deviation Bound

7 0.61401498 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

8 0.58229285 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?

9 0.50812066 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

10 0.50640929 163 hunch net-2006-03-12-Online learning or online preservation of learning?

11 0.45773011 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning

12 0.42399019 26 hunch net-2005-02-21-Problem: Cross Validation

13 0.39848372 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.

14 0.38527545 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions

15 0.35003754 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

16 0.34590942 33 hunch net-2005-02-28-Regularization

17 0.33978313 289 hunch net-2008-02-17-The Meaning of Confidence

18 0.33827898 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

19 0.3359735 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

20 0.3277669 185 hunch net-2006-06-16-Regularization = Robustness


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.237), (3, 0.085), (9, 0.015), (10, 0.036), (27, 0.325), (38, 0.043), (53, 0.022), (55, 0.01), (56, 0.011), (94, 0.084)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.90978795 244 hunch net-2007-05-09-The Missing Bound


2 0.84212327 47 hunch net-2005-03-28-Open Problems for Colt

Introduction: Adam Klivans and Rocco Servedio are looking for open (learning theory) problems for COLT . This is a good idea in the same way that the KDDcup challenge is a good idea: crisp problem definitions that anyone can attack yield solutions that advance science.

3 0.81244349 207 hunch net-2006-09-12-Incentive Compatible Reviewing

Introduction: Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well. Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain. The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentally, these incentives can be both positive and negative.) Setting up a truthful reviewing system is tricky because there is no final reference truth available in any acce

4 0.79891378 289 hunch net-2008-02-17-The Meaning of Confidence

Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a

5 0.79629809 183 hunch net-2006-06-14-Explorations of Exploration

Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1) . Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a . The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E^3 by Satinder Singh and Michael Kearns . Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E^3 and

6 0.79624182 45 hunch net-2005-03-22-Active learning

7 0.7934882 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

8 0.79257673 352 hunch net-2009-05-06-Machine Learning to AI

9 0.79083776 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

10 0.79017764 41 hunch net-2005-03-15-The State of Tight Bounds

11 0.78652668 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

12 0.78651291 252 hunch net-2007-07-01-Watchword: Online Learning

13 0.78565925 258 hunch net-2007-08-12-Exponentiated Gradient

14 0.78426576 9 hunch net-2005-02-01-Watchword: Loss

15 0.78393036 400 hunch net-2010-06-13-The Good News on Exploration and Learning

16 0.78304964 245 hunch net-2007-05-12-Loss Function Semantics

17 0.78266537 67 hunch net-2005-05-06-Don’t mix the solution into the problem

18 0.78213966 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

19 0.78099829 259 hunch net-2007-08-19-Choice of Metrics

20 0.77959698 304 hunch net-2008-06-27-Reviewing Horror Stories