hunch_net hunch_net-2006 hunch_net-2006-213 knowledge-graph by maker-knowledge-mining

213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.


meta info for this blog

Source: html

Introduction: Classical confidence intervals satisfy a theorem of the form: for some data sources D, Pr_{S ~ D}(f(D) > g(S)) > 1-d, where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a Gaussian with an unknown mean” and “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world, where d is the safety parameter. There are a number of results in theoretical machine learning which use confidence intervals.
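The frequentist semantics described above can be checked empirically. A minimal sketch, under the assumption that D is a Bernoulli(0.5) source, f(D) is its mean, and g(S) is a Hoeffding-style lower confidence bound (one concrete choice of g, not the only one):

```python
import math
import random

def hoeffding_lower_bound(sample, delta):
    """g(S): a lower confidence bound on the mean of a [0,1]-valued source,
    derived from Hoeffding's inequality."""
    m = len(sample)
    return sum(sample) / m - math.sqrt(math.log(1 / delta) / (2 * m))

# Empirically check the guarantee Pr_{S ~ D}(f(D) > g(S)) > 1-d for
# D = Bernoulli(0.5), so f(D) = 0.5 (the true mean) and d = 0.05.
random.seed(0)
true_mean, delta, m, trials = 0.5, 0.05, 100, 2000
covered = sum(
    true_mean > hoeffding_lower_bound(
        [float(random.random() < true_mean) for _ in range(m)], delta)
    for _ in range(trials)
)
print(covered / trials)  # coverage over repeated draws; should exceed 1 - d
```

Note the quantifier: the probability is over repeated draws of S, not over the truth of the bound for one fixed observed sample, which is exactly the distinction the post draws.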


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Classical confidence intervals satisfy a theorem of the form: for some data sources D, Pr_{S ~ D}(f(D) > g(S)) > 1-d, where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. [sent-1, score-1.47]

2 The constraints on D can vary between “Independent and identically distributed (IID) samples from a Gaussian with an unknown mean” and “IID samples from an arbitrary distribution D”. [sent-2, score-0.576]

3 There are even some confidence intervals which do not require IID samples. [sent-3, score-1.06]

4 They do not say “with high probability, for my observed sample, the bound holds”. [sent-5, score-0.164]

5 Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. [sent-6, score-1.006]

6 Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. [sent-7, score-0.284]

7 There are a number of results in theoretical machine learning which use confidence intervals. [sent-8, score-0.646]

8 For example, the E^3 algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability. [sent-9, score-1.214]

9 Set Covering Machines minimize a confidence interval upper bound on the true error rate of a learned classifier. [sent-10, score-0.939]

10 The A^2 algorithm uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning. [sent-11, score-1.268]

11 Suppose that we want to generalize these algorithms in a reductive style. [sent-12, score-0.059]

12 The goal would be to train a regressor to predict the output of g(S) for new situations. [sent-13, score-0.109]

13 For example, a good regression prediction of g(S) might allow E^3 to be applied to much larger state spaces. [sent-14, score-0.184]
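The reduction sketched above — train a regressor to predict g(S) for new situations — can be made concrete. In the sketch below, the "situation" x, the Bernoulli(x) data source, and the Hoeffding-style g are all illustrative assumptions, not part of the original proposal:

```python
import math
import random

random.seed(1)

def g(sample, delta=0.05):
    # Hoeffding-style lower confidence bound computed from a sample.
    m = len(sample)
    return sum(sample) / m - math.sqrt(math.log(1 / delta) / (2 * m))

# Hypothetical setup: each "situation" x in [0,1] indexes a Bernoulli(x)
# source; we collect (x, g(S)) training pairs.
data = []
for _ in range(500):
    x = random.random()
    sample = [float(random.random() < x) for _ in range(50)]
    data.append((x, g(sample)))

# Closed-form one-dimensional least squares: a regressor that predicts
# the bound g(S) for unseen situations x.
n = len(data)
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

def predict(x):
    return intercept + slope * x

mae = sum(abs(predict(x) - y) for x, y in data) / n
print(mae)  # small *average* error over the data distribution
```

The regressor is accurate on average, which is the kind of guarantee learning provides; the mismatch discussed next is that average accuracy is weaker than the uniform validity a confidence interval must have.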

14 Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval. [sent-15, score-1.11]

15 It’s difficult to imagine a constructive sampling mechanism. [sent-16, score-0.052]

16 In a large state space, we may never encounter the same state twice, so we cannot form meaningful examples of the form “for this state-action, the correct confidence interval is y”. [sent-17, score-1.281]

17 When we think of successful learning, we typically think of it in an l_1 sense: the expected error rate over the data generating distribution is small. [sent-18, score-0.206]

18 Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds. [sent-19, score-1.395]
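The gap between the two guarantees can be made concrete with toy numbers (all hypothetical): a predictor with tiny expected (l_1) error can still emit an invalid "bound" on a non-negligible fraction of states.

```python
# Hypothetical toy numbers: 1000 states, true lower bound 0.5 everywhere.
states = range(1000)
true_bound = 0.5
# The learned predictor is exact except on 1% of states, where its output
# overshoots the truth -- i.e., the claimed interval fails to hold there.
predicted = {s: (0.9 if s < 10 else 0.5) for s in states}

l1_error = sum(abs(predicted[s] - true_bound) for s in states) / 1000
violations = sum(predicted[s] > true_bound for s in states) / 1000
print(l1_error)    # 0.004 -- an excellent regressor in the l_1 sense
print(violations)  # 0.01  -- yet the "interval" is invalid on 1% of states
```

Nothing in the l_1 guarantee caps the violation rate at d, and in an E^3-style application the bound must hold across all states visited, so even rare failures can break the algorithm's guarantee.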

19 It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems. [sent-21, score-1.381]
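One of the tempting substitutes, quantile regression, can at least be sketched: a minimal stochastic-subgradient version on the pinball (check) loss, with no features and an assumed Gaussian target, purely for illustration.

```python
import random

random.seed(2)

tau = 0.05   # target the 5% quantile, loosely analogous to a d = 0.05 bound
q = 0.0      # scalar quantile estimate
lr = 0.01    # fixed step size

for _ in range(20000):
    y = random.gauss(0.0, 1.0)           # draw from the target distribution
    grad = -tau if y > q else (1 - tau)  # subgradient of the pinball loss
    q -= lr * grad

print(q)  # hovers near the 5% quantile of N(0,1), about -1.64
```

Minimizing the pinball loss recovers the tau-quantile in expectation over the data distribution — again an average-case statement, which is why turning such estimates into valid worst-case intervals remains the open problem the post points at.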

20 Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so. [sent-22, score-0.067]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('confidence', 0.579), ('intervals', 0.481), ('interval', 0.247), ('classical', 0.172), ('mismatch', 0.137), ('iid', 0.12), ('semantics', 0.111), ('state', 0.104), ('tell', 0.096), ('distribution', 0.093), ('high', 0.088), ('arbitrary', 0.085), ('constraints', 0.084), ('mean', 0.08), ('regression', 0.08), ('observed', 0.076), ('samples', 0.074), ('form', 0.07), ('safety', 0.069), ('sample', 0.068), ('theoretical', 0.067), ('uses', 0.066), ('notions', 0.065), ('confuse', 0.065), ('quantile', 0.065), ('safe', 0.065), ('probability', 0.064), ('identically', 0.062), ('plugging', 0.059), ('generalize', 0.059), ('pr', 0.059), ('regressor', 0.059), ('satisfy', 0.059), ('satisfied', 0.057), ('safely', 0.057), ('mdp', 0.057), ('meaningful', 0.057), ('error', 0.057), ('function', 0.057), ('rate', 0.056), ('twice', 0.055), ('vary', 0.054), ('procedure', 0.054), ('holds', 0.052), ('tempting', 0.052), ('constructive', 0.052), ('restated', 0.051), ('encounter', 0.05), ('gaussian', 0.05), ('train', 0.05)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.


2 0.75520468 289 hunch net-2008-02-17-The Meaning of Confidence

Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a

3 0.12814218 41 hunch net-2005-03-15-The State of Tight Bounds

Introduction: What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement . In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence . The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design . Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now There are several families of bounds, based on how information is used. Testing Bounds . These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound , progressive validation also here and here , train and test bounds , and cross-validation (but see the big open problem ). These tec

4 0.11894008 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006

Introduction: Bob Williamson and I are the learning theory PC members at NIPS this year. This is some attempt to state the standards and tests I applied to the papers. I think it is a good idea to talk about this for two reasons: Making community standards a matter of public record seems healthy. It gives us a chance to debate what is and is not the right standard. It might even give us a bit more consistency across the years. It may save us all time. There are a number of papers submitted which just aren’t there yet. Avoiding submitting is the right decision in this case. There are several criteria for judging a paper. All of these were active this year. Some criteria are uncontroversial while others may be so. The paper must have a theorem establishing something new for which it is possible to derive high confidence in the correctness of the results. A surprising number of papers fail this test. This criterion seems essential to the definition of “theory”. Missing theo

5 0.11569314 244 hunch net-2007-05-09-The Missing Bound

Introduction: Sham Kakade points out that we are missing a bound. Suppose we have m samples x drawn IID from some distribution D . Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I , a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). A proof is on page 9 here . If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning,

6 0.11475682 183 hunch net-2006-06-14-Explorations of Exploration

7 0.11352672 309 hunch net-2008-07-10-Interesting papers, ICML 2008

8 0.10591957 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.10536145 282 hunch net-2008-01-06-Research Political Issues

10 0.10088301 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

11 0.096483737 2 hunch net-2005-01-24-Holy grails of machine learning?

12 0.094079375 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

13 0.091605559 19 hunch net-2005-02-14-Clever Methods of Overfitting

14 0.090011582 259 hunch net-2007-08-19-Choice of Metrics

15 0.089794397 169 hunch net-2006-04-05-What is state?

16 0.087212659 5 hunch net-2005-01-26-Watchword: Probability

17 0.086520016 12 hunch net-2005-02-03-Learning Theory, by assumption

18 0.08365202 245 hunch net-2007-05-12-Loss Function Semantics

19 0.081357308 145 hunch net-2005-12-29-Deadline Season

20 0.078140534 351 hunch net-2009-05-02-Wielding a New Abstraction


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.171), (1, 0.129), (2, 0.048), (3, -0.058), (4, 0.005), (5, -0.082), (6, 0.064), (7, 0.054), (8, 0.093), (9, -0.073), (10, 0.144), (11, 0.092), (12, 0.119), (13, -0.053), (14, -0.002), (15, -0.184), (16, -0.262), (17, 0.007), (18, -0.049), (19, 0.024), (20, 0.016), (21, 0.219), (22, -0.014), (23, 0.18), (24, -0.148), (25, 0.096), (26, -0.22), (27, -0.155), (28, -0.173), (29, 0.148), (30, 0.112), (31, 0.033), (32, -0.137), (33, -0.109), (34, 0.063), (35, -0.014), (36, -0.11), (37, -0.127), (38, 0.003), (39, -0.233), (40, 0.18), (41, 0.134), (42, 0.057), (43, -0.137), (44, 0.035), (45, -0.096), (46, -0.035), (47, 0.002), (48, 0.072), (49, -0.018)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98283374 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.


2 0.96401078 289 hunch net-2008-02-17-The Meaning of Confidence


3 0.42264065 169 hunch net-2006-04-05-What is state?

Introduction: In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?” Mathematical State. State is a sufficient stati

4 0.34963098 244 hunch net-2007-05-09-The Missing Bound


5 0.34127006 176 hunch net-2006-05-01-A conversation between Theo and Pat

Introduction: Pat (the practitioner) I need to do multiclass classification and I only have a decision tree. Theo (the theoretician) Use an error correcting output code . Pat Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them. Theo Oh? Is your problem a decision list? Pat No, I don’t think so. Theo Hmm. Are the classes well separated by axis aligned splits? Pat Err, maybe. I’m not sure. Theo Well, if they are, under the IID assumption I can tell you how many samples you need. Pat IID? The data is definitely not IID. Theo Oh dear. Pat Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other. Theo Ok. What do you know about your problem? Pat Not much. My friend just gave me the dataset. Theo Then, no one can help you. Pat (What a fuzzy thinker. Theo keeps jumping t

6 0.33671761 5 hunch net-2005-01-26-Watchword: Probability

7 0.33256018 282 hunch net-2008-01-06-Research Political Issues

8 0.32047549 183 hunch net-2006-06-14-Explorations of Exploration

9 0.29574355 41 hunch net-2005-03-15-The State of Tight Bounds

10 0.28844833 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

11 0.28752461 2 hunch net-2005-01-24-Holy grails of machine learning?

12 0.26416427 76 hunch net-2005-05-29-Bad ideas

13 0.25449055 384 hunch net-2009-12-24-Top graduates this season

14 0.25417745 123 hunch net-2005-10-16-Complexity: It’s all in your head

15 0.25213844 319 hunch net-2008-10-01-NIPS 2008 workshop on ‘Learning over Empirical Hypothesis Spaces’

16 0.25206476 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

17 0.24720906 218 hunch net-2006-11-20-Context and the calculation misperception

18 0.24591367 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

19 0.24487191 26 hunch net-2005-02-21-Problem: Cross Validation

20 0.23840284 170 hunch net-2006-04-06-Bounds greater than 1


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.386), (27, 0.281), (38, 0.039), (53, 0.025), (55, 0.027), (94, 0.091), (95, 0.038)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98303407 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary

Introduction: This problem has been cracked (but not quite completely solved) by Alina , Pradeep , and I . The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D , for all learned binary classifiers c , the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k . Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim

2 0.97060084 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr

same-blog 3 0.93836117 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.


4 0.89596242 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC , “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class ( google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these. Problem What does the ability to classify well imply about performance under these metrics? Past Work Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here ). A paper on “ Learning to Order Things “. Difficulty Several of these may be easy. Some of them may be h

5 0.89517963 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be The Problem The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

6 0.87371969 401 hunch net-2010-06-20-2010 ICML discussion site

7 0.84813696 289 hunch net-2008-02-17-The Meaning of Confidence

8 0.79116774 183 hunch net-2006-06-14-Explorations of Exploration

9 0.71019322 484 hunch net-2013-06-16-Representative Reviewing

10 0.69398338 34 hunch net-2005-03-02-Prior, “Prior” and Bias

11 0.67531234 259 hunch net-2007-08-19-Choice of Metrics

12 0.67480421 67 hunch net-2005-05-06-Don’t mix the solution into the problem

13 0.67278898 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

14 0.67115009 41 hunch net-2005-03-15-The State of Tight Bounds

15 0.67060351 78 hunch net-2005-06-06-Exact Online Learning for Classification

16 0.66602254 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

17 0.66494405 244 hunch net-2007-05-09-The Missing Bound

18 0.66163677 252 hunch net-2007-07-01-Watchword: Online Learning

19 0.66158336 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

20 0.65895194 129 hunch net-2005-11-07-Prediction Competitions