hunch_net hunch_net-2008 hunch_net-2008-289 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a
sentIndex sentText sentNum sentScore
1 For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. [sent-5, score-1.231]
2 This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. [sent-7, score-0.611]
3 Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. [sent-15, score-0.696]
4 The semantics of the classical confidence interval is: the (random) interval contains the (deterministic but unknown) value, with high probability. [sent-16, score-1.762]
5 Classical confidence intervals (as applied in machine learning) typically require that observations are independent. [sent-17, score-1.159]
6 One drawback of concern is that classical confidence intervals break down rapidly when conditioning on information. [sent-19, score-1.337]
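The classical-interval sentences above describe an interval, built from observations, that contains a hidden deterministic value with high probability. A minimal sketch for a test-set error rate, assuming IID 0/1 errors and using Hoeffding's inequality (the function name, the choice of delta, and the example numbers are illustrative, not from the post):

```python
import math

def hoeffding_interval(errors, n, delta=0.05):
    """Classical (distribution-free) confidence interval for a true error rate.

    Assumes the n observed err-or-not outcomes are IID.  By Hoeffding's
    inequality, the (random) interval below contains the (deterministic but
    unknown) true error rate with probability at least 1 - delta.
    """
    p_hat = errors / n
    width = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - width), min(1.0, p_hat + width)

# 12 errors observed on a test set of 1000 examples.
print(hoeffding_interval(12, 1000))  # roughly (0.0, 0.055)
```

The guarantee is over the random draw of the test set, not over the unknown error rate, matching the semantics stated above.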
7 If you have a prior distribution over the way the world creates observations, then you can use Bayes law to construct a posterior distribution over the way the world creates observations. [sent-22, score-0.737]
8 With respect to this posterior distribution, you construct an interval containing the truth with high probability. [sent-23, score-0.695]
9 The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”. [sent-24, score-1.89]
10 Unlike classical confidence intervals, it’s easy to have a statement conditioned on features. [sent-26, score-0.815]
11 My principal source of uneasiness with respect to Bayesian confidence intervals is the “If the world is drawn from the prior” clause—I believe it is difficult to know and specify a correct prior distribution. [sent-29, score-1.309]
12 Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior. [sent-30, score-1.054]
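The Bayesian sentences above construct an interval from a posterior. A minimal sketch for a Bernoulli error rate with a Beta prior, using scipy.stats; the prior parameters and delta are illustrative assumptions, not from the post:

```python
from scipy.stats import beta

def bayesian_interval(errors, n, delta=0.05, prior_a=1.0, prior_b=1.0):
    """Bayesian confidence (credible) interval for a Bernoulli error rate.

    Prior: Beta(prior_a, prior_b) over the unknown rate.  After observing
    `errors` errors in `n` trials, the posterior is
    Beta(prior_a + errors, prior_b + n - errors).  The returned interval
    contains the truth with probability 1 - delta *with respect to this
    posterior*, i.e. under the assumption that the world was drawn from
    the prior.
    """
    a = prior_a + errors
    b = prior_b + n - errors
    return beta.ppf(delta / 2, a, b), beta.ppf(1 - delta / 2, a, b)

print(bayesian_interval(12, 1000))  # roughly (0.007, 0.021)
```

Conditioning on features just means maintaining a posterior per feature value, which is why such statements are easy to condition; the cost is the "if the world is drawn from the prior" clause noted above.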
13 The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a gaussian”. [sent-33, score-1.203]
14 Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly with finite sample sizes. [sent-34, score-1.262]
15 A classical confidence interval for the error rate is [0, log(1/d)/n] where n is the size of the test set and the interval contains the truth with probability at least 1-d. [sent-36, score-1.884]
16 For asymptotic intervals you get [0,0] which is bogus in all applications I’ve encountered. [sent-37, score-0.664]
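The asymptotic-interval sentences above can be reproduced directly: with zero observed errors the normal-approximation interval collapses to [0, 0], while the classical interval is [0, log(1/d)/n]. A sketch with illustrative n and delta:

```python
import math

def asymptotic_interval(errors, n, z=1.96):
    """Normal-approximation interval: sample mean +/- z * sample std / sqrt(n).

    With zero observed errors the sample variance is zero, so the interval
    collapses to the bogus [0, 0]."""
    p_hat = errors / n
    half = z * math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

def classical_zero_error_interval(n, delta=0.05):
    """Classical interval when no errors are observed: [0, log(1/delta)/n],
    containing the true error rate with probability at least 1 - delta."""
    return 0.0, math.log(1 / delta) / n

print(asymptotic_interval(0, 100))          # (0.0, 0.0)
print(classical_zero_error_interval(100))   # (0.0, ~0.03)
```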
17 The essential idea is that we cease to make intervals about the world, and instead make intervals around our predictions of the world. [sent-40, score-1.187]
18 A basic question is: can this notion of internal confidence guide other forms of exploration? [sent-44, score-0.641]
19 In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule with the property that it contains the true observation a large fraction of the time. [sent-47, score-1.08]
20 This approach has yet to catch on, but it is interesting because it provides a feature dependent confidence interval without making strong assumptions about the world. [sent-48, score-0.927]
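The last sentences above describe intervals around predictions produced by an adaptive rule, containing the true observation a large fraction of the time. The post does not give a construction; a split-conformal-style sketch is one concrete instance under an exchangeability assumption (the calibration data and names here are made up for illustration, not the construction the post has in mind):

```python
import numpy as np

def conformal_interval(calibration_residuals, new_prediction, delta=0.1):
    """Interval around a prediction, calibrated so that it covers the true
    observation roughly a 1 - delta fraction of the time, assuming the
    calibration examples and the new example are exchangeable."""
    q = np.quantile(np.asarray(calibration_residuals), 1 - delta)
    return new_prediction - q, new_prediction + q

# Toy usage: |y - prediction| residuals on 200 held-out calibration examples.
rng = np.random.default_rng(0)
calibration_residuals = np.abs(rng.normal(size=200))
print(conformal_interval(calibration_residuals, new_prediction=2.0))
```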
wordName wordTfidf (topN-words)
[('confidence', 0.528), ('intervals', 0.518), ('interval', 0.399), ('classical', 0.216), ('world', 0.149), ('probability', 0.138), ('contains', 0.113), ('observations', 0.113), ('asymptotic', 0.111), ('truth', 0.081), ('construct', 0.077), ('prior', 0.072), ('distribution', 0.067), ('bayesian', 0.066), ('semantics', 0.06), ('label', 0.059), ('guide', 0.058), ('posterior', 0.056), ('internal', 0.055), ('provided', 0.055), ('creates', 0.05), ('high', 0.047), ('meaning', 0.046), ('error', 0.046), ('event', 0.046), ('unclear', 0.046), ('rate', 0.045), ('given', 0.044), ('mean', 0.043), ('drawn', 0.042), ('predictions', 0.04), ('reported', 0.04), ('conditioning', 0.04), ('cease', 0.04), ('clause', 0.04), ('inadequate', 0.037), ('confident', 0.037), ('observing', 0.037), ('bayesians', 0.037), ('worried', 0.037), ('essential', 0.036), ('easy', 0.036), ('around', 0.035), ('common', 0.035), ('actually', 0.035), ('incorrect', 0.035), ('bogus', 0.035), ('containing', 0.035), ('conditioned', 0.035), ('breakdown', 0.035)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 289 hunch net-2008-02-17-The Meaning of Confidence
Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a
2 0.75520468 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.
Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D , Pr_{S ~ D}(f(D) > g(S)) > 1-d where f is some function of the distribution (such as the mean) and g is some function of the observed sample S . The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D ”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn
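The guarantee form quoted above ("if you reason according to the confidence interval in the future, you are not often wrong") can be checked by simulation. A sketch for one concrete instance of a guarantee of this form, where f(D) is the true mean of a Bernoulli source and g(S) is a Hoeffding upper bound computed from the sample, so the event checked is f(D) <= g(S); the parameters are illustrative:

```python
import math
import random

def hoeffding_upper(sample_mean, n, delta):
    """g(S): sample mean plus a one-sided Hoeffding slack."""
    return sample_mean + math.sqrt(math.log(1 / delta) / (2 * n))

# Estimate Pr_{S ~ D}( f(D) <= g(S) ) for f(D) = true mean; should be >= 1 - delta.
true_mean, n, delta, trials = 0.3, 50, 0.05, 10000
covered = 0
for _ in range(trials):
    sample_mean = sum(random.random() < true_mean for _ in range(n)) / n
    if true_mean <= hoeffding_upper(sample_mean, n, delta):
        covered += 1
print(covered / trials)  # typically well above 0.95
```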
3 0.13572675 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
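The betting-odds definition above reduces to simple arithmetic; a tiny sketch (the function name is just for illustration):

```python
def odds_to_probability(y):
    """Convert 'Y to 1' odds, at which a consistent bettor switches from
    'for' to 'against', into a Bayesian degree of belief: Y / (Y + 1)."""
    return y / (y + 1.0)

print(odds_to_probability(3.0))  # 3-to-1 odds -> 0.75
```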
4 0.13271001 145 hunch net-2005-12-29-Deadline Season
Introduction: Many different paper deadlines are coming up soon so I made a little reference table. Out of curiosity, I also computed the interval between submission deadline and conference.
Conference   Location      Date           Deadline                Interval (days)
COLT         Pittsburgh    June 22-25     January 21              152
ICML         Pittsburgh    June 26-28     January 30/February 6   140
UAI          MIT           July 13-16     March 9/March 16        119
AAAI         Boston        July 16-20     February 16/21          145
KDD          Philadelphia  August 23-26   March 3/March 10        166
It looks like the northeastern US is the big winner as far as location this year.
5 0.12937094 309 hunch net-2008-07-10-Interesting papers, ICML 2008
Introduction: Here are some papers from ICML 2008 that I found interesting. Risi Kondor and Karsten Borgwardt , The Skew Spectrum of Graphs . This paper is about a new family of functions on graphs which is invariant under node label permutation. They show that these quantities appear to yield good features for learning. Sanjoy Dasgupta and Daniel Hsu . Hierarchical sampling for active learning. This is the first published practical consistent active learning algorithm. The abstract is also pretty impressive. Lihong Li , Michael Littman , and Thomas Walsh Knows What It Knows: A Framework For Self-Aware Learning. This is an attempt to create learning algorithms that know when they err, (other work includes Vovk ). It’s not yet clear to me what the right model for feature-dependent confidence intervals is. Novi Quadrianto , Alex Smola , TIberio Caetano , and Quoc Viet Le Estimating Labels from Label Proportions . This is an example of learning in a speciali
6 0.11081854 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning
7 0.10969181 282 hunch net-2008-01-06-Research Political Issues
8 0.10173588 235 hunch net-2007-03-03-All Models of Learning have Flaws
9 0.10092787 34 hunch net-2005-03-02-Prior, “Prior” and Bias
10 0.1008238 2 hunch net-2005-01-24-Holy grails of machine learning?
11 0.09917599 165 hunch net-2006-03-23-The Approximation Argument
12 0.095200151 41 hunch net-2005-03-15-The State of Tight Bounds
13 0.087082326 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006
14 0.085758872 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
15 0.085421138 19 hunch net-2005-02-14-Clever Methods of Overfitting
16 0.08239194 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
17 0.079603791 259 hunch net-2007-08-19-Choice of Metrics
18 0.075574607 244 hunch net-2007-05-09-The Missing Bound
19 0.071309336 218 hunch net-2006-11-20-Context and the calculation misperception
20 0.070610672 258 hunch net-2007-08-12-Exponentiated Gradient
topicId topicWeight
[(0, 0.162), (1, 0.112), (2, 0.032), (3, -0.043), (4, -0.001), (5, -0.085), (6, 0.038), (7, 0.057), (8, 0.125), (9, -0.07), (10, 0.142), (11, 0.075), (12, 0.137), (13, -0.065), (14, 0.018), (15, -0.177), (16, -0.274), (17, -0.007), (18, -0.029), (19, 0.035), (20, -0.012), (21, 0.21), (22, 0.035), (23, 0.176), (24, -0.147), (25, 0.099), (26, -0.236), (27, -0.16), (28, -0.176), (29, 0.172), (30, 0.105), (31, 0.042), (32, -0.113), (33, -0.094), (34, 0.059), (35, -0.012), (36, -0.138), (37, -0.126), (38, -0.022), (39, -0.191), (40, 0.185), (41, 0.134), (42, 0.021), (43, -0.136), (44, 0.054), (45, -0.084), (46, -0.025), (47, -0.027), (48, 0.072), (49, 0.021)]
simIndex simValue blogId blogTitle
same-blog 1 0.97921413 289 hunch net-2008-02-17-The Meaning of Confidence
Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a
2 0.96323806 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.
Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D , Pr_{S ~ D}(f(D) > g(S)) > 1-d where f is some function of the distribution (such as the mean) and g is some function of the observed sample S . The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D ”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn
3 0.40523297 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
4 0.39286894 169 hunch net-2006-04-05-What is state?
Introduction: In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?” Mathematical State. State is a sufficient stati
5 0.34979984 282 hunch net-2008-01-06-Research Political Issues
Introduction: I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the vice-versa). It’s primary election time in the United States, so the topic is timely, although the issues are not. There are several policy decisions which substantially effect development of science and technology in the US. Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the k-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” i
6 0.3310689 176 hunch net-2006-05-01-A conversation between Theo and Pat
7 0.30203941 244 hunch net-2007-05-09-The Missing Bound
8 0.2979846 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
9 0.29045227 183 hunch net-2006-06-14-Explorations of Exploration
10 0.28685227 123 hunch net-2005-10-16-Complexity: It’s all in your head
11 0.27376974 34 hunch net-2005-03-02-Prior, “Prior” and Bias
12 0.26201004 76 hunch net-2005-05-29-Bad ideas
13 0.2486888 39 hunch net-2005-03-10-Breaking Abstractions
14 0.24765207 2 hunch net-2005-01-24-Holy grails of machine learning?
15 0.24558003 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?
16 0.24481861 218 hunch net-2006-11-20-Context and the calculation misperception
17 0.2391261 41 hunch net-2005-03-15-The State of Tight Bounds
18 0.23824753 118 hunch net-2005-10-07-On-line learning of regular decision rules
19 0.23788375 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets
20 0.23607914 309 hunch net-2008-07-10-Interesting papers, ICML 2008
topicId topicWeight
[(3, 0.216), (8, 0.018), (27, 0.258), (31, 0.012), (38, 0.064), (46, 0.018), (48, 0.018), (53, 0.037), (55, 0.07), (77, 0.029), (94, 0.062), (95, 0.043)]
simIndex simValue blogId blogTitle
1 0.96181965 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.
Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D , Pr_{S ~ D}(f(D) > g(S)) > 1-d where f is some function of the distribution (such as the mean) and g is some function of the observed sample S . The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D ”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn
2 0.95574278 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be The Problem The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
3 0.95258737 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr
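The hashing trick and the birthday-paradox collision estimate described above can be sketched in a few lines (the bin count and tokens are illustrative; Python's built-in hash stands in for the fixed hash function a real system such as Vowpal Wabbit would use):

```python
def hashed_feature_counts(tokens, n_bins=2**18):
    """Map each token directly to an index with a hash function instead of
    keeping a feature dictionary, accepting occasional collisions."""
    counts = {}
    for tok in tokens:
        idx = hash(tok) % n_bins  # illustrative; a real system uses a fixed hash
        counts[idx] = counts.get(idx, 0) + 1
    return counts

print(hashed_feature_counts("the central trick is converting a word into a number".split()))

# Birthday-paradox estimate: with m distinct tokens and n bins, expect about
# m*(m-1)/(2*n) colliding pairs, which becomes likely once m > n^0.5.
m, n = 10**5, 2**18
print(m * (m - 1) / (2 * n))  # ~19000 expected colliding pairs
```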
4 0.94528598 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
Introduction: This problem has been cracked (but not quite completely solved) by Alina , Pradeep , and I . The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D , for all learned binary classifiers c , the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k . Restated: reg_multiclass(D, Filter_tree_test(c)) <= log_2(k) * reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim
same-blog 5 0.9374575 289 hunch net-2008-02-17-The Meaning of Confidence
Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a
6 0.93619877 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
7 0.90155685 183 hunch net-2006-06-14-Explorations of Exploration
8 0.84431052 484 hunch net-2013-06-16-Representative Reviewing
9 0.83506936 223 hunch net-2006-12-06-The Spam Problem
10 0.83292115 259 hunch net-2007-08-19-Choice of Metrics
11 0.83221841 34 hunch net-2005-03-02-Prior, “Prior” and Bias
12 0.81938654 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
13 0.81223422 401 hunch net-2010-06-20-2010 ICML discussion site
14 0.810543 41 hunch net-2005-03-15-The State of Tight Bounds
15 0.80753821 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?
16 0.80683029 194 hunch net-2006-07-11-New Models
17 0.80603987 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
18 0.80598551 351 hunch net-2009-05-02-Wielding a New Abstraction
19 0.79911816 67 hunch net-2005-05-06-Don’t mix the solution into the problem
20 0.79827237 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification