hunch_net hunch_net-2006 hunch_net-2006-218 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it.

Suppose we have a set of events, each described by a vector of features (one event per row; feature 1 is the leftmost column):

0 1 0 1 1
1 0 1 0 1
1 1 0 1 0
0 0 1 1 1
1 1 0 0 1
1 0 0 0 1
0 1 1 1 0

Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7.

There are two issues here. The first is that this is really a training error rate, and (hence) may be an overoptimistic prediction. This is not a very serious issue as long as there are a reasonable number of representative examples. The second issue is more serious: binning discards information. A simple rule (number of 1's less than 3 implies 1, else 0) achieves error rate 0. By binning the data according to only one feature, the potential of achieving error rate 0 is removed.

The reason for binning is often definitional. Many people think of probability as an observed (or observable) rate. For these people, the probabilities of events can only be learned by finding a large number of identical events and then calculating the observed rate. Constructing "identical events" always involves throwing away the unique context of the event. This disposal of information eliminates the possibility of good prediction performance.

There are other definitions of probability which are more appropriate when every event is unique. One thing which makes people uncomfortable about probabilities over unique events is that probabilities are no longer observable; they are only estimatable. This loss of grounding is a price which must be paid for improved performance. Luckily, we can tell if our prediction performance improves on labeled examples.
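The arithmetic above is easy to check in code. A minimal sketch, assuming the 35 bits are meant to be read as seven events of five features each (an assumption about the layout, but one consistent with both the 3/7 figure and the error-rate-0 rule):

```python
# Seven events (rows), five features each; feature 1 is the target.
events = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
labels = [e[0] for e in events]

# Binning rule: partition on feature 2; predict 1 when it is 0, else 0.
bin_preds = [1 if e[1] == 0 else 0 for e in events]
bin_errors = sum(p != y for p, y in zip(bin_preds, labels))

# Full-context rule: fewer than three 1's among features 2-5 implies 1.
count_preds = [1 if sum(e[1:]) < 3 else 0 for e in events]
count_errors = sum(p != y for p, y in zip(count_preds, labels))

print(bin_errors, len(events))   # binning rule: 3 errors out of 7
print(count_errors)              # full-context rule: 0 errors
```

The binning rule wastes features 3-5 entirely, which is exactly the information the second rule needs to reach error rate 0.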
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 218 hunch net-2006-11-20-Context and the calculation misperception
2 0.21466592 163 hunch net-2006-03-12-Online learning or online preservation of learning?
Introduction: In the online learning with experts setting, you observe a set of predictions, make a decision, and then observe the truth. This process repeats indefinitely. In this setting, it is possible to prove theorems of the sort: master algorithm error count <= k * (best predictor error count) + c * log(number of predictors). Is this a statement about learning or about preservation of learning? We did some experiments to analyze the new Binning algorithm which works in this setting. For several UCI datasets, we reprocessed them so that features could be used as predictors and then applied several master algorithms. The first graph confirms that Binning is indeed a better algorithm according to the tightness of the upper bound. Here, “Best” is the performance of the best expert. “V. Bound” is the bound for Vovk’s algorithm (the previous best). “Bound” is the bound for the Binning algorithm. “Binning” is the performance of the Binning algorithm. The Binning algorithm clearly h
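The flavor of master algorithms with this kind of guarantee can be seen in a multiplicative-weights sketch (this is the classic weighted majority scheme, not the Binning algorithm the excerpt describes):

```python
def weighted_majority(expert_preds, truths, eta=0.5):
    """Binary prediction with experts via multiplicative weights.

    expert_preds: list of rounds; each round is a list of 0/1 expert predictions.
    truths: list of 0/1 outcomes, one per round.
    Returns the master's mistake count.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, truths):
        # Predict by weighted vote over the experts.
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(weights) - vote_1
        guess = 1 if vote_1 >= vote_0 else 0
        mistakes += (guess != y)
        # Shrink the weight of every expert that was wrong this round.
        weights = [w * (1 - eta) if p != y else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

The mistake bound for this family has exactly the quoted shape: a constant times the best expert's mistakes plus a term logarithmic in the number of experts.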
3 0.17355987 62 hunch net-2005-04-26-To calibrate or not?
Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
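The "small neighborhood" softening can be made concrete with a binned empirical check; a sketch, where the bin count is a free parameter rather than anything the excerpt specifies:

```python
def calibration_table(preds, outcomes, bins=10):
    """Softened calibration check: group predictions into equal-width bins
    (the 'small neighborhood about each possible p') and compare the mean
    predicted p to the observed frequency of 1's in each bin."""
    table = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, p in enumerate(preds)
               if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if idx:
            mean_p = sum(preds[i] for i in idx) / len(idx)
            freq = sum(outcomes[i] for i in idx) / len(idx)
            table.append((mean_p, freq, len(idx)))
    return table
```

A predictor is empirically calibrated when mean_p and freq roughly agree in every populated bin; note that passing this check says nothing about how informative the predictions are, which is the "confusion of ends" point.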
4 0.1669782 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Π_i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
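A minimal Naive Bayes sketch for binary data along these lines; the add-alpha prior and log-space scoring are standard choices on my part, not details from the excerpt:

```python
import math

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate Pr'(label) and Pr'(feature i = 1 | label) by counting,
    with add-alpha smoothing so no estimated probability is exactly 0."""
    model = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = (len(rows) + alpha) / (len(X) + 2 * alpha)
        cond = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for i in range(len(X[0]))]
        model[label] = (prior, cond)
    return model

def predict(model, x):
    # Sum logs instead of multiplying raw probabilities: long products of
    # small numbers are exactly where the multiplication danger bites.
    scores = {}
    for label, (prior, cond) in model.items():
        scores[label] = math.log(prior) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, cond))
    return max(scores, key=scores.get)
```

With no smoothing (alpha = 0), a single feature value never seen with a label would zero out that label's entire product, regardless of what the other features say.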
5 0.1567149 5 hunch net-2005-01-26-Watchword: Probability
Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
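The two notions translate into code differently; a small sketch with hypothetical helper names:

```python
def bayesian_probability(Y):
    """Degree of belief from the betting experiment: Y is the odds level
    at which a consistent bettor switches from 'for' to 'against'."""
    return Y / (Y + 1.0)

def frequentist_probability(k, n):
    """Rate of occurrence: the event happened k times in n trials."""
    return k / float(n)
```

Even odds (Y = 1) give probability 1/2, and 3-to-1 odds give 3/4; the frequentist version needs repeated trials, which is precisely what unique events lack.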
6 0.15261725 330 hunch net-2008-12-07-A NIPS paper
7 0.13765107 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
8 0.11323771 274 hunch net-2007-11-28-Computational Consequences of Classification
9 0.11245245 78 hunch net-2005-06-06-Exact Online Learning for Classification
10 0.10360167 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?
11 0.10353272 35 hunch net-2005-03-04-The Big O and Constants in Learning
12 0.10105764 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
13 0.10079402 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
14 0.097216822 67 hunch net-2005-05-06-Don’t mix the solution into the problem
15 0.096707523 118 hunch net-2005-10-07-On-line learning of regular decision rules
16 0.09389998 160 hunch net-2006-03-02-Why do people count for learning?
17 0.093401454 217 hunch net-2006-11-06-Data Linkage Problems
18 0.091955602 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning
19 0.090844184 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
20 0.088990331 33 hunch net-2005-02-28-Regularization
simIndex simValue blogId blogTitle
same-blog 1 0.97249675 218 hunch net-2006-11-20-Context and the calculation misperception
2 0.81053519 62 hunch net-2005-04-26-To calibrate or not?
3 0.69107598 5 hunch net-2005-01-26-Watchword: Probability
4 0.67621505 33 hunch net-2005-02-28-Regularization
Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple of viewpoints. Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint. Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) < e(c,S) + f(complexity) where complexity is some measure of the size of a s
5 0.64603573 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
6 0.6007992 118 hunch net-2005-10-07-On-line learning of regular decision rules
7 0.5935142 330 hunch net-2008-12-07-A NIPS paper
8 0.5607776 123 hunch net-2005-10-16-Complexity: It’s all in your head
9 0.55957747 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning
10 0.54738235 160 hunch net-2006-03-02-Why do people count for learning?
11 0.51744509 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
12 0.5166806 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?
13 0.51504636 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
14 0.51016033 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
15 0.50989836 34 hunch net-2005-03-02-Prior, “Prior” and Bias
16 0.5093289 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression
17 0.50295115 70 hunch net-2005-05-12-Math on the Web
18 0.50230509 67 hunch net-2005-05-06-Don’t mix the solution into the problem
19 0.49714035 176 hunch net-2006-05-01-A conversation between Theo and Pat
20 0.49586076 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
simIndex simValue blogId blogTitle
1 0.98041552 186 hunch net-2006-06-24-Online convex optimization at COLT
Introduction: At ICML 2003, Marty Zinkevich proposed the online convex optimization setting and showed that a particular gradient descent algorithm has regret O(T^0.5) with respect to the best predictor where T is the number of rounds. This seems to be a nice model for online learning, and there has been some significant follow-up work. At COLT 2006 Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal presented a modification which takes a Newton step guaranteeing O(log T) regret when the first and second derivatives are bounded. Then they applied these algorithms to portfolio management at ICML 2006 (with Robert Schapire) yielding some very fun graphs.
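A minimal sketch of Zinkevich-style online gradient descent; the step size eta_t proportional to 1/sqrt(t) is the standard choice behind the O(T^0.5) regret rate, and this is an illustrative reconstruction, not the paper's code:

```python
import math

def online_gradient_descent(grads, project, w0, scale=1.0):
    """Each round: play w_t, observe the gradient of the convex loss at
    w_t, take a gradient step, and project back onto the feasible set.

    grads: one callback per round, mapping the played w to a gradient.
    project: projection onto the feasible convex set.
    Returns the sequence of played points.
    """
    w = list(w0)
    played = []
    for t, grad in enumerate(grads, start=1):
        played.append(list(w))
        eta = scale / math.sqrt(t)  # decaying step size, ~1/sqrt(t)
        g = grad(w)
        w = project([wi - eta * gi for wi, gi in zip(w, g)])
    return played
```

On a fixed strongly convex loss this just converges to the minimizer; the interesting regret guarantee is that it stays competitive even when an adversary picks a different convex loss every round.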
same-blog 2 0.91152138 218 hunch net-2006-11-20-Context and the calculation misperception
3 0.7584222 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of
4 0.75671577 325 hunch net-2008-11-10-ICML Reviewing Criteria
Introduction: Michael Littman and Leon Bottou have decided to use a franchise program chair approach to reviewing at ICML this year. I’ll be one of the area chairs, so I wanted to mention a few things if you are thinking about naming me. I take reviewing seriously. That means papers to be reviewed are read, the implications are considered, and decisions are only made after that. I do my best to be fair, and there are zero subjects that I consider categorical rejects. I don’t consider several arguments for rejection-not-on-the-merits reasonable . I am generally interested in papers that (a) analyze new models of machine learning, (b) provide new algorithms, and (c) show that they work empirically on plausibly real problems. If a paper has the trifecta, I’m particularly interested. With 2 out of 3, I might be interested. I often find papers with only one element harder to accept, including papers with just (a). I’m a bit tough. I rarely jump-up-and-down about a paper, because I b
5 0.75210291 351 hunch net-2009-05-02-Wielding a New Abstraction
Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina , Bianca , and I are planning to do at ICML . Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo
6 0.75170648 343 hunch net-2009-02-18-Decision by Vetocracy
7 0.75160784 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?
8 0.75135612 95 hunch net-2005-07-14-What Learning Theory might do
9 0.74917561 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
10 0.74908769 304 hunch net-2008-06-27-Reviewing Horror Stories
11 0.74888664 44 hunch net-2005-03-21-Research Styles in Machine Learning
12 0.74843228 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
13 0.74829364 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
14 0.74675453 259 hunch net-2007-08-19-Choice of Metrics
15 0.74615395 51 hunch net-2005-04-01-The Producer-Consumer Model of Research
16 0.74601144 220 hunch net-2006-11-27-Continuizing Solutions
17 0.74579835 360 hunch net-2009-06-15-In Active Learning, the question changes
18 0.7457872 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
19 0.74575353 258 hunch net-2007-08-12-Exponentiated Gradient
20 0.74453592 252 hunch net-2007-07-01-Watchword: Online Learning