hunch_net hunch_net-2006 hunch_net-2006-157 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product over features i of Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it.
sentIndex sentText sentNum sentScore
1 This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. [sent-1, score-0.189]
2 Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. [sent-3, score-0.54]
3 Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. [sent-4, score-0.434]
4 When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. [sent-6, score-0.285]
5 Predictions are made according to the label maximizing: Pr’(label) * Product over features i of Pr’(feature i | label). (The Pr’ notation indicates these are estimated values.) [sent-7, score-0.567]
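A minimal sketch of the method just described (my addition, not from the original post; the binary-feature setup, the +1/+2 Laplace-style prior, and all names are illustrative assumptions): estimate Pr’(label) and Pr’(feature i | label) by counting, then predict the label maximizing the product.

    # Illustrative Naive Bayes for binary features: predict the label maximizing
    # Pr'(label) * product over features i of Pr'(feature i | label).
    # The +1/+2 terms are a weak Laplace-style prior, assumed for concreteness.
    import numpy as np

    def fit_naive_bayes(X, y):
        # X: (m, n) array of 0/1 features, y: (m,) array of 0/1 labels
        params = {}
        for label in (0, 1):
            rows = X[y == label]
            p_label = (rows.shape[0] + 1.0) / (X.shape[0] + 2.0)          # Pr'(label)
            p_feature = (rows.sum(axis=0) + 1.0) / (rows.shape[0] + 2.0)  # Pr'(feature i = 1 | label)
            params[label] = (p_label, p_feature)
        return params

    def predict(params, x):
        scores = {}
        for label, (p_label, p_feature) in params.items():
            # multiply the learned probabilities together
            factors = np.where(x == 1, p_feature, 1.0 - p_feature)
            scores[label] = p_label * factors.prod()
        return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 10))
    y = X[:, 0]                      # a toy target: label equals feature 0
    params = fit_naive_bayes(X, y)
    print(predict(params, X[0]), "true label:", y[0])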
6 There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. [sent-8, score-0.949]
7 When trying to estimate a coin with bias Pr(feature i | label), after observing m IID samples, the estimate is accurate to (at most) c/m for some constant c. [sent-11, score-0.634]
8 Given this observation, we should expect the estimates Pr’ to differ by c/m or more when the prior on the sample counts is weak. [sent-14, score-0.437]
9 The problem to notice is that errors of c/m can quickly accumulate. [sent-15, score-0.154]
10 The final product in the Naive Bayes classifier is n-way linear in the error terms, where n is the number of features. [sent-16, score-1.157]
11-14 If every feature's true value happens to be v and we happen to have a 1/2 + 1/n^0.5 fraction of the feature estimates too large and a 1/2 – 1/n^0.5 fraction too small (as might happen with a reasonable chance), the value of the product might be overestimated by roughly (v + c/m)^(n/2 + n^0.5) * (v – c/m)^(n/2 – n^0.5) – v^n, which is on the order of v^(n-1) * c * n^0.5 / m when c/m is small compared to v. This suggests problems must arise when the number of features n is greater than the number of samples squared, n > m^2. [sent-17, score-0.293] [sent-18, score-0.416] [sent-19, score-0.649] [sent-22, score-0.332]
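To make the accumulation concrete, here is a small simulation (my addition, not from the post; the particular values of v, m, and c are arbitrary assumptions): each of n factors with true value v is estimated with a per-factor error of about c/m, and the product is compared to the true v^n as n grows.

    # Illustrative simulation: small per-factor errors compound in the product.
    # Each factor's true value is v; each estimate is off by +/- eps with eps of
    # order c/m (c = 1 here). Work in log space to avoid underflow of v**n.
    import numpy as np

    rng = np.random.default_rng(1)
    v, m = 0.5, 100
    eps = 1.0 / m                                        # assumed per-factor estimation error

    for n in (10, 100, 1000, 10000, 100000):
        signs = rng.choice([-1.0, 1.0], size=n)          # random over/under estimates
        log_true = n * np.log(v)
        log_estimated = np.sum(np.log(v + signs * eps))
        ratio = np.exp(log_estimated - log_true)         # estimated product / true product
        print(f"n = {n:6d}   estimated/true product ratio = {ratio:.4f}")
    # The ratio stays near 1 for small n but drifts well away from 1 as n grows
    # toward m**2 and beyond, in line with the n > m^2 threshold sketched above.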
15 This can actually happen in the text classification settings where Naive Bayes is often applied. [sent-23, score-0.174]
16 All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. [sent-24, score-0.63]
17 When these aren’t correct, as is often true in practice, the estimation errors can be systematic rather than stochastic, implying much more brittle behavior. [sent-25, score-0.412]
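A tiny illustration of that systematic failure mode (my addition, with made-up numbers): duplicate one informative feature k times, so the independence assumption is violated in the same direction every time, and the multiplied evidence becomes badly overconfident even with perfectly known probabilities.

    # Illustrative sketch: systematic violation of conditional independence.
    # One feature with Pr(feature=1 | label=1) = 0.6 and Pr(feature=1 | label=0) = 0.4
    # is accidentally included k times; Naive Bayes multiplies the same evidence
    # k times and the posterior is driven toward certainty with no new information.
    def nb_posterior(x, k, p1=0.6, p0=0.4, prior=0.5):
        like1 = (p1 if x == 1 else 1.0 - p1) ** k
        like0 = (p0 if x == 1 else 1.0 - p0) ** k
        return prior * like1 / (prior * like1 + prior * like0)

    for k in (1, 5, 20, 50):
        print(f"k = {k:3d} copies of the same observed feature -> Pr'(label=1) = {nb_posterior(1, k):.4f}")
    # k = 1 gives the correct posterior 0.6; large k gives ~1.0, a systematic
    # (not merely noisy) error caused by the false independence assumption.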
18 In all of the above, note that we used Naive Bayes as a simple example—this brittleness can be found in a number of other common prediction systems. [sent-26, score-0.627]
19 (a) The process of integrating over a posterior rather than taking the maximum likelihood element of a posterior tends to reduce the sampling effects. [sent-30, score-0.268]
20 (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong and design softer priors which better fit reasonable beliefs. [sent-31, score-0.523]
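For remedy (a) above, here is a minimal sketch for a single Bernoulli probability (my addition; the Beta(1,1) prior is an illustrative assumption): integrating the prediction over the posterior yields the posterior mean (k+1)/(m+2) rather than the maximum likelihood k/m, which keeps every factor strictly inside (0,1) and damps the sampling noise that the product amplifies.

    # Illustrative comparison: maximum likelihood estimate vs. the estimate
    # obtained by integrating the predictive probability over a Beta posterior.
    def ml_estimate(k, m):
        return k / m

    def posterior_predictive(k, m, a=1.0, b=1.0):
        # Beta(a, b) prior; Pr(next = 1 | data) = E[theta | data] = (k + a) / (m + a + b)
        return (k + a) / (m + a + b)

    for k, m in [(0, 3), (3, 3), (2, 10), (45, 100)]:
        print(f"k = {k:3d}, m = {m:4d}:  ML = {ml_estimate(k, m):.3f}   "
              f"integrated = {posterior_predictive(k, m):.3f}")
    # The ML estimate can hit exactly 0 or 1 after a few samples, which a product
    # of many such factors can never recover from; the integrated estimate cannot.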
wordName wordTfidf (topN-words)
[('naive', 0.375), ('pr', 0.344), ('bayes', 0.315), ('estimate', 0.212), ('counts', 0.179), ('label', 0.176), ('product', 0.172), ('probabilities', 0.166), ('conditional', 0.162), ('brittleness', 0.161), ('independences', 0.161), ('classifier', 0.132), ('feature', 0.122), ('estimated', 0.119), ('correct', 0.112), ('happen', 0.111), ('posterior', 0.101), ('errors', 0.099), ('estimation', 0.099), ('sample', 0.098), ('prior', 0.097), ('features', 0.091), ('true', 0.091), ('number', 0.085), ('fraction', 0.082), ('bias', 0.078), ('final', 0.078), ('maximizing', 0.072), ('overestimated', 0.072), ('softer', 0.072), ('samples', 0.071), ('according', 0.067), ('prediction', 0.066), ('coin', 0.066), ('likelihood', 0.066), ('observing', 0.066), ('excessively', 0.066), ('actually', 0.063), ('brittle', 0.063), ('estimates', 0.063), ('encoded', 0.063), ('indicates', 0.06), ('systematic', 0.06), ('formula', 0.057), ('design', 0.057), ('strong', 0.057), ('priors', 0.055), ('notice', 0.055), ('producing', 0.054), ('notation', 0.054)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
2 0.22651605 160 hunch net-2006-03-02-Why do people count for learning?
Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B, C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N * N(C)/N] = N(A,B,C) N / [N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size space A x B x C , but it (nat
3 0.17322038 12 hunch net-2005-02-03-Learning Theory, by assumption
Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense ), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases. No assumptions Online learning There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms. Universal Learning Using a “bias” of 2^(- description length of turing machine) in learning is equivalent to all other computable biases up to some constant. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.
4 0.1669782 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 | 1 0 1 0 1 | 1 1 0 1 0 | 0 0 1 1 1 | 1 1 0 0 1 | 1 0 0 0 1 | 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
5 0.1517214 165 hunch net-2006-03-23-The Approximation Argument
Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
6 0.14827783 235 hunch net-2007-03-03-All Models of Learning have Flaws
7 0.13150856 347 hunch net-2009-03-26-Machine Learning is too easy
8 0.11741754 43 hunch net-2005-03-18-Binomial Weighting
9 0.1151427 62 hunch net-2005-04-26-To calibrate or not?
10 0.11149916 136 hunch net-2005-12-07-Is the Google way the way for machine learning?
11 0.11008079 274 hunch net-2007-11-28-Computational Consequences of Classification
12 0.10853373 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
13 0.10802893 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
14 0.097924292 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning
15 0.097199038 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
16 0.096875921 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
17 0.095278375 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
18 0.094758727 351 hunch net-2009-05-02-Wielding a New Abstraction
19 0.094087332 183 hunch net-2006-06-14-Explorations of Exploration
20 0.092174686 237 hunch net-2007-04-02-Contextual Scaling
topicId topicWeight
[(0, 0.201), (1, 0.136), (2, 0.039), (3, -0.028), (4, -0.005), (5, -0.019), (6, 0.025), (7, 0.062), (8, 0.088), (9, -0.072), (10, -0.05), (11, -0.015), (12, 0.092), (13, -0.049), (14, 0.039), (15, -0.127), (16, -0.168), (17, 0.007), (18, 0.022), (19, -0.062), (20, -0.029), (21, 0.005), (22, 0.168), (23, -0.021), (24, -0.0), (25, -0.01), (26, 0.059), (27, 0.039), (28, 0.004), (29, -0.046), (30, 0.041), (31, -0.067), (32, -0.07), (33, 0.025), (34, 0.037), (35, 0.005), (36, -0.056), (37, 0.078), (38, 0.074), (39, 0.057), (40, 0.031), (41, -0.065), (42, -0.0), (43, -0.022), (44, -0.047), (45, -0.012), (46, -0.011), (47, 0.062), (48, 0.042), (49, 0.022)]
simIndex simValue blogId blogTitle
same-blog 1 0.97719419 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
2 0.78758907 160 hunch net-2006-03-02-Why do people count for learning?
3 0.7126655 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
Introduction: I have recently completed a 500+ page-book on MDL , the first comprehensive overview of the field (yes, this is a sneak advertisement ). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron , MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin
4 0.6693182 7 hunch net-2005-01-31-Watchword: Assumption
Introduction: “Assumption” is another word to be careful with in machine learning because it is used in several ways. Assumption = Bias There are several ways to see that some form of ‘bias’ (= preferring of one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “ no free lunch ” theorems. This is a usage specialized to learning which is particularly common when talking about priors for Bayesian Learning. Assumption = “if” of a theorem The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage. Assumption = Axiom The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer. One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion so if no
5 0.66230887 43 hunch net-2005-03-18-Binomial Weighting
Introduction: Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x , make a prediction y’ (possibly based on the classifiers c ), and then see the correct label y . When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log_2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated. Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual clas
6 0.66160721 218 hunch net-2006-11-20-Context and the calculation misperception
7 0.65888774 34 hunch net-2005-03-02-Prior, “Prior” and Bias
8 0.65658325 62 hunch net-2005-04-26-To calibrate or not?
9 0.62574571 57 hunch net-2005-04-16-Which Assumptions are Reasonable?
10 0.61767298 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets
11 0.61421955 165 hunch net-2006-03-23-The Approximation Argument
12 0.58702475 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?
13 0.58360994 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
14 0.57840437 12 hunch net-2005-02-03-Learning Theory, by assumption
15 0.5695188 217 hunch net-2006-11-06-Data Linkage Problems
16 0.53769135 5 hunch net-2005-01-26-Watchword: Probability
17 0.52143693 185 hunch net-2006-06-16-Regularization = Robustness
18 0.5138849 123 hunch net-2005-10-16-Complexity: It’s all in your head
19 0.50658327 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
20 0.50335437 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning
topicId topicWeight
[(9, 0.354), (27, 0.24), (53, 0.053), (55, 0.067), (77, 0.049), (92, 0.029), (94, 0.075), (95, 0.035)]
simIndex simValue blogId blogTitle
1 0.94851708 399 hunch net-2010-05-20-Google Predict
Introduction: Slashdot points out Google Predict . I’m not privy to the details, but this has the potential to be extremely useful, as in many applications simply having an easy mechanism to apply existing learning algorithms can be extremely helpful. This differs goalwise from MLcomp —instead of public comparisons for research purposes, it’s about private utilization of good existing algorithms. It also differs infrastructurally, since a system designed to do this is much less awkward than using Amazon’s cloud computing. The latter implies that datasets several orders of magnitude larger can be handled up to limits imposed by network and storage.
2 0.91466737 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009
Introduction: Here’s a list of papers that I found interesting at ICML / COLT / UAI in 2009. Elad Hazan and Comandur Seshadhri Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situation. Hal Daume , Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) make a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective. There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi , Claudio Gentile , and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thoma
same-blog 3 0.9011355 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
4 0.88223672 330 hunch net-2008-12-07-A NIPS paper
Introduction: I’m skipping NIPS this year in favor of Ada , but I wanted to point out this paper by Andriy Mnih and Geoff Hinton . The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches. I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression. There are a couple workarounds for this computational bug: Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application. Avoid. Y
5 0.8713882 490 hunch net-2013-11-09-Graduates and Postdocs
Introduction: Several strong graduates are on the job market this year. Alekh Agarwal made the most scalable public learning algorithm as an intern two years ago. He has a deep and broad understanding of optimization and learning as well as the ability and will to make things happen programming-wise. I’ve been privileged to have Alekh visiting me in NY where he will be sorely missed. John Duchi created Adagrad which is a commonly helpful improvement over online gradient descent that is seeing wide adoption, including in Vowpal Wabbit . He has a similarly deep and broad understanding of optimization and learning with significant industry experience at Google . Alekh and John have often coauthored together. Stephane Ross visited me a year ago over the summer, implementing many new algorithms and working out the first scale free online update rule which is now the default in Vowpal Wabbit. Stephane is not on the market—Google robbed the cradle successfully I’m sure that
6 0.84512645 174 hunch net-2006-04-27-Conferences, Workshops, and Tutorials
7 0.6649633 8 hunch net-2005-02-01-NIPS: Online Bayes
8 0.66078788 403 hunch net-2010-07-18-ICML & COLT 2010
9 0.63846892 160 hunch net-2006-03-02-Why do people count for learning?
10 0.6354261 360 hunch net-2009-06-15-In Active Learning, the question changes
11 0.63520277 258 hunch net-2007-08-12-Exponentiated Gradient
12 0.62723815 325 hunch net-2008-11-10-ICML Reviewing Criteria
13 0.62698066 309 hunch net-2008-07-10-Interesting papers, ICML 2008
14 0.62266457 220 hunch net-2006-11-27-Continuizing Solutions
15 0.62254173 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
16 0.62206733 388 hunch net-2010-01-24-Specializations of the Master Problem
17 0.61962491 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
18 0.61506557 259 hunch net-2007-08-19-Choice of Metrics
19 0.61444539 345 hunch net-2009-03-08-Prediction Science
20 0.61061209 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class