hunch_net hunch_net-2006 hunch_net-2006-160 knowledge-graph by maker-knowledge-mining

160 hunch net-2006-03-02-Why do people count for learning?


meta info for this blog

Source: html

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C). What I don’t understand is the mechanism commonly used to estimate P(A | B,C). If we let N(A,B,C) be the number of instances of A,B,C, then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N × N(C)/N] = N(A,B,C) · N / [N(B) · N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (nat
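The relative-frequency estimate above can be sketched in a few lines (a hypothetical illustration, not code from the post; the function and variable names are mine):

```python
from collections import Counter

def frequency_estimate(samples):
    """Estimate P'(a | b, c) = N(a,b,c) * N / (N(b) * N(c)) from observed
    (a, b, c) tuples, using the factorization P(B, C) = P(B) P(C) that the
    Bayes net above implies."""
    n = len(samples)
    n_abc = Counter(samples)                 # N(a, b, c)
    n_b = Counter(b for _, b, _ in samples)  # N(b)
    n_c = Counter(c for _, _, c in samples)  # N(c)
    return {(a, b, c): count * n / (n_b[b] * n_c[c])
            for (a, b, c), count in n_abc.items()}
```

With few samples relative to the size of A x B x C, most cells never get counted at all, which is exactly the failure mode the post worries about.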


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This post is about a confusion of mine with respect to many commonly used machine learning algorithms. [sent-1, score-0.198]

2 A simple example where this comes up is Bayes net prediction. [sent-2, score-0.344]

3 A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. [sent-3, score-0.777]

4 The joint probability distribution over the variables is given by a set of conditional probabilities. [sent-4, score-0.251]

5 For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C). What I don’t understand is the mechanism commonly used to estimate P(A | B,C). [sent-5, score-0.697]

6 If we let N(A,B,C) be the number of instances of A,B,C, then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N × N(C)/N] = N(A,B,C) · N / [N(B) · N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. [sent-6, score-0.766]

7 Naturally, in the “big learning” situations where this applies, the precise choice of prior can greatly affect system performance, leading to finicky tuning of various sorts. [sent-8, score-0.581]

8 It’s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict A given B and C . [sent-9, score-0.081]

9 Stepping back a bit, we can think of the estimation of P(A | B, C) as a simple self-contained prediction (sub)problem. [sent-10, score-0.266]

10 Why don’t we use existing technology for doing this prediction? [sent-11, score-0.084]

11 Viewed as a learning algorithm, “counting with a Dirichlet prior” is exactly memorizing the training set and then predicting according to either (precisely) matching training set elements or using a default. [sent-12, score-0.892]

12 It’s hard to imagine a more primitive learning algorithm. [sent-13, score-0.081]

13 There seems to be little downside to trying this approach. [sent-14, score-0.094]

14 In low count situations, a general purpose prediction algorithm has a reasonable hope of performing well. [sent-15, score-1.184]

15 In a high count situation, any reasonable general purpose algorithm converges to the same estimate as above. [sent-16, score-1.167]

16 Using a general purpose probabilistic prediction algorithm isn’t a new idea (for example, see page 57), but it appears greatly underutilized. [sent-18, score-0.825]

17 This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. [sent-19, score-0.359]

18 It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception. [sent-20, score-0.931]
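The “counting with a Dirichlet prior” learner described in the summary can be sketched as a lookup table with a smoothed default (a hypothetical sketch assuming add-alpha smoothing over a finite value set; none of these names come from the post):

```python
from collections import Counter

def dirichlet_count_predictor(samples, alpha=1.0, num_values=2):
    """Memorize training counts, then predict P'(a | b, c) by table lookup,
    falling back toward the uniform prior mean 1/num_values for contexts
    (b, c) that were never observed."""
    n_abc = Counter(samples)
    n_bc = Counter((b, c) for _, b, c in samples)

    def predict(a, b, c):
        # Dirichlet(alpha) smoothing: prior pseudo-counts plus observed counts.
        return (n_abc[(a, b, c)] + alpha) / (n_bc[(b, c)] + alpha * num_values)

    return predict
```

A quick check exercises both claims from the summary: in a low count situation the prior dominates (an unseen context predicts the default 1/num_values), while in a high count situation the estimate converges to the observed relative frequency, the same limit any reasonable general purpose algorithm reaches.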


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('estimate', 0.24), ('purpose', 0.24), ('net', 0.235), ('count', 0.216), ('prior', 0.183), ('bayes', 0.167), ('reasonable', 0.167), ('dirichlet', 0.163), ('prediction', 0.157), ('translation', 0.157), ('joint', 0.144), ('according', 0.143), ('default', 0.129), ('situations', 0.122), ('vision', 0.122), ('algorithm', 0.114), ('commonly', 0.113), ('simple', 0.109), ('general', 0.109), ('probabilistic', 0.107), ('set', 0.107), ('naturally', 0.101), ('greatly', 0.098), ('low', 0.096), ('directed', 0.094), ('falls', 0.094), ('edges', 0.094), ('compensate', 0.094), ('downside', 0.094), ('modification', 0.094), ('training', 0.092), ('finicky', 0.089), ('tuning', 0.089), ('virtual', 0.085), ('apart', 0.085), ('performing', 0.085), ('indicate', 0.085), ('counts', 0.085), ('exception', 0.085), ('mine', 0.085), ('existing', 0.084), ('either', 0.084), ('nodes', 0.081), ('node', 0.081), ('primitive', 0.081), ('converges', 0.081), ('parametric', 0.081), ('matching', 0.078), ('big', 0.075), ('using', 0.075)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 160 hunch net-2006-03-02-Why do people count for learning?


2 0.22651605 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is

3 0.14468879 237 hunch net-2007-04-02-Contextual Scaling

Introduction: Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following: Marginal . In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information. First context . A clever person realizes that some bit of information x 1 could be helpful. If x 1 is discrete, they condition on it and make a predictor h(x 1 ) , typically by counting. If they are clever, then they also do some smoothing. If x 1 is some real valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. Second . Another clever person (or perhaps the s

4 0.14282899 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

5 0.13531291 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

6 0.13222758 12 hunch net-2005-02-03-Learning Theory, by assumption

7 0.12621137 43 hunch net-2005-03-18-Binomial Weighting

8 0.12219325 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

9 0.11981238 34 hunch net-2005-03-02-Prior, “Prior” and Bias

10 0.112537 351 hunch net-2009-05-02-Wielding a New Abstraction

11 0.11197671 104 hunch net-2005-08-22-Do you believe in induction?

12 0.10799025 148 hunch net-2006-01-13-Benchmarks for RL

13 0.10429551 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

14 0.10364416 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

15 0.10103535 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

16 0.10037082 347 hunch net-2009-03-26-Machine Learning is too easy

17 0.10020778 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

18 0.099867336 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.099569447 95 hunch net-2005-07-14-What Learning Theory might do

20 0.098559923 388 hunch net-2010-01-24-Specializations of the Master Problem


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.252), (1, 0.123), (2, 0.008), (3, 0.022), (4, 0.025), (5, -0.056), (6, -0.017), (7, 0.073), (8, 0.115), (9, -0.055), (10, -0.069), (11, 0.015), (12, 0.058), (13, -0.043), (14, 0.065), (15, -0.085), (16, -0.07), (17, -0.026), (18, 0.072), (19, -0.031), (20, -0.074), (21, -0.023), (22, 0.113), (23, -0.026), (24, 0.011), (25, 0.041), (26, 0.019), (27, -0.0), (28, 0.034), (29, -0.111), (30, 0.072), (31, -0.01), (32, 0.046), (33, 0.054), (34, 0.067), (35, 0.004), (36, -0.052), (37, 0.071), (38, 0.034), (39, 0.035), (40, -0.009), (41, -0.065), (42, -0.014), (43, 0.02), (44, -0.018), (45, 0.0), (46, -0.023), (47, 0.059), (48, 0.058), (49, 0.018)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95800352 160 hunch net-2006-03-02-Why do people count for learning?


2 0.85270745 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous


3 0.72870797 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin

4 0.69593048 34 hunch net-2005-03-02-Prior, “Prior” and Bias

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt

5 0.69510847 43 hunch net-2005-03-18-Binomial Weighting

Introduction: Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y. When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log 2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated. Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual clas

6 0.68425804 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

7 0.67597139 62 hunch net-2005-04-26-To calibrate or not?

8 0.67074621 237 hunch net-2007-04-02-Contextual Scaling

9 0.65883476 165 hunch net-2006-03-23-The Approximation Argument

10 0.62531662 217 hunch net-2006-11-06-Data Linkage Problems

11 0.60278565 218 hunch net-2006-11-20-Context and the calculation misperception

12 0.60153073 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

13 0.59355712 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

14 0.58828491 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

15 0.57345206 185 hunch net-2006-06-16-Regularization = Robustness

16 0.56841433 104 hunch net-2005-08-22-Do you believe in induction?

17 0.56233078 235 hunch net-2007-03-03-All Models of Learning have Flaws

18 0.55448556 7 hunch net-2005-01-31-Watchword: Assumption

19 0.5412724 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

20 0.53705341 12 hunch net-2005-02-03-Learning Theory, by assumption


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.037), (9, 0.048), (10, 0.025), (27, 0.22), (38, 0.087), (53, 0.051), (55, 0.093), (60, 0.24), (77, 0.014), (94, 0.083), (95, 0.019)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.88630068 160 hunch net-2006-03-02-Why do people count for learning?


2 0.86142319 424 hunch net-2011-02-17-What does Watson mean?

Introduction: Watson convincingly beat the best champion Jeopardy! players. The apparent significance of this varies hugely, depending on your background knowledge about the related machine learning, NLP, and search technology. For a random person, this might seem evidence of serious machine intelligence, while for people working on the system itself, it probably seems like a reasonably good assemblage of existing technologies with several twists to make the entire system work. Above all, I think we should congratulate the people who managed to put together and execute this project—many years of effort by a diverse set of highly skilled people were needed to make this happen. In academia, it’s pretty difficult for one professor to assemble that quantity of talent, and in industry it’s rarely the case that such a capable group has both a worthwhile project and the support needed to pursue something like this for several years before success. Alina invited me to the Jeopardy watching party

3 0.84703356 198 hunch net-2006-07-25-Upcoming conference

Introduction: The Workshop for Women in Machine Learning will be held in San Diego on October 4, 2006. For details see the workshop website: http://www.seas.upenn.edu/~wiml/

4 0.74810493 379 hunch net-2009-11-23-ICML 2009 Workshops (and Tutorials)

Introduction: I’m the workshops chair for ICML this year. As such, I would like to personally encourage people to consider running a workshop. My general view of workshops is that they are excellent as opportunities to discuss and develop research directions—some of my best work has come from collaborations at workshops and several workshops have substantially altered my thinking about various problems. My experience running workshops is that setting them up and making them fly often appears much harder than it actually is, and the workshops often come off much better than expected in the end. Submissions are due January 18, two weeks before papers. Similarly, Ben Taskar is looking for good tutorials , which is complementary. Workshops are about exploring a subject, while a tutorial is about distilling it down into an easily taught essence, a vital part of the research process. Tutorials are due February 13, two weeks after papers.

5 0.73548669 159 hunch net-2006-02-27-The Peekaboom Dataset

Introduction: Luis von Ahn ‘s Peekaboom project has yielded data (830MB). Peekaboom is the second attempt (after Espgame ) to produce a dataset which is useful for learning to solve vision problems based on voluntary game play. As a second attempt, it is meant to address all of the shortcomings of the first attempt. In particular: The locations of specific objects are provided by the data. The data collection is far more complete and extensive. The data consists of: The source images. (1 file per image, just short of 60K images.) The in-game events. (1 file per image, in a lispy syntax.) A description of the event language. There is a great deal of very specific and relevant data here so the hope that this will help solve vision problems seems quite reasonable.

6 0.7287423 259 hunch net-2007-08-19-Choice of Metrics

7 0.72565991 95 hunch net-2005-07-14-What Learning Theory might do

8 0.72164702 343 hunch net-2009-02-18-Decision by Vetocracy

9 0.71963638 403 hunch net-2010-07-18-ICML & COLT 2010

10 0.71779227 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.71764374 325 hunch net-2008-11-10-ICML Reviewing Criteria

12 0.71754217 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006

13 0.71748435 98 hunch net-2005-07-27-Not goal metrics

14 0.71702546 351 hunch net-2009-05-02-Wielding a New Abstraction

15 0.71672428 194 hunch net-2006-07-11-New Models

16 0.71632344 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

17 0.7160328 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

18 0.71594554 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

19 0.71588331 360 hunch net-2009-06-15-In Active Learning, the question changes

20 0.71524292 131 hunch net-2005-11-16-The Everything Ensemble Edge