hunch_net hunch_net-2005 hunch_net-2005-34 knowledge-graph by maker-knowledge-mining

34 hunch net-2005-03-02-Prior, “Prior” and Bias


meta info for this blog

Source: html

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subtleties.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. [sent-1, score-1.328]

2 Examples include Bayesian reasoning, prediction bounds, and online learning. [sent-2, score-0.278]

3 One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. [sent-3, score-1.269]

4 Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. [sent-4, score-2.013]

5 “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. [sent-5, score-1.117]

6 Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. [sent-6, score-1.16]

7 This only scratches the surface—there are yet more subtleties. [sent-7, score-0.052]

8 For example the (as mentioned in meaning of probability ) shifts from one viewpoint to another. [sent-8, score-0.779]
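The three notions of preference above can be sketched in code. This is a minimal illustration with made-up models and numbers, not code from the post:

```python
# Illustrative only: made-up models and numbers, not code from the post.
import numpy as np

# Prior (Bayesian): a probability distribution over candidate
# data-generating distributions, here two coin models.
prior = {"fair": 0.5, "biased": 0.5}        # P(model); sums to 1
likelihood = {"fair": 0.5, "biased": 0.9}   # P(heads | model)

# "Prior" (prediction bounds / online learning): a measure over
# classifiers expressing hope that each predicts well; it carries
# no generative claim and need not sum to 1.
hope = {"clf_a": 2.0, "clf_b": 1.0}

# Bias (regularization): an implicit preference, e.g. an l2 penalty
# that prefers small-norm predictors.
def regularized_loss(w, X, y, lam=0.1):
    return np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)
```

The point of the contrast: only the first object claims to describe how data is generated; the second merely weights hypotheses, and the third never appears as an explicit distribution at all.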


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('predictor', 0.382), ('expresses', 0.322), ('prior', 0.291), ('reasoning', 0.257), ('saying', 0.234), ('probability', 0.208), ('distribution', 0.202), ('meaning', 0.187), ('prefer', 0.187), ('bias', 0.174), ('preferring', 0.149), ('surface', 0.149), ('bounds', 0.144), ('shifts', 0.141), ('bayesian', 0.134), ('specified', 0.114), ('implicitly', 0.114), ('regularization', 0.114), ('arises', 0.108), ('generating', 0.108), ('online', 0.102), ('distributions', 0.1), ('mentioned', 0.1), ('prediction', 0.099), ('manner', 0.098), ('degree', 0.097), ('suggest', 0.094), ('classifiers', 0.091), ('viewpoint', 0.088), ('network', 0.088), ('early', 0.087), ('belief', 0.085), ('set', 0.085), ('measure', 0.084), ('neural', 0.084), ('include', 0.077), ('exist', 0.076), ('classifier', 0.074), ('difficulty', 0.073), ('training', 0.073), ('etc', 0.067), ('method', 0.063), ('predict', 0.06), ('ways', 0.059), ('hope', 0.058), ('one', 0.055), ('useful', 0.052), ('yet', 0.052), ('another', 0.048), ('examples', 0.047)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 34 hunch net-2005-03-02-Prior, “Prior” and Bias


2 0.16482638 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

3 0.15926009 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
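The Bayes-law update described in this excerpt can be sketched over discrete hypotheses. The coin models and numbers below are hypothetical, not taken from the post:

```python
# Hypothetical sketch of P(D|x) = P(x|D)P(D)/P(x) over two discrete
# candidate models; not code from the post.
def posterior(prior, likelihood):
    # unnormalized posterior P(x|D)P(D), then divide by P(x)
    unnorm = {m: likelihood[m] * prior[m] for m in prior}
    z = sum(unnorm.values())  # P(x) by the law of total probability
    return {m: v / z for m, v in unnorm.items()}

prior = {"fair": 0.5, "biased": 0.5}
lik_heads = {"fair": 0.5, "biased": 0.9}  # P(heads | model)
post = posterior(prior, lik_heads)        # condition on observing heads
```

Hard decisions would then be made by choosing the action minimizing expected loss under `post`.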

4 0.15013483 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
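The betting-odds definition in this excerpt translates directly to code. A trivial sketch with made-up helper names:

```python
# A bettor indifferent at Y-to-1 odds implies probability Y/(Y+1);
# the frequentist rate is k occurrences in n trials. Hypothetical helpers.
def prob_from_odds(y):
    return y / (y + 1)

def rate(k, n):
    return k / n
```

For example, indifference at even (1-to-1) odds corresponds to probability 1/2, and at 3-to-1 odds to 3/4.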

5 0.13755475 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s
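The “fit the data plus a complexity penalty” idea in this excerpt can be sketched as ridge regression with an l2 penalty. The data and penalty values below are made up, not the post’s code:

```python
# Minimal ridge-regression sketch of "fit + l2 complexity penalty";
# data and lambda values are made up.
import numpy as np

def ridge_fit(X, y, lam):
    # argmin_w ||Xw - y||^2 + lam * ||w||^2 via the normal equations
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w_fit = ridge_fit(X, y, 0.0)    # pure fit recovers w = 2
w_reg = ridge_fit(X, y, 10.0)   # the penalty shrinks w toward 0
```

Here the penalty expresses a bias toward small-norm predictors: as lambda grows, the fitted weight is pulled away from the pure least-squares solution.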

6 0.13615434 41 hunch net-2005-03-15-The State of Tight Bounds

7 0.13550368 8 hunch net-2005-02-01-NIPS: Online Bayes

8 0.12948377 237 hunch net-2007-04-02-Contextual Scaling

9 0.12566254 123 hunch net-2005-10-16-Complexity: It’s all in your head

10 0.12374199 43 hunch net-2005-03-18-Binomial Weighting

11 0.11981238 160 hunch net-2006-03-02-Why do people count for learning?

12 0.11609358 133 hunch net-2005-11-28-A question of quantification

13 0.11045702 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

14 0.10109165 28 hunch net-2005-02-25-Problem: Online Learning

15 0.10092787 289 hunch net-2008-02-17-The Meaning of Confidence

16 0.10043959 12 hunch net-2005-02-03-Learning Theory, by assumption

17 0.1003332 235 hunch net-2007-03-03-All Models of Learning have Flaws

18 0.099542245 90 hunch net-2005-07-07-The Limits of Learning Theory

19 0.099200279 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound

20 0.09735965 170 hunch net-2006-04-06-Bounds greater than 1


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.167), (1, 0.166), (2, 0.039), (3, -0.059), (4, 0.031), (5, -0.062), (6, -0.008), (7, 0.066), (8, 0.159), (9, -0.039), (10, 0.022), (11, 0.026), (12, 0.106), (13, -0.116), (14, 0.141), (15, -0.076), (16, -0.099), (17, -0.063), (18, 0.026), (19, -0.034), (20, 0.01), (21, -0.02), (22, 0.118), (23, -0.003), (24, 0.041), (25, 0.041), (26, 0.064), (27, 0.051), (28, -0.034), (29, 0.069), (30, 0.102), (31, 0.027), (32, 0.045), (33, -0.023), (34, -0.042), (35, -0.017), (36, -0.016), (37, 0.026), (38, -0.061), (39, 0.088), (40, -0.04), (41, -0.058), (42, -0.09), (43, 0.031), (44, 0.029), (45, -0.004), (46, -0.011), (47, 0.007), (48, -0.013), (49, -0.009)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98709381 34 hunch net-2005-03-02-Prior, “Prior” and Bias


2 0.69088954 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

3 0.66398257 160 hunch net-2006-03-02-Why do people count for learning?

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes, where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B, C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N * N(C)/N] = N(A,B,C) N / [N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C , but it (nat
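The counting estimate quoted in this excerpt, P'(A|B,C) = N(A,B,C) * N / [N(B) * N(C)], can be sketched on a small made-up dataset (illustrative only, not code from the post):

```python
# Sketch of the quoted estimate P'(A|B,C) = N(A,B,C) * N / [N(B) * N(C)]
# on made-up data; not code from the post.
from collections import Counter

data = [("a1", "b1", "c1"), ("a1", "b1", "c1"),
        ("a2", "b1", "c2"), ("a1", "b2", "c1")]
n = len(data)
n_abc = Counter(data)
n_b = Counter(b for _, b, _ in data)
n_c = Counter(c for _, _, c in data)

def p_est(a, b, c):
    # relative-frequency estimate; unreliable when counts are small
    return n_abc[(a, b, c)] * n / (n_b[b] * n_c[c])
```

With only a handful of samples, as here, such estimates can be far from any true conditional probability, which is exactly the concern the excerpt raises.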

4 0.65190083 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in

5 0.64902908 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
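The multiplication described in this excerpt can be sketched as a toy binary Naive Bayes classifier. The labels, features, and probabilities below are all made up, and logs are used so the products stay numerically stable:

```python
# Toy sketch of prediction by maximizing
# Pr'(label) * prod_i Pr'(feature_i | label); all numbers made up.
import math

p_label = {"spam": 0.4, "ham": 0.6}
# Pr'(feature_i = 1 | label) for two binary features
p_feat = {"spam": [0.8, 0.7], "ham": [0.2, 0.3]}

def predict(features):
    def log_score(lab):
        # summing logs multiplies the estimated probabilities
        s = math.log(p_label[lab])
        for p, f in zip(p_feat[lab], features):
            s += math.log(p if f else 1 - p)
        return s
    return max(p_label, key=log_score)
```

The design flaw the excerpt warns about lives in the estimated factors: multiplying many slightly-off estimates compounds their errors, even though the prediction rule itself is simple.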

6 0.6235072 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

7 0.61710441 123 hunch net-2005-10-16-Complexity: It’s all in your head

8 0.61642271 165 hunch net-2006-03-23-The Approximation Argument

9 0.61455834 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

10 0.58579528 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound

11 0.55104727 237 hunch net-2007-04-02-Contextual Scaling

12 0.5482679 33 hunch net-2005-02-28-Regularization

13 0.54272836 62 hunch net-2005-04-26-To calibrate or not?

14 0.54170161 43 hunch net-2005-03-18-Binomial Weighting

15 0.54111201 8 hunch net-2005-02-01-NIPS: Online Bayes

16 0.53173476 133 hunch net-2005-11-28-A question of quantification

17 0.52553374 12 hunch net-2005-02-03-Learning Theory, by assumption

18 0.50774276 39 hunch net-2005-03-10-Breaking Abstractions

19 0.50451857 218 hunch net-2006-11-20-Context and the calculation misperception

20 0.49602136 217 hunch net-2006-11-06-Data Linkage Problems


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.093), (27, 0.251), (38, 0.044), (53, 0.086), (55, 0.116), (80, 0.266), (94, 0.016)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.87818301 34 hunch net-2005-03-02-Prior, “Prior” and Bias


2 0.85723901 222 hunch net-2006-12-05-Recruitment Conferences

Introduction: One of the subsidiary roles of conferences is recruitment. NIPS is optimally placed in time for this because it falls right before the major recruitment season. I personally found job hunting embarrassing, and was relatively inept at it. I expect this is true of many people, because it is not something done often. The basic rule is: make the plausible hirers aware of your interest. Any corporate sponsor is a “plausible”, regardless of whether or not there is a booth. CRA and the ACM job center are other reasonable sources. There are substantial differences between the different possibilities. Putting some effort into understanding the distinctions is a good idea, although you should always remember where the other person is coming from.

3 0.84534562 68 hunch net-2005-05-10-Learning Reductions are Reductionist

Introduction: This is about a fundamental motivation for the investigation of reductions in learning. It applies to many pieces of work other than my own. The reductionist approach to problem solving is characterized by taking a problem, decomposing it into as-small-as-possible subproblems, discovering how to solve the subproblems, and then discovering how to use the solutions to the subproblems to solve larger problems. The reductionist approach to solving problems has often paid off very well. Computer science related examples of the reductionist approach include: Reducing computation to the transistor. All of our CPUs are built from transistors. Reducing rendering of images to rendering a triangle (or other simple polygons). Computers can now render near-realistic scenes in real time. The big breakthrough came from learning how to render many triangles quickly. This approach to problem solving extends well beyond computer science. Many fields of science focus on theories mak

4 0.78832418 146 hunch net-2006-01-06-MLTV

Introduction: As part of a PASCAL project, the Slovenians have been filming various machine learning events and placing them on the web here. This includes, for example, the Chicago 2005 Machine Learning Summer School as well as a number of other summer schools, workshops, and conferences. There are some significant caveats here—for example, I can’t access it from Linux. Based upon the webserver logs, I expect that is a problem for most people—Computer scientists are particularly nonstandard in their choice of computing platform. Nevertheless, the core idea here is excellent and details of compatibility can be fixed later. With modern technology toys, there is no fundamental reason why the process of announcing new work at a conference should happen only once and only for the people who could make it to that room in that conference. The problems solved include: The multitrack vs. single-track debate. (“Sometimes the single track doesn’t interest me” vs. “When it’s multitrack I mis

5 0.7272082 141 hunch net-2005-12-17-Workshops as Franchise Conferences

Introduction: Founding a successful new conference is extraordinarily difficult. As a conference founder, you must manage to attract a significant number of good papers—enough to entice the participants into participating next year and to (generally) to grow the conference. For someone choosing to participate in a new conference, there is a very significant decision to make: do you send a paper to some new conference with no guarantee that the conference will work out? Or do you send it to another (possibly less related) conference that you are sure will work? The conference founding problem is a joint agreement problem with a very significant barrier. Workshops are a way around this problem, and workshops attached to conferences are a particularly effective means for this. A workshop at a conference is sure to have people available to speak and attend and is sure to have a large audience available. Presenting work at a workshop is not generally exclusive: it can also be presented at a confe

6 0.709454 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

7 0.70742577 484 hunch net-2013-06-16-Representative Reviewing

8 0.70233655 289 hunch net-2008-02-17-The Meaning of Confidence

9 0.70139205 183 hunch net-2006-06-14-Explorations of Exploration

10 0.69726908 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

11 0.69255078 309 hunch net-2008-07-10-Interesting papers, ICML 2008

12 0.68895996 194 hunch net-2006-07-11-New Models

13 0.68071753 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

14 0.68071496 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

15 0.6783511 225 hunch net-2007-01-02-Retrospective

16 0.67687172 134 hunch net-2005-12-01-The Webscience Future

17 0.67682439 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

18 0.67412829 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

19 0.67403507 304 hunch net-2008-06-27-Reviewing Horror Stories

20 0.67398798 493 hunch net-2014-02-16-Metacademy: a package manager for knowledge