hunch_net hunch_net-2006 hunch_net-2006-191 knowledge-graph by maker-knowledge-mining

191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


meta info for this blog

Source: html

Introduction: A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short, I was thinking that “Bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996): consider a die (6 sides) and the prior knowledge E[X]=3.5. Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A=“X is an odd number”. The Bayesian posterior P(X|A) ∝ P(A|X) P(X), which gives (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[indicator function of A]=1 leads to approximately (.22, 0, .32, 0, .47, 0)!! (Note that E[indicator function of A] = P(A).) Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have a mean of 3.5.
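
To make the disagreement concrete, here is a minimal sketch (mine, not from the original post) that reproduces both distributions with only the Python standard library. The MaxEnt solution uses the standard exponential-family form p(x) proportional to exp(lam * x) restricted to the odd sides, with lam found by bisection; variable names and the solver are my own choices.

    import math

    sides = [1, 2, 3, 4, 5, 6]
    odd_sides = [1, 3, 5]

    # Bayesian update of the uniform prior on evidence A = "X is odd":
    # zero out the even sides and renormalize.
    posterior = [(1.0 / 6 if x % 2 == 1 else 0.0) for x in sides]
    z = sum(posterior)
    posterior = [p / z for p in posterior]   # (1/3, 0, 1/3, 0, 1/3, 0)

    # MaxEnt with E[X] = 3.5 and P(A) = 1: p(x) proportional to
    # exp(lam * x) on {1, 3, 5}; solve for lam so the mean is 3.5.
    def mean(lam):
        w = [math.exp(lam * x) for x in odd_sides]
        return sum(x * wi for x, wi in zip(odd_sides, w)) / sum(w)

    lo, hi = -20.0, 20.0
    for _ in range(200):                     # bisection; mean() increases in lam
        mid = (lo + hi) / 2
        if mean(mid) < 3.5:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * x) for x in odd_sides]
    maxent = [wi / sum(w) for wi in w]       # weights on sides 1, 3, 5

    print([round(p, 2) for p in posterior])  # [0.33, 0.0, 0.33, 0.0, 0.33, 0.0]
    print([round(p, 2) for p in maxent])     # [0.22, 0.32, 0.47]

The two computations answer different questions, which is exactly the resolution quoted in the summary below.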


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short, I was thinking that “Bayesian belief updating” and “maximum entropy” were two orthogonal principles. [sent-2, score-0.52]

2 But it appears that they are not, and that they can even be in conflict! [sent-3, score-0.083]

3 Example (from Kass 1996): consider a die (6 sides) and the prior knowledge E[X]=3.5. [sent-4, score-0.224]

4 Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). [sent-6, score-0.384]

5 Now consider a new piece of evidence A=“X is an odd number”. The Bayesian posterior P(X|A) ∝ P(A|X) P(X), which gives (1/3, 0, 1/3, 0, 1/3, 0). [sent-7, score-0.417]

6 (Note that E[indicator function of A] = P(A).) Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. [sent-14, score-0.342]

7 For Bayesian updating, P(X|A) doesn’t have to have a mean of 3.5. [sent-16, score-0.223]

8 MaxEnt and Bayesian updating are two different principles leading to different belief distributions. [sent-20, score-1.047]

9 I don’t believe there is any paradox at all between MaxEnt (perhaps more generally, MinRelEnt) and Bayesian updates. [sent-22, score-0.177]

10 The implication of the problem is that the ensemble average 3.5 … [sent-24, score-0.262]

11 That is, we no longer believe the constraint E[X]=3.5 once we have the additional data that X is an odd number. [sent-26, score-0.209] [sent-27, score-0.161]

12 The sequential update using minimum relative entropy is identical to Bayes rule and produces the correct answer (see the short derivation after this list). [sent-28, score-0.822]

13 These two answers are simply (correct) answers to different questions. [sent-29, score-0.451]
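
As a check on sentence 12, here is a short derivation (mine, not from the post or its comments) of why minimum relative entropy with the single hard constraint P(A)=1 reproduces Bayes rule. For any distribution p supported on A and prior q,

    \mathrm{KL}(p \,\|\, q) = \sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
      = \mathrm{KL}\big(p \,\|\, q(\cdot \mid A)\big) + \log \frac{1}{q(A)},

so the minimizer is

    p^*(x) = \frac{q(x)\,\mathbf{1}_A(x)}{q(A)} = q(x \mid A),

which is exact Bayesian conditioning. The apparent paradox comes entirely from re-imposing E[X]=3.5 as a hard constraint after seeing A; drop it, and MinRelEnt and Bayes agree.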


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('maxent', 0.604), ('updating', 0.268), ('entropy', 0.228), ('bayesian', 0.223), ('indicator', 0.199), ('odd', 0.161), ('leads', 0.156), ('maximum', 0.136), ('answers', 0.134), ('longer', 0.131), ('belief', 0.113), ('correct', 0.112), ('consider', 0.112), ('different', 0.11), ('average', 0.108), ('sides', 0.107), ('straight', 0.107), ('paradox', 0.099), ('thinking', 0.094), ('probable', 0.094), ('blei', 0.094), ('conclusion', 0.086), ('conflict', 0.083), ('function', 0.082), ('implication', 0.08), ('identical', 0.08), ('believe', 0.078), ('thanks', 0.078), ('pointing', 0.078), ('sequential', 0.078), ('produces', 0.078), ('indeed', 0.078), ('posterior', 0.076), ('principle', 0.076), ('weeks', 0.074), ('leading', 0.074), ('ensemble', 0.074), ('constraint', 0.074), ('two', 0.073), ('piece', 0.068), ('spent', 0.068), ('minimum', 0.064), ('rule', 0.061), ('update', 0.061), ('constraints', 0.061), ('relative', 0.06), ('bayes', 0.059), ('numbers', 0.058), ('ago', 0.057), ('david', 0.057)]
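
For readers wondering where the simValue numbers in the lists below come from: the mining pipeline's details aren't given, but a natural guess is cosine similarity between sparse tfidf vectors like the (wordName, wordTfidf) list above. A minimal sketch, with the second post's weights invented purely for illustration:

    import math

    def cosine(u, v):
        # u, v: dicts mapping word -> tfidf weight
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    post_191 = {'maxent': 0.604, 'updating': 0.268, 'entropy': 0.228}  # from above
    post_60 = {'bayesian': 0.42, 'posterior': 0.31, 'updating': 0.12}  # hypothetical
    print(cosine(post_191, post_191))  # 1.0, matching the same-blog entry
    print(cosine(post_191, post_60))   # some value in (0, 1)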

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


2 0.16455606 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the Mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe …

3 0.131466 185 hunch net-2006-06-16-Regularization = Robustness

Introduction: The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math, argmax_p H(p) s.t. E_p[f_i] = c_i is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.) A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here. In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family …

4 0.11298203 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply. The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x). After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions, which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

5 0.10052036 107 hunch net-2005-09-05-Site Update

Introduction: I tweaked the site in a number of ways today, including: Updating to WordPress 1.5. Installing and heavily tweaking the Geekniche theme. Update: I switched back to a tweaked version of the old theme. Adding the Customizable Post Listings plugin. Installing the StatTraq plugin. Updating some of the links. I particularly recommend looking at the computer research policy blog. Adding threaded comments. This doesn’t thread old comments obviously, but the extra structure may be helpful for new ones. Overall, I think this is an improvement, and it addresses a few of my earlier problems. If you have any difficulties or anything seems “not quite right”, please speak up. A few other tweaks to the site may happen in the near future.

6 0.083205514 8 hunch net-2005-02-01-NIPS: Online Bayes

7 0.080351837 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

8 0.068656355 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

9 0.065441012 5 hunch net-2005-01-26-Watchword: Probability

10 0.063682824 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

11 0.062645406 458 hunch net-2012-03-06-COLT-ICML Open Questions and ICML Instructions

12 0.062091857 16 hunch net-2005-02-09-Intuitions from applied learning

13 0.060491852 39 hunch net-2005-03-10-Breaking Abstractions

14 0.059133757 131 hunch net-2005-11-16-The Everything Ensemble Edge

15 0.05903348 127 hunch net-2005-11-02-Progress in Active Learning

16 0.059026018 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

17 0.058660932 34 hunch net-2005-03-02-Prior, “Prior” and Bias

18 0.058092725 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning

19 0.057559121 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

20 0.057309087 347 hunch net-2009-03-26-Machine Learning is too easy


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.12), (1, 0.037), (2, 0.007), (3, 0.007), (4, 0.012), (5, -0.004), (6, -0.013), (7, 0.044), (8, 0.099), (9, -0.027), (10, 0.006), (11, -0.026), (12, 0.027), (13, -0.039), (14, 0.06), (15, -0.062), (16, -0.112), (17, -0.047), (18, 0.027), (19, -0.046), (20, -0.046), (21, 0.029), (22, 0.082), (23, -0.042), (24, 0.014), (25, -0.005), (26, -0.025), (27, -0.006), (28, 0.059), (29, 0.03), (30, 0.043), (31, -0.06), (32, 0.007), (33, -0.025), (34, 0.063), (35, 0.031), (36, 0.012), (37, 0.008), (38, -0.027), (39, 0.131), (40, -0.031), (41, 0.015), (42, -0.016), (43, 0.005), (44, 0.023), (45, -0.061), (46, 0.137), (47, -0.073), (48, -0.001), (49, 0.038)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.983527 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


2 0.71737736 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


3 0.68901205 165 hunch net-2006-03-23-The Approximation Argument


4 0.63409609 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something …

5 0.5904789 39 hunch net-2005-03-10-Breaking Abstractions

Introduction: Sam Roweis’s comment reminds me of a more general issue that comes up in doing research: abstractions always break. Real numbers aren’t. Most real numbers cannot be represented with any machine. One implication of this is that many real-number based algorithms have difficulties when implemented with floating point numbers. The box on your desk is not a Turing machine. A Turing machine can compute anything computable, given sufficient time. A typical computer fails terribly when the state required for the computation exceeds some limit. Nash equilibria aren’t equilibria. This comes up when trying to predict human behavior based on the result of the equilibria computation. Often, it doesn’t work. The probability isn’t. Probability is an abstraction expressing either our lack of knowledge (the Bayesian viewpoint) or fundamental randomization (the frequentist viewpoint). From the frequentist viewpoint the lack of knowledge typically precludes actually knowing the fu…

6 0.56911141 34 hunch net-2005-03-02-Prior, “Prior” and Bias

7 0.55418533 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

8 0.49706876 5 hunch net-2005-01-26-Watchword: Probability

9 0.4879398 160 hunch net-2006-03-02-Why do people count for learning?

10 0.43503737 253 hunch net-2007-07-06-Idempotent-capable Predictors

11 0.43231708 123 hunch net-2005-10-16-Complexity: It’s all in your head

12 0.41198307 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

13 0.403835 16 hunch net-2005-02-09-Intuitions from applied learning

14 0.40029991 7 hunch net-2005-01-31-Watchword: Assumption

15 0.39995527 217 hunch net-2006-11-06-Data Linkage Problems

16 0.39327386 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

17 0.39326 107 hunch net-2005-09-05-Site Update

18 0.39257845 8 hunch net-2005-02-01-NIPS: Online Bayes

19 0.38845766 222 hunch net-2006-12-05-Recruitment Conferences

20 0.38069066 57 hunch net-2005-04-16-Which Assumptions are Reasonable?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.168), (38, 0.053), (53, 0.138), (55, 0.072), (66, 0.326), (94, 0.064), (95, 0.056)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.82445389 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?


2 0.7904858 385 hunch net-2009-12-27-Interesting things at NIPS 2009

Introduction: Several papers at NIPS caught my attention. Elad Hazan and Satyen Kale, Online Submodular Optimization: They define an algorithm for online optimization of submodular functions with regret guarantees. This places submodular optimization roughly on par with online convex optimization as tractable settings for online learning. Elad Hazan and Satyen Kale, On Stochastic and Worst-Case Models of Investing: At its core, this is yet another example of modifying worst-case online learning to deal with variance, but the application to financial models is particularly cool and it seems plausibly superior to other common approaches for financial modeling. Mark Palatucci, Dean Pomerlau, Tom Mitchell, and Geoff Hinton, Zero Shot Learning with Semantic Output Codes: The goal here is predicting a label in a multiclass supervised setting where the label never occurs in the training data. They have some basic analysis and also a nice application to fMRI brain reading. Sh…

3 0.77317411 374 hunch net-2009-10-10-ALT 2009

Introduction: I attended ALT (“Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength. One paper that might interest people generally is: Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.

4 0.57989913 141 hunch net-2005-12-17-Workshops as Franchise Conferences

Introduction: Founding a successful new conference is extraordinarily difficult. As a conference founder, you must manage to attract a significant number of good papers—enough to entice the participants into participating next year and (generally) to grow the conference. For someone choosing to participate in a new conference, there is a very significant decision to make: do you send a paper to some new conference with no guarantee that the conference will work out? Or do you send it to another (possibly less related) conference that you are sure will work? The conference founding problem is a joint agreement problem with a very significant barrier. Workshops are a way around this problem, and workshops attached to conferences are a particularly effective means for this. A workshop at a conference is sure to have people available to speak and attend and is sure to have a large audience available. Presenting work at a workshop is not generally exclusive: it can also be presented at a conference …

5 0.57819426 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

Introduction: Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous. This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers. For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through th…

6 0.57249665 201 hunch net-2006-08-07-The Call of the Deep

7 0.57048386 370 hunch net-2009-09-18-Necessary and Sufficient Research

8 0.56898618 131 hunch net-2005-11-16-The Everything Ensemble Edge

9 0.56885058 19 hunch net-2005-02-14-Clever Methods of Overfitting

10 0.56858355 151 hunch net-2006-01-25-1 year

11 0.56570429 134 hunch net-2005-12-01-The Webscience Future

12 0.56558609 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

13 0.56460696 297 hunch net-2008-04-22-Taking the next step

14 0.56244886 12 hunch net-2005-02-03-Learning Theory, by assumption

15 0.56218439 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

16 0.5614841 358 hunch net-2009-06-01-Multitask Poisoning

17 0.55964208 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

18 0.55810744 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

19 0.55777723 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

20 0.55770385 207 hunch net-2006-09-12-Incentive Compatible Reviewing