hunch_net hunch_net-2007 hunch_net-2007-263 knowledge-graph by maker-knowledge-mining

263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


meta info for this blog

Source: html

Introduction: I have recently completed a 500+ page book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. [sent-4, score-0.73]

2 By some theorems due to Andrew Barron , MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). [sent-5, score-0.365]

3 MDL is always based on designing a universal code L relative to some given model M. [sent-10, score-0.559]

4 Informally this is a code such that whenever some distribution P in M can be used to compress some data set well, then L will compress this data set well as well (I’ll skip the formal definition here). [sent-11, score-0.733]
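For readers who want the formal version the author skips here, one standard way to state it is sketched below; this is my paraphrase of the usual MDL definition, not a quote from the post or the book, and the exact conditions vary across treatments.

```latex
% Sketch of the standard definition (paraphrase): a code with length function L
% is universal relative to a model M if, for every P in M, the expected number
% of extra bits per outcome vanishes as the sample size n grows:
\[
  \lim_{n \to \infty} \frac{1}{n}\,
  \mathbb{E}_{Z^n \sim P}\!\Bigl[\, L(Z^n) - \bigl(-\log P(Z^n)\bigr) \Bigr] \;=\; 0
  \qquad \text{for all } P \in \mathcal{M}.
\]
% In words: whichever P in M generated the data, coding with L loses only o(n)
% bits compared to the best code in M chosen with hindsight.
```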

5 One method (but by no means the only method) for designing a universal code relative to model M is by taking some prior W on M and using the corresponding Shannon-Fano code, i.e. [sent-12, score-0.621]

6 the code that encodes data z with length L(z) = −log P_Bayes(z), where P_Bayes(·) [sent-14, score-0.646]

7 If M is parametric, then with just about any ‘smooth’ prior, the Bayesian code with lengths L(z) = −log P_Bayes(z) leads to a reasonable universal code. [sent-17, score-0.475]
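A minimal sketch of what sentences 5-7 describe, for the simplest possible case: the Bernoulli model class with a uniform Beta(1,1) prior. This is my own illustration, not code from the post or the book, and the function names are mine.

```python
# Minimal sketch (illustrative, not from the book): the Bayesian universal
# code for the Bernoulli model class with a uniform Beta(1,1) prior assigns
# code length L(z) = -log2 P_Bayes(z) to a binary sequence z.
import math

def bayes_code_length(z):
    """-log2 of the Bayes marginal likelihood P_Bayes(z) under Beta(1,1).

    The chain rule turns the marginal into a product of predictive
    probabilities, which for Beta(1,1) is the Laplace rule of succession:
    P(next bit = 1 | k ones out of n) = (k + 1) / (n + 2).
    """
    log2p, ones, n = 0.0, 0, 0
    for bit in z:
        p_one = (ones + 1) / (n + 2)
        log2p += math.log2(p_one if bit == 1 else 1.0 - p_one)
        ones += bit
        n += 1
    return -log2p

def best_hindsight_code_length(z):
    """-log2 P_theta(z) for the best Bernoulli parameter chosen in hindsight."""
    n, k = len(z), sum(z)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(k * math.log2(p) + (n - k) * math.log2(1.0 - p))

z = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 10  # 100 bits, 80% ones
print("Bayes code length:        %.1f bits" % bayes_code_length(z))
print("Best-in-hindsight length: %.1f bits" % best_hindsight_code_length(z))
# The gap (the regret) grows only like (1/2) * log2(n) plus a constant, which
# is the sense in which the Bayesian code is universal for this one-parameter
# model.
```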

8 But if M is nonparametric (infinite dimensional, such as in Gaussian process regression, or histogram density estimation with an arbitrary number of components) then many priors which are perfectly fine according to Bayesian theory are ruled out by MDL theory. [sent-18, score-0.396]

9 The reason is that for some P in M, the Bayesian codes based on such priors do not compress data sampled from P at all, even if the amount of data tends to infinity. [sent-19, score-0.8]

10 One can formally prove that such Bayesian codes are not “universal” according to the standard definition of universality. [sent-20, score-0.21]

11 Now there exist two theorems by Andrew Barron (from 1991 and 1998, respectively) that directly connect data compression with frequentist statistical consistency. [sent-21, score-0.386]

12 In essence, they imply that estimation based on universal codes must always be statistically consistent (the theorems also directly connect the convergence rates to the amount of compression obtained). [sent-22, score-0.84]
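One way to see the compression-consistency link mentioned in sentence 12 is sketched below; this is my paraphrase of the standard argument, not the exact statements of Barron's theorems.

```latex
% Paraphrase of the link (not the exact 1991/1998 theorem statements).
% Define the redundancy of a universal code L on data Z^n ~ P as
\[
  R_n \;=\; \mathbb{E}_P\!\left[\, L(Z^n) + \log P(Z^n) \,\right].
\]
% For the Bayesian (prequential) code, L(z^n) = -\log P_{\mathrm{Bayes}}(z^n),
% and the chain rule decomposes R_n into cumulative prediction errors:
\[
  R_n \;=\; \sum_{i=1}^{n}
    \mathbb{E}_P\!\left[\, D\!\left( P(\cdot \mid Z^{i-1}) \;\middle\|\;
      P_{\mathrm{Bayes}}(\cdot \mid Z^{i-1}) \right) \right],
\]
% so universality (R_n = o(n)) forces the average KL risk of the predictive
% distributions to vanish, and the rate at which R_n grows controls the rate
% of convergence.
```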

13 These say that, for some nonparametric models M, and with some priors on M, Bayesian inference can be inconsistent, in the sense that for some P in M, if data are i.i.d. [sent-24, score-0.461]

14 In fact, MDL-based reasoning can also motivate certain prior choices in nonparametric contexts. [sent-29, score-0.226]

15 In general, it is often thought that different priors on M lead to codes that compress data better for some P in M, and worse for other P in M. [sent-33, score-0.842]

16 But in nonparametric contexts, it is not like that: there exist priors with “universally good” and “universally bad” coding properties. [sent-34, score-0.436]

17 If the model M is nonparametric and P is in M, then MDL is consistent, but Bayes is not necessarily so. [sent-37, score-0.213]

18 For example, what if M is the set of all Gaussian mixtures with arbitrarily many components, and P is not a Gaussian mixture, but can be arbitrarily well approximated (in the sense of KL divergence) by a sequence of Gaussian mixtures with ever more components? [sent-40, score-0.242]
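A purely illustrative numerical check of the setup in sentence 18, with my own choice of target (a Laplace density, which is not a finite Gaussian mixture) and of tooling (scipy and scikit-learn): the Monte Carlo estimate of KL(P || Q_k) should shrink as the number of mixture components k grows.

```python
# Illustrative check (my own, not from the post): a Laplace density is not a
# finite Gaussian mixture, but k-component mixtures fitted to samples from it
# approximate it increasingly well in KL divergence as k grows.
import numpy as np
from scipy.stats import laplace
from sklearn.mixture import GaussianMixture

train = laplace.rvs(size=20000, random_state=0).reshape(-1, 1)
test = laplace.rvs(size=20000, random_state=1).reshape(-1, 1)

for k in (1, 2, 4, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(train)
    # Monte Carlo estimate of KL(P || Q_k) = E_P[log p(X) - log q_k(X)]
    kl = np.mean(laplace.logpdf(test[:, 0]) - gmm.score_samples(test))
    print(f"k = {k}: estimated KL(P || Q_k) ~ {kl:.3f} nats")
# The estimates typically decrease toward 0 with k, which is the sense in
# which P is "arbitrarily well approximated" by mixtures with more components.
```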

19 It needs more data before the posterior converges than some other methods (like leave-one-out cross-validation combined with ML estimation). [sent-43, score-0.206]

20 Since the method is directly based on universal coding, I’m tempted to call it “MDL”, but the fact that nobody in the MDL community has thought about our idea before, makes me hesitate. [sent-45, score-0.25]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mdl', 0.672), ('barron', 0.212), ('bayes', 0.19), ('universal', 0.178), ('compress', 0.177), ('gaussian', 0.148), ('priors', 0.146), ('nonparametric', 0.141), ('bayesian', 0.131), ('codes', 0.118), ('data', 0.112), ('code', 0.107), ('diaconis', 0.106), ('freedman', 0.106), ('coding', 0.105), ('inconsistency', 0.094), ('statistically', 0.094), ('compression', 0.087), ('prior', 0.085), ('consistent', 0.083), ('parametric', 0.075), ('based', 0.072), ('theorems', 0.072), ('model', 0.072), ('connect', 0.071), ('jim', 0.071), ('mixtures', 0.071), ('components', 0.071), ('relative', 0.07), ('estimation', 0.065), ('rbf', 0.063), ('sampled', 0.063), ('inconsistent', 0.063), ('inference', 0.062), ('designing', 0.06), ('kernels', 0.05), ('universally', 0.05), ('infinite', 0.05), ('arbitrarily', 0.05), ('converges', 0.05), ('corresponding', 0.049), ('converge', 0.049), ('definition', 0.048), ('thus', 0.047), ('length', 0.047), ('posterior', 0.044), ('exist', 0.044), ('according', 0.044), ('therefore', 0.043), ('properties', 0.043)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


2 0.15996361 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

3 0.15615965 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
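A toy numerical illustration of the recipe in this excerpt (specify P(D), condition on the data via Bayes law, act to minimize posterior expected loss); the world names, likelihood values, and losses here are made up for the example, not taken from the post.

```python
# Toy illustration (made-up numbers) of the recipe described above:
# specify P(D), condition on observed data x via Bayes law, then choose the
# action ("turn left" / "turn right") minimizing posterior expected loss.
prior = {"world_A": 0.5, "world_B": 0.5}      # P(D)
likelihood = {"world_A": 0.8, "world_B": 0.3}  # P(x | D) for the observed x

# loss[action][world]
loss = {
    "turn_left":  {"world_A": 0.0, "world_B": 1.0},
    "turn_right": {"world_A": 1.0, "world_B": 0.0},
}

evidence = sum(prior[d] * likelihood[d] for d in prior)               # P(x)
posterior = {d: prior[d] * likelihood[d] / evidence for d in prior}   # P(D|x)

expected_loss = {
    a: sum(posterior[d] * loss[a][d] for d in posterior) for a in loss
}
decision = min(expected_loss, key=expected_loss.get)
print(posterior)       # {'world_A': 0.727..., 'world_B': 0.272...}
print(expected_loss)   # {'turn_left': 0.27..., 'turn_right': 0.72...}
print(decision)        # 'turn_left'
```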

4 0.13789663 12 hunch net-2005-02-03-Learning Theory, by assumption

Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases. No assumptions Online learning There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms. Universal Learning Using a “bias” of 2^(-description length of Turing machine) in learning is equivalent to all other computable biases up to some constant. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.

5 0.12639891 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) ≤ L_Q(S)
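The excerpt is cut off mid-theorem; as far as I recall, the bound it begins to quote has the standard mixture form sketched below. Treat this as my reconstruction via the usual compression-lemma argument, not a verbatim quote from the paper.

```latex
% Reconstruction (paraphrase) of the truncated bound. For the log loss, with
% P the prior and L_Q(S) = \mathbb{E}_{\theta \sim Q}[L_\theta(S)]:
\[
  L_{\mathrm{Bayes}}(S) \;\le\; L_Q(S) \;+\; \mathrm{KL}\!\left(Q \,\middle\|\, P\right)
  \qquad \text{for every } Q \text{ and every sequence } S,
\]
% i.e. the Bayesian mixture's cumulative log loss is never much worse than
% that of any fixed "mixture of experts" Q over the model class.
```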

6 0.11935037 140 hunch net-2005-12-14-More NIPS Papers II

7 0.10853373 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

8 0.10364416 160 hunch net-2006-03-02-Why do people count for learning?

9 0.101179 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.095532604 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

11 0.082857318 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

12 0.079447478 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

13 0.072855726 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

14 0.072577968 34 hunch net-2005-03-02-Prior, “Prior” and Bias

15 0.068656355 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

16 0.068554513 237 hunch net-2007-04-02-Contextual Scaling

17 0.06797336 95 hunch net-2005-07-14-What Learning Theory might do

18 0.065307491 347 hunch net-2009-03-26-Machine Learning is too easy

19 0.062545195 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

20 0.061857574 371 hunch net-2009-09-21-Netflix finishes (and starts)


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.146), (1, 0.079), (2, 0.007), (3, -0.019), (4, 0.047), (5, -0.016), (6, -0.009), (7, 0.044), (8, 0.125), (9, -0.074), (10, -0.005), (11, -0.032), (12, 0.02), (13, -0.054), (14, 0.072), (15, -0.103), (16, -0.075), (17, 0.0), (18, 0.093), (19, -0.155), (20, -0.001), (21, -0.009), (22, 0.055), (23, -0.054), (24, 0.011), (25, 0.004), (26, -0.006), (27, 0.035), (28, 0.006), (29, 0.013), (30, 0.042), (31, -0.074), (32, 0.032), (33, 0.027), (34, 0.079), (35, -0.034), (36, -0.03), (37, -0.035), (38, 0.075), (39, 0.073), (40, -0.007), (41, 0.002), (42, 0.038), (43, 0.026), (44, -0.009), (45, 0.035), (46, 0.016), (47, -0.002), (48, -0.062), (49, -0.007)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97004008 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


2 0.74916023 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

3 0.74644005 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

4 0.68290907 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

Introduction: A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short I was thinking that “bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996); consider a die (6 sides), consider prior knowledge E[X]=3.5. Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A=”X is an odd number”. Bayesian posterior P(X|A) ∝ P(A|X) P(X), giving (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0)!! (note that E[Indicator function of A]=P(A)) Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have a 3.5
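A small script (my own, not from the post) that re-derives the MaxEnt numbers quoted above: maximizing entropy over the odd faces {1, 3, 5} subject to E[X] = 3.5 gives p(x) proportional to exp(lam * x), and solving for lam reproduces roughly (.22, .32, .47).

```python
# Re-derivation (illustrative script) of the MaxEnt numbers quoted above.
# Maximizing entropy over the odd faces {1, 3, 5} subject to E[X] = 3.5 gives
# p(x) proportional to exp(lam * x); solve for lam by bisection.
import math

faces = [1, 3, 5]
target_mean = 3.5

def mean_for(lam):
    weights = [math.exp(lam * x) for x in faces]
    z = sum(weights)
    return sum(x * w for x, w in zip(faces, weights)) / z

# mean_for is increasing in lam, so a simple bisection suffices.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2.0
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2.0

weights = [math.exp(lam * x) for x in faces]
z = sum(weights)
print({x: round(w / z, 2) for x, w in zip(faces, weights)})
# Prints approximately {1: 0.22, 3: 0.32, 5: 0.47}, matching the post's
# (.22, 0, .32, 0, .47, 0), whereas the Bayesian update keeps the uniform
# ratio (1/3, 1/3, 1/3) on the odd faces.
```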

5 0.68179125 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

Introduction: Say we have two random variables X, Y with mutual information I(X,Y). Let’s say we want to represent them with a Bayes net of the form X <- M -> Y, such that the entropy of M equals the mutual information, i.e. H(M) = I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy wise). The data processing inequality means that H(M) >= I(X,Y), so the mutual information is a lower bound on how simple the M could be. Furthermore, if such a construction existed it would have a nice coding interpretation — one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X). It turns out that such a construction does not exist in general (Thx Alina Beygelzimer for a counterexample! see below for the sketch). What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be
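A small illustration (mine, not from the post) of the quantities this excerpt works with, computing H(X), H(Y), and I(X;Y) for a toy joint distribution:

```python
# Small illustration (toy numbers) of the quantities involved: entropy and
# mutual information for a joint distribution over (X, Y).
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy joint P(X, Y): X and Y agree 90% of the time.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

i_xy = entropy(px) + entropy(py) - entropy(joint)  # I(X;Y) = H(X)+H(Y)-H(X,Y)
print(f"H(X) = {entropy(px):.3f}, H(Y) = {entropy(py):.3f}, "
      f"I(X;Y) = {i_xy:.3f} bits")
# The post asks for a hidden M with X <- M -> Y and H(M) = I(X;Y); the data
# processing inequality only guarantees H(M) >= I(X;Y), and the post points to
# a counterexample showing equality is not achievable in general.
```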

6 0.66531515 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

7 0.62552726 160 hunch net-2006-03-02-Why do people count for learning?

8 0.62133622 140 hunch net-2005-12-14-More NIPS Papers II

9 0.60205823 34 hunch net-2005-03-02-Prior, “Prior” and Bias

10 0.57978785 123 hunch net-2005-10-16-Complexity: It’s all in your head

11 0.54569918 235 hunch net-2007-03-03-All Models of Learning have Flaws

12 0.52831531 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

13 0.51685601 5 hunch net-2005-01-26-Watchword: Probability

14 0.50044447 12 hunch net-2005-02-03-Learning Theory, by assumption

15 0.48820749 237 hunch net-2007-04-02-Contextual Scaling

16 0.48647222 8 hunch net-2005-02-01-NIPS: Online Bayes

17 0.48585615 39 hunch net-2005-03-10-Breaking Abstractions

18 0.48473081 217 hunch net-2006-11-06-Data Linkage Problems

19 0.4649463 7 hunch net-2005-01-31-Watchword: Assumption

20 0.4630706 57 hunch net-2005-04-16-Which Assumptions are Reasonable?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.019), (1, 0.023), (3, 0.028), (26, 0.02), (27, 0.142), (38, 0.078), (53, 0.063), (55, 0.056), (62, 0.291), (77, 0.026), (94, 0.099), (95, 0.038)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.9156521 173 hunch net-2006-04-17-Rexa is live

Introduction: Rexa is now publicly available. Anyone can create an account and login. Rexa is similar to Citeseer and Google Scholar in functionality with more emphasis on the use of machine learning for intelligent information extraction. For example, Rexa can automatically display a picture on an author’s homepage when the author is searched for.

2 0.91427839 195 hunch net-2006-07-12-Who is having visa problems reaching US conferences?

Introduction: Many of the large machine learning conferences were in the US this summer. A common problem which students from abroad encounter is visa issues. Just getting a visa to visit can be pretty rough: you stand around in lines, sometimes for days. Even worse is the timing with respect to ticket buying. Airplane tickets typically need to be bought well in advance on nonrefundable terms to secure a reasonable rate for air travel. When a visa is denied, as happens reasonably often, a very expensive ticket is burnt. A serious effort is under way to raise this as an issue in need of fixing. Over the long term, effectively driving research conferences to locate outside of the US seems an unwise policy. Robert Schapire is planning to talk to a congressman. Sally Goldman suggested putting together a list of problem cases, and Phil Long set up an email address immigration.and.confs@gmail.com to collect them. If you (or someone you know) has had insurmountable difficulties reaching

3 0.86283255 394 hunch net-2010-04-24-COLT Treasurer is now Phil Long

Introduction: For about 5 years, I’ve been the treasurer of the Association for Computational Learning, otherwise known as COLT, taking over from John Case before me. A transfer of duties to Phil Long is now about complete. This probably matters to almost no one, but I wanted to describe things a bit for those interested. The immediate impetus for this decision was unhappiness over reviewing decisions at COLT 2009 , one as an author and several as a member of the program committee. I seem to have disagreements fairly often about what is important work, partly because I’m focused on learning theory with practical implications, partly because I define learning theory more broadly than is typical amongst COLT members, and partly because COLT suffers a bit from insider-clique issues. The degree to which these issues come up varies substantially each year so last year is not predictive of this one. And, it’s important to understand that COLT remains healthy with these issues not nearly so bad

4 0.85175216 128 hunch net-2005-11-05-The design of a computing cluster

Introduction: This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI so this is some evidence of what is feasible and thoughts about the design choices. Architecture There are several architectural choices. AMD Athlon64 based system. This seems to have the cheapest bang/buck. Maximum RAM is typically 2-3GB. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type. Intel PIV or Xeon based system. The PIV and Xeon based systems are the intel analog of the above 2. Due to architectural design reasons, these chips tend to run a bit hotter and be a bit more expensive. Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them. In the end, we decided to go with option (2). Roughly speaking,

same-blog 5 0.84359425 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


6 0.56895661 229 hunch net-2007-01-26-Parallel Machine Learning Problems

7 0.56109643 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

8 0.55211228 95 hunch net-2005-07-14-What Learning Theory might do

9 0.5521 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

10 0.55034441 237 hunch net-2007-04-02-Contextual Scaling

11 0.54958266 19 hunch net-2005-02-14-Clever Methods of Overfitting

12 0.54914039 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

13 0.54814464 131 hunch net-2005-11-16-The Everything Ensemble Edge

14 0.54788786 306 hunch net-2008-07-02-Proprietary Data in Academic Research?

15 0.54713148 235 hunch net-2007-03-03-All Models of Learning have Flaws

16 0.54659522 233 hunch net-2007-02-16-The Forgetting

17 0.54526049 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

18 0.54478741 351 hunch net-2009-05-02-Wielding a New Abstraction

19 0.54424924 143 hunch net-2005-12-27-Automated Labeling

20 0.5421955 423 hunch net-2011-02-02-User preferences for search engines