hunch_net hunch_net-2006 hunch_net-2006-165 knowledge-graph by maker-knowledge-mining

165 hunch net-2006-03-23-The Approximation Argument


meta info for this blog

Source: html

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply. The idea of the Bayesian approach is simple, elegant, and general. Essentially, you first specify a prior P(D) over the possible processes D producing the data, observe the data x, and then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x). After this, hard decisions (such as “turn left” or “turn right”) are made by choosing the one which minimizes the expected loss with respect to the posterior. This basic idea is reused thousands of times with various choices of P(D) and loss function, which is unsurprising given its many nice properties, including an extremely strong associated guarantee: if the actual distribution generating the data is drawn from P(D), there is no better method.
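The Bayes-law recipe described above can be sketched concretely for a finite hypothesis set. The coin hypotheses, the likelihood, and the 0-1 loss below are illustrative assumptions, not taken from the post:

```python
# Sketch of the Bayesian recipe for a finite set of processes D.
# The coin hypotheses, likelihood, and 0-1 loss are illustrative
# assumptions, not from the post.

def posterior(prior, likelihood, x):
    """Bayes law: P(D|x) = P(x|D)P(D)/P(x) over a finite hypothesis set."""
    unnorm = {d: likelihood(x, d) * p for d, p in prior.items()}
    z = sum(unnorm.values())  # the evidence P(x)
    return {d: u / z for d, u in unnorm.items()}

def bayes_decision(post, loss, actions):
    """Pick the action minimizing expected loss under the posterior."""
    return min(actions, key=lambda a: sum(p * loss(a, d) for d, p in post.items()))

# Two candidate data-generating processes: a fair coin and a heads-biased coin.
prior = {"fair": 0.5, "biased": 0.5}

def likelihood(x, d):
    p_heads = 0.5 if d == "fair" else 0.9
    return p_heads if x == "heads" else 1.0 - p_heads

post = posterior(prior, likelihood, "heads")
action = bayes_decision(post, loss=lambda a, d: 0.0 if a == d else 1.0,
                        actions=["fair", "biased"])
```

Under 0-1 loss this reduces to picking the posterior mode; other loss functions plug into the same `bayes_decision` call unchanged, which is the modularity the post refers to.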


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. [sent-1, score-0.325]

2 This is a serious argument which deserves a serious reply. [sent-2, score-0.535]

3 The approximation argument is a serious reply for which I have not yet seen a reply. [sent-3, score-1.007]

4 The idea for the Bayesian approach is quite simple, elegant, and general. [sent-4, score-0.077]

5 The Bayesian method is a straightforward extension of the engineering method for designing a solution to a problem. [sent-8, score-0.354]

6 The three information sources are prior P(D) , data x , and loss, but loss only interacts with P(D) and x via the posterior P(D|x) . [sent-10, score-0.978]

7 The basic claim of the approximation argument is that approximation is unavoidable in all real-world problems that we care about. [sent-12, score-1.036]

8 There are several ways in which approximation necessarily invades applications of Bayes Law. [sent-13, score-0.383]

9 When specifying the prior, the number of bits needed to describe the “real” P(D) is typically too large. [sent-14, score-0.218]

10 What happens instead is that people take short-cuts specifying something which isn’t quite the real prior. [sent-16, score-0.308]

11 Even if the real P(D) is somehow specifiable, computing the posterior P(D|x) is often computationally intractable. [sent-17, score-0.584]

12 Again, the common short-cut is to alter the prior so as to make it computationally tractable. [sent-18, score-0.426]

13 (There are a few people who instead attempt to approximately compute the posterior via Monte Carlo methods.) [sent-19, score-0.682]

14 A “real” prior P(D) (according to some definitions) might involve a distribution over the placement of air molecules, the shape of the throat producing the sound, and what is being pronounced. [sent-21, score-0.709]

15 This prior might be both inarticulable (prior elicitation is nontrivial) and unrepresentable (because too many bits are required to store on a modern machine). [sent-22, score-0.532]

16 If the necessity of approximation is accepted, the question becomes “what do you do about it?” [sent-23, score-0.448]

17 Avoid approximation and work (or at least work a computer) very hard. [sent-26, score-0.383]

18 Use an approximate Bayesian method and leave a test set on the side to sanity check results. [sent-28, score-0.214]

19 Violate the modularity of loss and attempt to minimize approximation errors near the decision boundary. [sent-30, score-0.679]

20 There seems to be little deep understanding of the viability and universality of this approach but there are examples where this approach can provide significant benefits. [sent-31, score-0.154]
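Sentence 13 above mentions approximately computing the posterior via Monte Carlo. One common shape such an approximation takes is self-normalized importance sampling with the prior as the proposal; the sketch below is my illustration, with a made-up coin-bias example, not code from the post:

```python
import random

# Monte Carlo approximation of a posterior expectation when exact
# conditioning is intractable: draw hypotheses D from the prior and
# weight each by its likelihood P(x|D) (self-normalized importance
# sampling with the prior as proposal). Illustrative sketch only.

def mc_posterior_mean(sample_prior, likelihood, x, f, n=100_000, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        d = sample_prior(rng)
        w = likelihood(x, d)   # importance weight P(x|D)
        num += w * f(d)
        den += w
    return num / den           # approximates E[f(D) | x]

# Toy check: coin bias D drawn uniformly from {0.1, 0.5, 0.9}; observe heads.
biases = [0.1, 0.5, 0.9]
est = mc_posterior_mean(lambda rng: rng.choice(biases),
                        lambda x, d: d if x == "heads" else 1.0 - d,
                        "heads", f=lambda d: d)
```

For this toy prior the exact posterior mean is (0.01 + 0.25 + 0.81) / 1.5 ≈ 0.713, which the estimate approaches as n grows.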


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('approximation', 0.383), ('posterior', 0.305), ('prior', 0.273), ('bayesian', 0.215), ('argument', 0.195), ('reply', 0.151), ('loss', 0.138), ('method', 0.134), ('producing', 0.129), ('serious', 0.127), ('real', 0.126), ('specifying', 0.114), ('data', 0.11), ('bits', 0.104), ('turn', 0.1), ('bayes', 0.095), ('attempt', 0.093), ('extension', 0.086), ('deserves', 0.086), ('interacts', 0.086), ('shape', 0.086), ('specifiable', 0.086), ('computationally', 0.086), ('according', 0.081), ('store', 0.08), ('minimizes', 0.08), ('sanity', 0.08), ('placement', 0.08), ('approach', 0.077), ('thousands', 0.075), ('condition', 0.075), ('carlo', 0.075), ('elicitation', 0.075), ('monte', 0.075), ('unavoidable', 0.075), ('distribution', 0.072), ('unsurprising', 0.072), ('violate', 0.072), ('air', 0.069), ('fly', 0.069), ('reused', 0.069), ('nontrivial', 0.069), ('instead', 0.068), ('alter', 0.067), ('somehow', 0.067), ('via', 0.066), ('way', 0.065), ('necessity', 0.065), ('modularity', 0.065), ('elegant', 0.063)]
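The word scores above come from a tfidf model. The exact weighting scheme used by the pipeline that generated this page is not specified; a minimal sketch of tf-idf weighting plus cosine similarity, on made-up toy documents, looks like this:

```python
import math
from collections import Counter

# Illustrative tf-idf + cosine similarity sketch; the actual pipeline's
# weighting scheme is unknown, so this is an assumption.

def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc.split()))
    vecs = []
    for doc in docs:
        tf = Counter(doc.split())
        # term frequency times inverse document frequency
        vecs.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["bayesian prior posterior loss",
        "bayesian learning prior models",
        "tournament ordering players"]
vecs = tfidf_vectors(docs)
```

Documents sharing weighted terms score higher, which is why the Bayesian-themed posts dominate the similar-blogs lists below.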

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 165 hunch net-2006-03-23-The Approximation Argument


2 0.35744467 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the Mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

3 0.23978557 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

4 0.15926009 34 hunch net-2005-03-02-Prior, “Prior” and Bias

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt

5 0.15615965 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin

6 0.1554908 217 hunch net-2006-11-06-Data Linkage Problems

7 0.1550283 347 hunch net-2009-03-26-Machine Learning is too easy

8 0.15408714 237 hunch net-2007-04-02-Contextual Scaling

9 0.15272039 8 hunch net-2005-02-01-NIPS: Online Bayes

10 0.1517214 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

11 0.13986199 9 hunch net-2005-02-01-Watchword: Loss

12 0.13535511 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

13 0.13531291 160 hunch net-2006-03-02-Why do people count for learning?

14 0.12747557 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

15 0.12580542 12 hunch net-2005-02-03-Learning Theory, by assumption

16 0.1252507 259 hunch net-2007-08-19-Choice of Metrics

17 0.12298875 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

18 0.11949526 95 hunch net-2005-07-14-What Learning Theory might do

19 0.11929329 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

20 0.11640944 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.255), (1, 0.153), (2, 0.027), (3, -0.017), (4, -0.054), (5, -0.012), (6, -0.049), (7, 0.109), (8, 0.225), (9, -0.08), (10, 0.007), (11, -0.05), (12, 0.038), (13, -0.009), (14, 0.105), (15, -0.113), (16, -0.074), (17, -0.062), (18, 0.032), (19, -0.096), (20, 0.002), (21, -0.012), (22, 0.128), (23, -0.034), (24, 0.012), (25, 0.045), (26, 0.015), (27, 0.076), (28, 0.086), (29, 0.017), (30, 0.009), (31, -0.127), (32, 0.034), (33, 0.094), (34, 0.125), (35, -0.104), (36, 0.003), (37, -0.111), (38, -0.072), (39, 0.162), (40, 0.027), (41, 0.015), (42, 0.058), (43, 0.041), (44, 0.053), (45, -0.017), (46, 0.079), (47, -0.066), (48, 0.029), (49, -0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98434991 165 hunch net-2006-03-23-The Approximation Argument


2 0.84142643 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


3 0.79941356 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)


4 0.79613793 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

Introduction: A few weeks ago I read this . David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short I was thinking that “Bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996): consider a die (6 sides) and prior knowledge E[X]=3.5. Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A = “X is an odd number”. The Bayesian posterior P(X|A) ∝ P(A|X)P(X), which normalizes to (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0)!! (Note that E[Indicator function of A]=P(A).) Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have a 3.5
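The Bayesian half of the Kass die example can be checked numerically. This is my verification of the arithmetic in the quoted post, not code from it:

```python
from fractions import Fraction

# Kass die example, Bayesian side: uniform prior over six faces,
# evidence A = "X is an odd number".
prior = {x: Fraction(1, 6) for x in range(1, 7)}

def lik_A(x):          # P(A | X=x): indicator that x is odd
    return 1 if x % 2 == 1 else 0

unnorm = {x: lik_A(x) * p for x, p in prior.items()}
z = sum(unnorm.values())            # P(A) = 1/2
post = {x: u / z for x, u in unnorm.items()}
# post == {1: 1/3, 2: 0, 3: 1/3, 4: 0, 5: 1/3, 6: 0}, matching the text.
# Its mean is E[X | A] = 3, not 3.5, which is exactly where it parts ways
# with the MaxEnt solution that keeps the E[X]=3.5 constraint.
```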

5 0.66134405 217 hunch net-2006-11-06-Data Linkage Problems

Introduction: Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons. A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algo

6 0.64671987 235 hunch net-2007-03-03-All Models of Learning have Flaws

7 0.6417461 237 hunch net-2007-04-02-Contextual Scaling

8 0.62796187 34 hunch net-2005-03-02-Prior, “Prior” and Bias

9 0.62692857 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

10 0.61486965 160 hunch net-2006-03-02-Why do people count for learning?

11 0.59643334 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

12 0.54624832 39 hunch net-2005-03-10-Breaking Abstractions

13 0.53784692 253 hunch net-2007-07-06-Idempotent-capable Predictors

14 0.49993923 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

15 0.49683663 7 hunch net-2005-01-31-Watchword: Assumption

16 0.47844586 123 hunch net-2005-10-16-Complexity: It’s all in your head

17 0.46483177 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

18 0.46139133 16 hunch net-2005-02-09-Intuitions from applied learning

19 0.45339116 347 hunch net-2009-03-26-Machine Learning is too easy

20 0.44609016 95 hunch net-2005-07-14-What Learning Theory might do


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.019), (10, 0.019), (27, 0.23), (38, 0.068), (48, 0.018), (53, 0.085), (55, 0.078), (77, 0.307), (94, 0.083)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96568298 436 hunch net-2011-06-22-Ultra LDA

Introduction: Shravan and Alex’s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.

2 0.964791 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arc set” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is
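The order-by-wins heuristic mentioned at the end of that introduction is a one-liner; the tournament matrix below is a made-up example, not data from the post:

```python
# "Order players by number of wins" heuristic from the tournament
# discussion above. results[i][j] == 1 means player i beat player j;
# the 4-player tournament here is an illustrative example.

def order_by_wins(results):
    n = len(results)
    wins = [sum(results[i]) for i in range(n)]
    # strongest first; ties broken by player index for determinism
    return sorted(range(n), key=lambda i: (-wins[i], i))

# Player 0 beats everyone, 1 beats 2 and 3, 2 beats 3.
results = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
ranking = order_by_wins(results)
```

On a noise-free transitive tournament like this one the heuristic recovers the true ordering; the approximation guarantee discussed in the post concerns how badly it can degrade when outcomes are noisy.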

same-blog 3 0.93571371 165 hunch net-2006-03-23-The Approximation Argument


4 0.93164504 405 hunch net-2010-08-21-Rob Schapire at NYC ML Meetup

Introduction: I’ve been wanting to attend the NYC ML Meetup for some time and hope to make it next week on the 25th. Rob Schapire is talking about “Playing Repeated Games”, which in my experience is far more relevant to machine learning than the title might indicate.

5 0.89323491 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

6 0.84964895 388 hunch net-2010-01-24-Specializations of the Master Problem

7 0.8399415 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

8 0.83223587 375 hunch net-2009-10-26-NIPS workshops

9 0.77105802 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

10 0.73319542 392 hunch net-2010-03-26-A Variance only Deviation Bound

11 0.73006821 259 hunch net-2007-08-19-Choice of Metrics

12 0.7124303 235 hunch net-2007-03-03-All Models of Learning have Flaws

13 0.70370233 118 hunch net-2005-10-07-On-line learning of regular decision rules

14 0.70284069 237 hunch net-2007-04-02-Contextual Scaling

15 0.70231104 131 hunch net-2005-11-16-The Everything Ensemble Edge

16 0.6998589 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

17 0.69265777 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

18 0.69154578 258 hunch net-2007-08-12-Exponentiated Gradient

19 0.68586415 100 hunch net-2005-08-04-Why Reinforcement Learning is Important

20 0.68566614 293 hunch net-2008-03-23-Interactive Machine Learning