
60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


meta info for this blog

Source: html

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the Mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe…
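As a concrete illustration of the three-step recipe above, here is a minimal sketch of Bayesian learning over a finite set of candidate world models (coin biases). The model grid, prior, and observations are invented placeholders rather than anything from the post; the commented-out delta-function prior shows the pure-engineering extreme the post describes.

```python
# Minimal sketch of: specify a prior, integrate via Bayes law, predict.
import numpy as np

models = np.linspace(0.0, 1.0, 101)          # candidate world models: coin biases
prior = np.ones_like(models) / len(models)   # step 1: specify a prior over world models

# A delta-function prior on a single model recovers the pure-engineering
# extreme described above: all belief sits on one hand-built world model.
# prior = (models == 0.5).astype(float)

observations = [1, 1, 0, 1]                  # observed coin flips (1 = heads)

# Step 2: integrate the observations via Bayes law to compute a posterior.
posterior = prior.copy()
for x in observations:
    likelihood = models if x == 1 else 1.0 - models
    posterior = posterior * likelihood
posterior /= posterior.sum()

# Step 3: predict according to the posterior (posterior predictive of heads).
p_heads = float(np.sum(posterior * models))
print(f"P(next flip = heads | data) = {p_heads:.3f}")
```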


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. [sent-2, score-0.489]

2 Integrate using Bayes law with respect to all observed information to compute a posterior over world models. [sent-3, score-0.441]

3 Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. [sent-5, score-0.56]

4 When creating an engineered system, you build a model of the world and then find a good controller in that model. [sent-9, score-0.256]

5 Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. [sent-10, score-0.72]

6 What this means is that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. [sent-11, score-0.731]

7 Language Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. [sent-13, score-0.537]

8 This is significantly helpful when working on the “think harder” part of a solution. [sent-14, score-0.131]

9 Intuitions Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. [sent-15, score-0.696]

10 Information theoretically infeasible It turns out that specifying a prior is extremely difficult. [sent-19, score-0.8]

11 Roughly speaking, we must specify a real number for every setting of the world model parameters. [sent-20, score-0.519]

12 Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons: They know languages allowing more compact specification of priors. [sent-21, score-0.408]

13 They don’t specify their actual prior, but rather one which is convenient. [sent-24, score-0.201]

14 Computationally infeasible Let’s suppose I could accurately specify a prior over every air molecule in a room. [sent-26, score-0.834]

15 Even then, computing a posterior may be extremely difficult. [sent-27, score-0.276]

16 This difficulty implies that computational approximation is required. [sent-28, score-0.078]

17 It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. [sent-30, score-0.146]

18 (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.) [sent-31, score-0.369]

19 Overall, if a learning problem must be solved, a Bayesian should probably be working on it and has a good chance of solving it. [sent-32, score-0.122]

20 I wish I knew whether or not the drawbacks can be convincingly addressed. [sent-33, score-0.133]
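To make the two infeasibility points in the summary above more concrete (a fully tabular prior needs one number per parameter setting, and exact posterior computation sums over that same table), here is a back-of-the-envelope sketch; the parameter counts are invented purely for illustration.

```python
def tabular_prior_size(n_params: int, settings_per_param: int) -> int:
    """Real numbers a fully explicit prior table would need: one per joint setting."""
    return settings_per_param ** n_params

for n_params in (10, 30, 100):
    size = tabular_prior_size(n_params, settings_per_param=2)
    print(f"{n_params:>3} binary parameters -> {float(size):.3e} prior entries")

# Exact Bayesian integration sums over the same exponentially large table,
# which is why computational approximation (MCMC, variational methods, ...)
# is required in practice.
```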


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('bayesian', 0.53), ('prior', 0.269), ('specifying', 0.224), ('specify', 0.201), ('posterior', 0.18), ('harder', 0.179), ('interpolate', 0.17), ('world', 0.159), ('infeasible', 0.149), ('priors', 0.131), ('intuitions', 0.124), ('advantages', 0.108), ('compute', 0.102), ('methods', 0.099), ('model', 0.097), ('extremely', 0.096), ('delta', 0.085), ('accurately', 0.085), ('counter', 0.085), ('employment', 0.085), ('engineers', 0.085), ('interpolation', 0.085), ('signficant', 0.085), ('language', 0.083), ('employed', 0.079), ('automation', 0.079), ('puts', 0.079), ('recipe', 0.079), ('difficulty', 0.078), ('zoubin', 0.074), ('likes', 0.074), ('activities', 0.074), ('shouldn', 0.074), ('convincingly', 0.071), ('acquiring', 0.071), ('compact', 0.071), ('part', 0.07), ('think', 0.07), ('air', 0.068), ('universally', 0.068), ('specification', 0.066), ('languages', 0.066), ('notice', 0.066), ('integrate', 0.064), ('every', 0.062), ('theoretically', 0.062), ('knew', 0.062), ('working', 0.061), ('learning', 0.061), ('integration', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


2 0.35744467 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
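As a toy illustration of the recipe in this excerpt (specify a prior, condition via Bayes law, then take the hard decision minimizing expected posterior loss), here is a small sketch; the two candidate processes, their likelihoods, and the loss table are invented placeholders.

```python
# Candidate data-generating processes D with prior P(D) and likelihood P(x|D)
# for one observed x. All numbers are invented placeholders.
processes = {
    "D1": {"prior": 0.5, "p_x_given_d": 0.9},
    "D2": {"prior": 0.5, "p_x_given_d": 0.2},
}

# Posterior via Bayes law: P(D|x) = P(x|D) P(D) / P(x).
p_x = sum(d["prior"] * d["p_x_given_d"] for d in processes.values())
posterior = {name: d["prior"] * d["p_x_given_d"] / p_x for name, d in processes.items()}

# Loss of each hard decision under each process (also invented).
loss = {
    "turn left":  {"D1": 0.0, "D2": 1.0},
    "turn right": {"D1": 1.0, "D2": 0.0},
}

# Choose the decision minimizing expected loss under the posterior.
expected_loss = {
    action: sum(posterior[name] * loss[action][name] for name in processes)
    for action in loss
}
decision = min(expected_loss, key=expected_loss.get)
print(posterior, expected_loss, decision)
```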

3 0.24044935 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

4 0.20711625 95 hunch net-2005-07-14-What Learning Theory might do

Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a

5 0.18834412 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) ≤ L_Q(S)
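To make the setting in this excerpt concrete, here is a toy sketch of a Bayesian mixture predicting bits under log loss, with its cumulative loss compared against the best single model in the class. The model class and the bit sequence are invented, and this only illustrates the setting, not the paper's exact algorithm or bound.

```python
import numpy as np

models = np.array([0.1, 0.5, 0.9])           # each model predicts P(next bit = 1)
weights = np.ones(len(models)) / len(models) # prior over the model class
sequence = [1, 1, 0, 1, 1, 1, 0, 1]          # an arbitrary observed bit sequence

bayes_loss = 0.0
model_loss = np.zeros(len(models))

for bit in sequence:
    probs = models if bit == 1 else 1.0 - models   # each model's P(observed bit)
    bayes_loss += -np.log(float(weights @ probs))  # log loss of the Bayesian mixture
    model_loss += -np.log(probs)                   # log loss of each single model
    weights = weights * probs                      # Bayes-law update of the weights
    weights /= weights.sum()

print(f"Bayesian mixture log loss: {bayes_loss:.3f}")
print(f"best single model's loss:  {model_loss.min():.3f}")
```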

6 0.16482638 34 hunch net-2005-03-02-Prior, “Prior” and Bias

7 0.16455606 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

8 0.16370802 237 hunch net-2007-04-02-Contextual Scaling

9 0.15996361 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

10 0.13470075 5 hunch net-2005-01-26-Watchword: Probability

11 0.13017036 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

12 0.12485612 16 hunch net-2005-02-09-Intuitions from applied learning

13 0.11081854 289 hunch net-2008-02-17-The Meaning of Confidence

14 0.11000811 347 hunch net-2009-03-26-Machine Learning is too easy

15 0.10998069 39 hunch net-2005-03-10-Breaking Abstractions

16 0.10020778 160 hunch net-2006-03-02-Why do people count for learning?

17 0.097924292 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

18 0.097430393 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design

19 0.09627381 217 hunch net-2006-11-06-Data Linkage Problems

20 0.09549056 253 hunch net-2007-07-06-Idempotent-capable Predictors


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.221), (1, 0.108), (2, -0.031), (3, 0.031), (4, 0.004), (5, -0.051), (6, 0.003), (7, 0.098), (8, 0.254), (9, -0.081), (10, 0.003), (11, -0.138), (12, 0.032), (13, -0.03), (14, 0.189), (15, -0.095), (16, -0.069), (17, -0.076), (18, 0.111), (19, -0.15), (20, -0.034), (21, -0.02), (22, 0.148), (23, -0.043), (24, 0.014), (25, 0.074), (26, -0.012), (27, 0.02), (28, 0.036), (29, 0.096), (30, 0.037), (31, -0.038), (32, 0.113), (33, 0.044), (34, 0.064), (35, -0.024), (36, -0.059), (37, -0.169), (38, -0.075), (39, 0.169), (40, 0.008), (41, -0.044), (42, 0.001), (43, -0.031), (44, 0.076), (45, -0.02), (46, 0.108), (47, -0.092), (48, -0.02), (49, -0.005)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96979827 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


2 0.82299626 165 hunch net-2006-03-23-The Approximation Argument


3 0.80438548 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

Introduction: A few weeks ago I read this . David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short I was thinking that “bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996): consider a Die (6 sides), consider prior knowledge E[X]=3.5. Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A=”X is an odd number” Bayesian posterior P(X|A)= P(A|X) P(X) = (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0) !! (note that E[Indicator function of A]=P(A)) Indeed, for MaxEnt, because there is no more ‘6′, big numbers must be more probable to ensure an average of 3.5. For bayesian updating, P(X|A) doesn’t have to have a 3.5
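For readers who want to check the numbers in this excerpt, here is a small reconstruction: Bayesian conditioning of the uniform prior on “X is odd” versus maximum entropy over the odd faces subject to E[X] = 3.5. The exponential-family form and the bisection over its parameter are my own sketch of the standard MaxEnt solution, not code from the post.

```python
import numpy as np

faces = np.arange(1, 7)
uniform = np.ones(6) / 6.0                   # the MaxEnt prior with E[X] = 3.5

# Bayesian update on A = "X is odd": zero out even faces and renormalize.
odd_mask = (faces % 2 == 1).astype(float)
bayes_posterior = uniform * odd_mask
bayes_posterior /= bayes_posterior.sum()     # -> (1/3, 0, 1/3, 0, 1/3, 0)

# MaxEnt over the odd faces {1, 3, 5} with E[X] = 3.5 has the form
# p(x) proportional to exp(lam * x); find lam by bisection on the mean.
odd_faces = faces[faces % 2 == 1]

def mean_for(lam: float) -> float:
    w = np.exp(lam * odd_faces)
    return float((w * odd_faces).sum() / w.sum())

lo, hi = -10.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_for(mid) < 3.5 else (lo, mid)
w = np.exp(lo * odd_faces)
maxent = w / w.sum()                         # roughly (0.22, 0.32, 0.47) on 1, 3, 5

print("Bayes posterior on 1..6:", np.round(bayes_posterior, 3))
print("MaxEnt on the odd faces:", np.round(maxent, 3))
```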

4 0.74863178 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

Introduction: I have recently completed a 500+ page-book on MDL , the first comprehensive overview of the field (yes, this is a sneak advertisement ). Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are: In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron , MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes! In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by somethin

5 0.65878075 34 hunch net-2005-03-02-Prior, “Prior” and Bias

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt

6 0.63576555 39 hunch net-2005-03-10-Breaking Abstractions

7 0.58626676 237 hunch net-2007-04-02-Contextual Scaling

8 0.58322847 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.54685324 160 hunch net-2006-03-02-Why do people count for learning?

10 0.53776085 95 hunch net-2005-07-14-What Learning Theory might do

11 0.52452332 123 hunch net-2005-10-16-Complexity: It’s all in your head

12 0.50979573 5 hunch net-2005-01-26-Watchword: Probability

13 0.49687639 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

14 0.49324712 217 hunch net-2006-11-06-Data Linkage Problems

15 0.4820964 253 hunch net-2007-07-06-Idempotent-capable Predictors

16 0.47875872 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

17 0.46088216 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

18 0.45298597 135 hunch net-2005-12-04-Watchword: model

19 0.45129463 16 hunch net-2005-02-09-Intuitions from applied learning

20 0.41942492 8 hunch net-2005-02-01-NIPS: Online Bayes


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.19), (38, 0.069), (48, 0.01), (53, 0.17), (55, 0.049), (77, 0.098), (78, 0.235), (94, 0.062), (95, 0.02)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.85992074 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


2 0.79763925 297 hunch net-2008-04-22-Taking the next step

Introduction: At the last ICML , Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now. The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has setup a wiki , COLT has started using Joomla , with some dynamic content, and AAAI has been setting up a “ student blog “. Similarly, Dinoj Surendran setup a twiki for the Chicago Machine Learning Summer School , which was quite useful for coordinating events and other things. I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers pag

3 0.79639095 316 hunch net-2008-09-04-Fall ML Conferences

Introduction: If you are in the New York area and interested in machine learning, consider submitting a 2 page abstract to the ML symposium by tomorrow (Sept 5th) midnight. It’s a fun one day affair on October 10 in an awesome location overlooking the world trade center site. A bit further off (but a real conference) is the AI and Stats deadline on November 5, to be held in Florida April 16-19.

4 0.78000033 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

Introduction: I just released Vowpal Wabbit 6.0 . Since the last version: VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh . Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs. The new matrix factoriz
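As a rough illustration of the allreduce pattern described above (each node contributes a local value, every node receives the global sum), here is a single-process simulation; the node count and the “local gradients” are invented, and this is not Vowpal Wabbit's implementation.

```python
import numpy as np

def allreduce_sum(local_values):
    """Every simulated node contributes a vector; every node gets back the global sum."""
    total = np.sum(local_values, axis=0)          # reduce
    return [total.copy() for _ in local_values]   # broadcast

# Example: 4 simulated nodes, each holding a local gradient over 3 weights.
rng = np.random.default_rng(0)
local_gradients = [rng.normal(size=3) for _ in range(4)]
synced = allreduce_sum(local_gradients)

# After allreduce every node holds the same global sum, so each can apply the
# identical averaged update -- the property that makes distributed LBFGS,
# conjugate gradient, or averaged online learning easy to layer on top.
update = synced[0] / len(local_gradients)
print("shared update direction:", np.round(update, 3))
```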

5 0.75931633 164 hunch net-2006-03-17-Multitask learning is Black-Boxable

Introduction: Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner. The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: (x, y_1, y_2) in S where y_1 and y_2 are labels for two different tasks. Then, we could reprocess the data to the form S_b(S) = {((x,i), y_i) : (x, y_1, y_2) in S, i in {1,2}} and then learn a classifier c: X x {1,2} -> Y. Note that (x,i) is the (composite) input. At testing time, given an input x, we can query c for the predicted values of y_1 and y_2 using (x,1) and (x,2). A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm ML which learns a multitask
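The reduction in this excerpt is easy to write down directly. Below is a minimal sketch: fold the task index into the input, train a single classifier on (x, i) -> y_i, and query it once per task at test time. The tiny dataset and the scikit-learn classifier are illustrative stand-ins, not part of the original post.

```python
from sklearn.linear_model import LogisticRegression

# Multitask examples (x, y_1, y_2); x is a single feature here.
S = [([0.0], 0, 0), ([1.0], 0, 1), ([2.0], 1, 1), ([3.0], 1, 1)]

# Reprocess: S_b(S) = {((x, i), y_i) : (x, y_1, y_2) in S, i in {1, 2}}.
X, y = [], []
for x, y1, y2 in S:
    for i, label in ((1, y1), (2, y2)):
        X.append(x + [i])          # composite input (x, i)
        y.append(label)

# One classifier c : X x {1, 2} -> Y on the composite inputs.
c = LogisticRegression().fit(X, y)

# At test time, query c with (x, 1) and (x, 2) to recover both predictions.
x_test = [1.5]
pred_y1 = c.predict([x_test + [1]])[0]
pred_y2 = c.predict([x_test + [2]])[0]
print("predicted y_1:", pred_y1, " predicted y_2:", pred_y2)
```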

6 0.72027618 165 hunch net-2006-03-23-The Approximation Argument

7 0.70649177 131 hunch net-2005-11-16-The Everything Ensemble Edge

8 0.70372486 201 hunch net-2006-08-07-The Call of the Deep

9 0.68615842 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

10 0.68551296 388 hunch net-2010-01-24-Specializations of the Master Problem

11 0.68534762 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

12 0.6824522 269 hunch net-2007-10-24-Contextual Bandits

13 0.68174517 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

14 0.68163103 19 hunch net-2005-02-14-Clever Methods of Overfitting

15 0.68160993 21 hunch net-2005-02-17-Learning Research Programs

16 0.68143564 6 hunch net-2005-01-27-Learning Complete Problems

17 0.68043792 12 hunch net-2005-02-03-Learning Theory, by assumption

18 0.67688036 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.67634839 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

20 0.67563772 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class