hunch_net hunch_net-2005 hunch_net-2005-8 knowledge-graph by maker-knowledge-mining

8 hunch net-2005-02-01-NIPS: Online Bayes


meta info for this blog

Source: html

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) ≤ L_Q(S) + KL(Q||P), where P is our prior and the losses L are: first, log-loss for the Bayes algorithm (run online) and second, expected log-loss with respect to an arbitrary distribution Q.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. [sent-1, score-0.732]

2 I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms . [sent-2, score-0.184]

3 From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. [sent-3, score-0.733]

4 the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. [sent-4, score-1.393]

5 We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). [sent-5, score-1.651]

6 It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. [sent-6, score-0.839]

7 Then for all sequences S: L_{Bayes}(S) ≤ L_Q(S) + KL(Q||P) where P is our prior and the losses L are: first, log-loss for the Bayes algorithm (run online) and second, expected log-loss with respect to an arbitrary distribution Q. [sent-7, score-0.596]
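
The bound L_{Bayes}(S) ≤ L_Q(S) + KL(Q||P) is easy to check numerically. Below is a minimal sketch for a toy model class of two Bernoulli coins; the biases, prior, sequence, and comparison distribution Q are all invented for illustration and are not taken from the paper.

```python
import math

# Model class: two coins with invented biases; P is a uniform prior.
thetas = [0.3, 0.7]
prior = [0.5, 0.5]
Q = [0.9, 0.1]                 # any comparison distribution over parameters
S = [1, 0, 1, 1, 0, 1, 1, 1]  # an arbitrary binary sequence

def log_lik(theta, s):
    # log p(S | theta) for i.i.d. Bernoulli(theta) observations
    return sum(math.log(theta if b else 1 - theta) for b in s)

# L_Bayes(S): cumulative online log loss of the Bayes predictor,
# which telescopes to -log of the marginal likelihood under the prior.
L_bayes = -math.log(sum(p * math.exp(log_lik(t, S)) for p, t in zip(prior, thetas)))

# L_Q(S): expected log loss of a parameter drawn from Q.
L_Q = sum(q * -log_lik(t, S) for q, t in zip(Q, thetas))

# KL(Q || P)
KL = sum(q * math.log(q / p) for q, p in zip(Q, prior) if q > 0)

assert L_bayes <= L_Q + KL    # the theorem's guarantee, for every S and every Q
```

The inequality follows from Jensen's inequality applied to the log of the prior-weighted likelihood, which is why it holds for every sequence, not just in expectation.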


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('online', 0.226), ('bayesian', 0.222), ('adversarial', 0.214), ('nice', 0.204), ('setting', 0.198), ('bayes', 0.195), ('provide', 0.161), ('bounds', 0.159), ('kl', 0.155), ('distribution', 0.149), ('beautiful', 0.142), ('learner', 0.142), ('philosophy', 0.142), ('sequences', 0.142), ('performance', 0.14), ('methodology', 0.137), ('model', 0.135), ('odds', 0.133), ('appeared', 0.133), ('kakade', 0.129), ('ng', 0.129), ('usually', 0.129), ('conditions', 0.126), ('minimal', 0.126), ('favorite', 0.123), ('algorithms', 0.121), ('competitive', 0.12), ('losses', 0.12), ('sham', 0.12), ('thoughts', 0.117), ('compare', 0.115), ('andrew', 0.113), ('study', 0.109), ('presented', 0.105), ('recent', 0.105), ('arbitrary', 0.102), ('discuss', 0.102), ('experts', 0.102), ('parameters', 0.1), ('enjoyed', 0.099), ('nature', 0.098), ('blog', 0.096), ('assumptions', 0.095), ('taken', 0.095), ('papers', 0.092), ('read', 0.088), ('theorem', 0.085), ('paper', 0.085), ('expected', 0.083), ('let', 0.082)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 8 hunch net-2005-02-01-NIPS: Online Bayes


2 0.2463226 252 hunch net-2007-07-01-Watchword: Online Learning

Introduction: It turns out that many different people use the term “Online Learning”, and often they don’t have the same definition in mind. Here’s a list of the possibilities I know of. Online Information Setting Online learning refers to a problem in which unlabeled data comes, a prediction is made, and then feedback is acquired. Online Adversarial Setting Online learning refers to algorithms in the Online Information Setting which satisfy guarantees of the form: “For all possible sequences of observations, the algorithm has regret at most log ( number of strategies) with respect to the best strategy in a set.” This is sometimes called online learning with experts. Online Optimization Constraint Online learning refers to optimizing a predictor via a learning algorithm that tunes parameters on a per-example basis. This may or may not be applied in the Online Information Setting, and the strategy may or may not satisfy Adversarial setting theory. Online Computational Constra

3 0.18834412 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe
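
The three-step recipe above can be sketched in a few lines. The model class (three possible coin biases) and the data are made up purely for illustration:

```python
# Step 1: a prior over (invented) world models -- coins with three possible biases.
models = [0.25, 0.5, 0.75]
prior = {m: 1.0 / len(models) for m in models}

def posterior(prior, data):
    # Step 2: integrate using Bayes law with respect to all observed flips.
    post = dict(prior)
    for b in data:
        for m in post:
            post[m] *= m if b else 1 - m
    Z = sum(post.values())
    return {m: v / Z for m, v in post.items()}

def predict(post):
    # Step 3: predict (probability the next flip is heads) under the posterior.
    return sum(m * p for m, p in post.items())
```

After observing several heads, the posterior concentrates on the high-bias model and the prediction shifts accordingly; a delta-function prior recovers the engineered-model extreme described above.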

4 0.16700713 267 hunch net-2007-10-17-Online as the new adjective

Introduction: Online learning is in vogue, which means we should expect to see in the near future: Online boosting. Online decision trees. Online SVMs. (actually, we’ve already seen) Online deep learning. Online parallel learning. etc… There are three fundamental drivers of this trend. Increasing size of datasets makes online algorithms attractive. Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning: The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can typically be faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note
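
The "first update online" idea can be sketched as single-pass online gradient descent; the squared-loss objective and learning rate below are illustrative choices, not from the post:

```python
def ogd(stream, dim, lr=0.1):
    # Single-pass online gradient descent for a linear predictor under squared
    # loss: update immediately after each example, not after a full batch pass.
    w = [0.0] * dim
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        g = 2.0 * (pred - y)                 # gradient of (pred - y)^2 w.r.t. pred
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w
```

Each example is consumed once and discarded, which is exactly what makes the online approach attractive as datasets grow.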

5 0.16155382 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of
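
The interaction loop itself is simple to write down, even though no effective general agent is known. Here is a toy instance with an invented world (action 1 always pays 1) and a naive epsilon-greedy agent, purely to illustrate the observe/act/reward protocol:

```python
import random

class World:
    # Invented toy world: the observation is uninformative, action 1 always pays 1.
    def observe(self):
        return 0
    def reward(self, x, a):
        return float(a == 1)

class GreedyAgent:
    # Naive epsilon-greedy agent: track average reward per action, mostly exploit.
    def __init__(self, n_actions=2, eps=0.1):
        self.sums = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.eps = eps
    def act(self, x):
        if 0 in self.counts or random.random() < self.eps:
            return random.randrange(len(self.counts))   # explore
        means = [s / c for s, c in zip(self.sums, self.counts)]
        return means.index(max(means))                  # exploit
    def learn(self, x, a, r):
        self.sums[a] += r
        self.counts[a] += 1

def run_episode(agent, world, horizon=100):
    # The master problem's protocol: observe x, choose a, receive reward r.
    total = 0.0
    for _ in range(horizon):
        x = world.observe()
        a = agent.act(x)
        r = world.reward(x, a)
        agent.learn(x, a, r)
        total += r
    return total
```

This world is easy because rewards are related to actions in a simple way; the lower bounds mentioned above come from worlds where no such relation exists.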

6 0.15562496 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

7 0.15341231 127 hunch net-2005-11-02-Progress in Active Learning

8 0.15272039 165 hunch net-2006-03-23-The Approximation Argument

9 0.15162225 12 hunch net-2005-02-03-Learning Theory, by assumption

10 0.1503589 258 hunch net-2007-08-12-Exponentiated Gradient

11 0.13835658 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

12 0.13749221 28 hunch net-2005-02-25-Problem: Online Learning

13 0.13695061 309 hunch net-2008-07-10-Interesting papers, ICML 2008

14 0.13550368 34 hunch net-2005-03-02-Prior, “Prior” and Bias

15 0.13054883 186 hunch net-2006-06-24-Online convex optimization at COLT

16 0.13005371 235 hunch net-2007-03-03-All Models of Learning have Flaws

17 0.12639891 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

18 0.12567431 188 hunch net-2006-06-30-ICML papers

19 0.12461719 133 hunch net-2005-11-28-A question of quantification

20 0.11893846 347 hunch net-2009-03-26-Machine Learning is too easy


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.248), (1, 0.144), (2, 0.07), (3, -0.07), (4, 0.102), (5, -0.014), (6, -0.067), (7, -0.123), (8, 0.14), (9, 0.07), (10, 0.25), (11, 0.027), (12, 0.069), (13, -0.105), (14, 0.149), (15, -0.036), (16, -0.016), (17, -0.071), (18, 0.087), (19, -0.112), (20, -0.02), (21, -0.118), (22, 0.028), (23, -0.134), (24, 0.128), (25, 0.053), (26, 0.019), (27, 0.022), (28, 0.051), (29, 0.047), (30, 0.067), (31, 0.025), (32, 0.009), (33, -0.002), (34, 0.057), (35, -0.027), (36, 0.066), (37, 0.002), (38, 0.02), (39, -0.039), (40, 0.018), (41, 0.045), (42, -0.02), (43, -0.038), (44, 0.014), (45, -0.095), (46, 0.038), (47, 0.042), (48, -0.009), (49, 0.036)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98594779 8 hunch net-2005-02-01-NIPS: Online Bayes


2 0.70372909 252 hunch net-2007-07-01-Watchword: Online Learning


3 0.69225442 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

Introduction: Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made. Online learning (as meant here), is a subfield of learning theory which analyzes the online learning model. In the online learning model, there are a set of hypotheses or “experts”. On any instance x , each expert makes a prediction y . A master algorithm A uses these predictions to form its own prediction y A and then learns the correct prediction y * . This process repeats. The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove: L(A) les
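
A minimal sketch of such a master algorithm is exponentiated weights (Hedge); with log loss and learning rate 1 it coincides with the Bayes mixture over experts, so its cumulative loss is at most min_e L(e) + ln(number of experts) for every sequence. The experts and outcomes below are invented for illustration:

```python
import math

def hedge(expert_preds, outcomes, eta=1.0):
    # expert_preds[t][i]: expert i's predicted probability that outcome t is 1.
    n = len(expert_preds[0])
    w = [1.0] * n                    # uniform initial weights (uniform prior)
    total_loss = 0.0
    expert_loss = [0.0] * n
    for preds, y in zip(expert_preds, outcomes):
        Z = sum(w)
        p = sum(wi * pi for wi, pi in zip(w, preds)) / Z   # master's prediction
        total_loss += -math.log(p if y == 1 else 1 - p)    # log loss
        for i, pi in enumerate(preds):
            li = -math.log(pi if y == 1 else 1 - pi)
            expert_loss[i] += li
            w[i] *= math.exp(-eta * li)                    # exponential reweighting
    return total_loss, expert_loss
```

For any sequence of outcomes, total_loss is within ln(n) of the best expert's loss, which is exactly the kind of accountability guarantee described above.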

4 0.64723176 267 hunch net-2007-10-17-Online as the new adjective


5 0.63943005 133 hunch net-2005-11-28-A question of quantification

Introduction: This is about methods for phrasing and thinking about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same. For all sequences of examples . This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.” For all training sets . This is the standard quantification for boosting analysis such as adaboost or multiclass boosting . Standard theorems have the form “for all training sets the error rate inequalities … hold”. For all distributions over examples . This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”. It is not quite true that each of these is equivalent. F

6 0.62782544 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

7 0.6109755 34 hunch net-2005-03-02-Prior, “Prior” and Bias

8 0.59235859 28 hunch net-2005-02-25-Problem: Online Learning

9 0.58544171 385 hunch net-2009-12-27-Interesting things at NIPS 2009

10 0.58372641 309 hunch net-2008-07-10-Interesting papers, ICML 2008

11 0.56715149 378 hunch net-2009-11-15-The Other Online Learning

12 0.5469414 163 hunch net-2006-03-12-Online learning or online preservation of learning?

13 0.54226005 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

14 0.54114991 188 hunch net-2006-06-30-ICML papers

15 0.5389325 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

16 0.53374761 186 hunch net-2006-06-24-Online convex optimization at COLT

17 0.50255096 12 hunch net-2005-02-03-Learning Theory, by assumption

18 0.50217628 308 hunch net-2008-07-06-To Dual or Not

19 0.49637112 101 hunch net-2005-08-08-Apprenticeship Reinforcement Learning for Control

20 0.49614167 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.228), (9, 0.081), (27, 0.363), (38, 0.02), (53, 0.067), (94, 0.054), (95, 0.084)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93936574 8 hunch net-2005-02-01-NIPS: Online Bayes


2 0.91426176 76 hunch net-2005-05-29-Bad ideas

Introduction: I found these two essays on bad ideas interesting. Neither of these is written from the viewpoint of research, but they are both highly relevant. Why smart people have bad ideas by Paul Graham Why smart people defend bad ideas by Scott Berkun (which appeared on slashdot ) In my experience, bad ideas are common and overconfidence in ideas is common. This overconfidence can take either the form of excessive condemnation or excessive praise. Some of this is necessary to the process of research. For example, some overconfidence in the value of your own research is expected and probably necessary to motivate your own investigation. Since research is a rather risky business, much of it does not pan out. Learning to accept when something does not pan out is a critical skill which is sometimes never acquired. Excessive condemnation can be a real ill when it’s encountered. This has two effects: When the penalty for being wrong is too large, it means people have a

3 0.90306515 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

Introduction: This post is about a technology which could develop in the future. Right now, a new drug might be tested by finding patients with some diagnosis and giving or not giving them a drug according to a secret randomization. The outcome is observed, and if the average outcome for those treated is measurably better than the average outcome for those not treated, the drug might become a standard treatment. Generalizing this, a filter F sorts people into two groups: those for treatment (group A) and those not for treatment (group B), based upon observations x . To measure the outcome, you randomize between treatment and nontreatment of group A and measure the relative performance of the treatment. A problem often arises: in many cases the treated group does not do better than the nontreated group. A basic question is: does this mean the treatment is bad? With respect to the filter F it may mean that, but with respect to another filter F’ , the treatment might be very effective. For exampl
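
The protocol can be sketched as follows; the outcome model and additive treatment effect are invented purely to show the mechanics of randomizing within the filtered group:

```python
import random

def trial(patients, F, effect, seed=0):
    # Randomize treatment within the filtered group A and compare mean outcomes.
    # Outcome model (invented): baseline severity plus an additive treatment effect.
    random.seed(seed)
    treated, control = [], []
    for p in patients:
        if not F(p):
            continue                       # only group A enters the comparison
        if random.random() < 0.5:
            treated.append(p["severity"] + effect)
        else:
            control.append(p["severity"])
    mean = lambda xs: sum(xs) / len(xs)
    return mean(treated) - mean(control)   # estimated treatment effect
```

Running the same comparison with a different filter F' changes which patients enter the randomization, which is exactly how a treatment can look ineffective under one filter and effective under another.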

4 0.86304682 39 hunch net-2005-03-10-Breaking Abstractions

Introduction: Sam Roweis ‘s comment reminds me of a more general issue that comes up in doing research: abstractions always break. Real numbers aren’t. Most real numbers can not be represented with any machine. One implication of this is that many real-number based algorithms have difficulties when implemented with floating point numbers. The box on your desk is not a Turing machine. A Turing machine can compute anything computable, given sufficient time. A typical computer fails terribly when the state required for the computation exceeds some limit. Nash equilibria aren’t equilibria. This comes up when trying to predict human behavior based on the result of the equilibria computation. Often, it doesn’t work. The probability isn’t. Probability is an abstraction expressing either our lack of knowledge (the Bayesian viewpoint) or fundamental randomization (the frequentist viewpoint). From the frequentist viewpoint the lack of knowledge typically precludes actually knowing the fu

5 0.84222335 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c:X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) ≠ y) . The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s:X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2 . It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x) , then threshold at 0.5 . For this simple reduction a squared error regret of r
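
The threshold-at-0.5 reduction mentioned above can be sketched directly; the "regressor" here is a trivial per-x empirical average standing in for an estimate of D(y=1|x), just to show the mechanics (a real reduction would plug in an actual regression learner):

```python
from collections import defaultdict

def fit_regressor(data):
    # Trivial regressor: the empirical average of y for each seen x,
    # i.e. an estimate of D(y=1|x) on the training distribution.
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

def classify(regressor, x):
    # The reduction: threshold the regressed probability at 0.5.
    return 1 if regressor.get(x, 0.5) >= 0.5 else 0
```

Whenever the regressed estimate lands on the correct side of 0.5, the induced classifier agrees with the Bayes-optimal prediction, which is the sense in which regression accuracy transfers to classification accuracy.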

6 0.83205158 483 hunch net-2013-06-10-The Large Scale Learning class notes

7 0.82320535 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

8 0.82188457 9 hunch net-2005-02-01-Watchword: Loss

9 0.82039237 352 hunch net-2009-05-06-Machine Learning to AI

10 0.82023621 400 hunch net-2010-06-13-The Good News on Exploration and Learning

11 0.82010049 45 hunch net-2005-03-22-Active learning

12 0.81851673 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

13 0.81727642 360 hunch net-2009-06-15-In Active Learning, the question changes

14 0.81595713 67 hunch net-2005-05-06-Don’t mix the solution into the problem

15 0.81368339 172 hunch net-2006-04-14-JMLR is a success

16 0.81329197 258 hunch net-2007-08-12-Exponentiated Gradient

17 0.81236285 308 hunch net-2008-07-06-To Dual or Not

18 0.81015325 41 hunch net-2005-03-15-The State of Tight Bounds

19 0.80968732 12 hunch net-2005-02-03-Learning Theory, by assumption

20 0.80812782 14 hunch net-2005-02-07-The State of the Reduction