hunch_net hunch_net-2008 hunch_net-2008-317 knowledge-graph by maker-knowledge-mining

317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?


meta info for this blog

Source: html

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a
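To make the interaction loop above concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than something taken from the post: the hidden linear reward model, the value of epsilon, and the fact that the learner keeps context-free value estimates (a real contextual policy would condition its choice on x). What it does show is the defining constraint of the setting: only the reward of the chosen action is ever revealed.

```python
import numpy as np

# Illustrative contextual-bandit interaction loop (assumptions: a hidden linear
# reward model and a simple epsilon-greedy learner with context-free estimates).
rng = np.random.default_rng(0)
k, d, T, epsilon = 10, 5, 10000, 0.1    # actions, feature dim, rounds, exploration rate
true_weights = rng.normal(size=(k, d))  # hidden per-action reward weights

counts = np.zeros(k)            # how often each action was tried
value_estimates = np.zeros(k)   # running mean reward per action (ignores context)

total_reward = 0.0
for t in range(T):
    x = rng.normal(size=d)            # world chooses features x ...
    all_rewards = true_weights @ x    # ... and rewards r_1, ..., r_k (kept hidden)
    if rng.random() < epsilon:
        a = int(rng.integers(k))      # explore: random action
    else:
        a = int(np.argmax(value_estimates))  # exploit current estimates
    r_a = all_rewards[a]              # only the chosen action's reward is revealed
    counts[a] += 1
    value_estimates[a] += (r_a - value_estimates[a]) / counts[a]
    total_reward += r_a

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

Roughly speaking, the need to spend exploration effort across all k actions is where the poor dependence on k in the bounds discussed below comes from.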


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards). [sent-1, score-0.99]

2 The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. [sent-3, score-0.512]

3 I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. [sent-4, score-0.432]

4 One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. [sent-5, score-0.664]

5 For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}. [sent-6, score-0.541]

6 Epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. [sent-7, score-1.199]

7 The fact that all analyses have poor dependence on k is troublesome. [sent-9, score-0.667]

8 The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem. [sent-10, score-0.352]

9 In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. [sent-11, score-1.12]

10 For example, if we do empirical risk minimization over a finite hypothesis space H, the dependence is at most ln |H| using an Occam’s Razor bound (a standard form of this bound is written out after this list). [sent-12, score-0.476]

11 Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. [sent-13, score-0.476]

12 This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems. [sent-14, score-0.607]

13 At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one. [sent-16, score-0.364]

14 For example, consider the following problem(*): “Find an action with average reward greater than 0.99” and consider two algorithms. [sent-20, score-0.512]

15 (a) Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward. [sent-22, score-0.417]

16 (b) Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward return it; otherwise pick another action randomly and repeat (a toy sketch of this strategy appears after this list). [sent-23, score-0.968]

17 When there are 10^10 actions and 10^9 of them have average reward 0. [sent-24, score-0.616]

18 Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other. [sent-29, score-0.953]

19 Here the idea is that the values of actions are generated recursively, preserving structure through the recursion. [sent-31, score-0.435]

20 Basic questions: Are there other kinds of natural structure which allow a good dependence on the total number of actions? [sent-32, score-0.685]
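For reference, the Occam’s Razor bound mentioned in sentence 10 can be stated as follows. This is the standard finite-class form under the usual i.i.d., bounded-loss assumptions; the notation (n samples, confidence parameter delta) is chosen here for the write-up rather than taken from the post.

```latex
% Standard Occam's Razor / finite-class bound (i.i.d. samples, losses in [0,1]).
% With probability at least 1 - \delta over n samples, simultaneously for all h in H:
\[
  \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h)
    + \sqrt{\frac{\ln |H| + \ln (1/\delta)}{2n}},
\]
% so the number of hypotheses enters only through \ln |H|, which is the weak
% dependence contrasted with the poor dependence on k in the bandit setting.
```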
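Sentences 15–17 describe a strategy of certifying a single randomly chosen action with a Hoeffding bound. Below is a toy Python sketch of that idea. The environment, the acceptance threshold, and the sample size are made-up parameters scaled down from the post’s 0.99 example so that a 100-sample Hoeffding certificate can actually fire; it is an illustration of the strategy, not the post’s algorithm.

```python
import math
import random

def hoeffding_radius(n, delta):
    # For n i.i.d. samples of a [0, 1]-bounded reward, the empirical mean is
    # within this radius of the true mean with probability at least 1 - delta.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def find_good_action(sample_reward, k, threshold, n=100, delta=1e-3):
    # Strategy from the summary: pick an action at random, sample it n times,
    # and return it only if the Hoeffding lower bound certifies a high mean.
    while True:
        a = random.randrange(k)
        mean = sum(sample_reward(a) for _ in range(n)) / n
        if mean - hoeffding_radius(n, delta) > threshold:
            return a

# Hypothetical environment: 10% of actions always pay 1, the rest pay little,
# so a random pick is certified after roughly 10 attempts of n samples each.
k = 10**6
good = set(random.sample(range(k), k // 10))
sample_reward = lambda a: 1.0 if a in good else 0.5 * random.random()

print("certified action:", find_good_action(sample_reward, k, threshold=0.7))
```

The point of the post’s comparison is that this focused test succeeds after a number of samples independent of k when a constant fraction of actions are good, whereas spreading samples across all actions to certify one of them pays heavily for the action count.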


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('dependence', 0.476), ('actions', 0.309), ('action', 0.205), ('rewards', 0.198), ('reward', 0.17), ('hoeffding', 0.153), ('average', 0.137), ('kleinberg', 0.136), ('bandit', 0.135), ('announces', 0.126), ('poor', 0.123), ('features', 0.115), ('offset', 0.109), ('prove', 0.108), ('supervised', 0.103), ('round', 0.102), ('chooses', 0.096), ('labeling', 0.094), ('bounds', 0.091), ('partial', 0.09), ('settings', 0.083), ('situations', 0.082), ('kinds', 0.081), ('pick', 0.081), ('via', 0.078), ('setting', 0.076), ('page', 0.073), ('policy', 0.071), ('tree', 0.07), ('lower', 0.07), ('times', 0.069), ('holistic', 0.068), ('recursively', 0.068), ('seemingly', 0.068), ('adversarially', 0.068), ('analyses', 0.068), ('bobby', 0.068), ('demonstrate', 0.068), ('epsilon', 0.068), ('untrue', 0.068), ('number', 0.065), ('maximizes', 0.063), ('preserving', 0.063), ('revealed', 0.063), ('pecoc', 0.063), ('tackle', 0.063), ('deepak', 0.063), ('revealing', 0.063), ('structure', 0.063), ('sample', 0.062)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000004 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a

2 0.32546505 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

3 0.27607861 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

4 0.22767638 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

Introduction: There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require an exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly: The world reveals feature values (aka context information). A policy chooses an action. The world provides a reward. Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications. I’m g

5 0.19247508 400 hunch net-2010-06-13-The Good News on Exploration and Learning

Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x . A reward r is observed, dependent on x and a . Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting were implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and but

6 0.17182174 392 hunch net-2010-03-26-A Variance only Deviation Bound

7 0.15732557 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

8 0.13271755 183 hunch net-2006-06-14-Explorations of Exploration

9 0.12620668 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.12314834 343 hunch net-2009-02-18-Decision by Vetocracy

11 0.12198175 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

12 0.1175918 309 hunch net-2008-07-10-Interesting papers, ICML 2008

13 0.11749483 293 hunch net-2008-03-23-Interactive Machine Learning

14 0.11455394 351 hunch net-2009-05-02-Wielding a New Abstraction

15 0.1064685 100 hunch net-2005-08-04-Why Reinforcement Learning is Important

16 0.10202914 41 hunch net-2005-03-15-The State of Tight Bounds

17 0.087522238 360 hunch net-2009-06-15-In Active Learning, the question changes

18 0.087461777 235 hunch net-2007-03-03-All Models of Learning have Flaws

19 0.084642932 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary

20 0.084065273 432 hunch net-2011-04-20-The End of the Beginning of Active Learning


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.222), (1, 0.112), (2, 0.024), (3, -0.012), (4, 0.124), (5, -0.077), (6, 0.082), (7, -0.029), (8, -0.02), (9, 0.087), (10, 0.143), (11, 0.032), (12, 0.033), (13, 0.109), (14, -0.138), (15, 0.091), (16, -0.008), (17, -0.091), (18, -0.0), (19, 0.138), (20, -0.073), (21, -0.058), (22, -0.01), (23, -0.004), (24, -0.107), (25, 0.099), (26, -0.2), (27, -0.031), (28, -0.032), (29, -0.016), (30, -0.114), (31, -0.082), (32, 0.128), (33, 0.08), (34, 0.012), (35, -0.037), (36, -0.053), (37, 0.028), (38, -0.009), (39, 0.046), (40, -0.043), (41, -0.036), (42, -0.012), (43, 0.003), (44, -0.015), (45, 0.021), (46, 0.086), (47, -0.014), (48, 0.042), (49, 0.028)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96869683 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a

2 0.83730292 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

3 0.82864434 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

4 0.75626302 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

Introduction: There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require an exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly: The world reveals feature values (aka context information). A policy chooses an action. The world provides a reward. Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications. I’m g

5 0.7552284 400 hunch net-2010-06-13-The Good News on Exploration and Learning

Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x . A reward r is observed, dependent on x and a . Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting were implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and but

6 0.65941268 220 hunch net-2006-11-27-Continuizing Solutions

7 0.6401751 392 hunch net-2010-03-26-A Variance only Deviation Bound

8 0.52464426 183 hunch net-2006-06-14-Explorations of Exploration

9 0.52320868 309 hunch net-2008-07-10-Interesting papers, ICML 2008

10 0.51497912 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

11 0.49686679 169 hunch net-2006-04-05-What is state?

12 0.49191073 293 hunch net-2008-03-23-Interactive Machine Learning

13 0.487905 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

14 0.47494879 87 hunch net-2005-06-29-Not EM for clustering at COLT

15 0.47071868 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

16 0.45898432 351 hunch net-2009-05-02-Wielding a New Abstraction

17 0.45155311 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

18 0.44433257 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

19 0.44409379 148 hunch net-2006-01-13-Benchmarks for RL

20 0.4335297 286 hunch net-2008-01-25-Turing’s Club for Machine Learning


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.021), (27, 0.276), (38, 0.044), (53, 0.062), (54, 0.015), (55, 0.058), (73, 0.161), (77, 0.167), (94, 0.077), (95, 0.033)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93582118 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a

2 0.89836127 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

3 0.89765555 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith , Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is

4 0.89410281 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

5 0.89130324 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

6 0.86635554 257 hunch net-2007-07-28-Asking questions

7 0.83295006 375 hunch net-2009-10-26-NIPS workshops

8 0.82465261 259 hunch net-2007-08-19-Choice of Metrics

9 0.81996095 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.81534201 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

11 0.81358594 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

12 0.81151426 258 hunch net-2007-08-12-Exponentiated Gradient

13 0.80958533 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

14 0.80795556 183 hunch net-2006-06-14-Explorations of Exploration

15 0.80788672 436 hunch net-2011-06-22-Ultra LDA

16 0.80738777 220 hunch net-2006-11-27-Continuizing Solutions

17 0.80588186 293 hunch net-2008-03-23-Interactive Machine Learning

18 0.80405623 351 hunch net-2009-05-02-Wielding a New Abstraction

19 0.80227536 12 hunch net-2005-02-03-Learning Theory, by assumption

20 0.80073762 237 hunch net-2007-04-02-Contextual Scaling