hunch_net hunch_net-2007 hunch_net-2007-269 knowledge-graph by maker-knowledge-mining

269 hunch net-2007-10-24-Contextual Bandits


meta info for this blog

Source: html

Introduction: One of the fundamental underpinnings of the internet is advertising-based content. This has become much more effective due to targeted advertising, where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways, since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular, a click-or-not on one ad doesn’t generally tell you whether a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals the reward r_a of the chosen arm.
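To make the round-by-round protocol above concrete, here is a minimal sketch in Python of the k-Armed Bandit loop with a simple epsilon-greedy policy. The click probabilities, number of rounds, and value of epsilon are illustrative assumptions rather than anything from the post, and epsilon-greedy is only one of many reasonable bandit policies.

    import random

    def epsilon_greedy_bandit(click_prob, rounds=10000, epsilon=0.1):
        """Minimal k-armed bandit loop: pick one ad (arm) per round, observe only its reward."""
        k = len(click_prob)
        counts = [0] * k      # number of times each arm was chosen
        means = [0.0] * k     # empirical mean reward of each arm
        total_reward = 0.0
        for _ in range(rounds):
            # A policy chooses arm a from 1 of k arms (1 of k ads).
            if random.random() < epsilon:
                a = random.randrange(k)                    # explore: uniform random arm
            else:
                a = max(range(k), key=lambda i: means[i])  # exploit: best arm so far
            # The world reveals the reward r_a of the chosen arm only.
            r = 1.0 if random.random() < click_prob[a] else 0.0
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            total_reward += r
        return means, total_reward

    # Hypothetical click-through rates for 3 ads.
    print(epsilon_greedy_bandit([0.02, 0.05, 0.03]))

As the summary below notes, a policy run this way can converge on a single good ad, but it has no way to adapt its choice to the context in which the ad is displayed.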


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This has become much more effective due to targeted advertising where ads are specifically matched to interests. [sent-2, score-0.537]

2 Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. [sent-3, score-0.509]

3 The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. [sent-4, score-0.431]

4 A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. [sent-5, score-0.899]

5 On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). [sent-9, score-0.658]

6 The world reveals the reward r_a of the chosen arm (i.e. whether the chosen ad was clicked). [sent-12, score-0.598]

7 As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. a good choice of ad). [sent-15, score-0.576]

8 This setting (and its variants) fails to capture a critical phenomenon: each of these ads is displayed in the context of a search or other webpage. [sent-18, score-0.869]

9 To model this, we might think of a different setting where on each round: The world announces some context information x (think of this as a high dimensional bit vector if that helps). [sent-19, score-0.579]

10 A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). [sent-20, score-0.658]

11 The world reveals the reward r_a of the chosen arm (i.e. whether the chosen ad was clicked). [sent-23, score-0.598]

12 First, note that policies using x can encode much richer decisions than a policy not using x. [sent-27, score-0.416]

13 In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts. [sent-32, score-0.453]

14 Take some set of policies and treat every policy h(x) as a different arm. [sent-33, score-0.42]

15 This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. [sent-34, score-0.706]

16 …^0.5) where T is the number of rounds and |H| is the number of policies. [sent-39, score-0.375]

17 This result is independent of the number of contexts x and only weakly dependent (similar to supervised learning) on the number of policies. [sent-41, score-0.591]

18 Tong and I worked out a reasonably simple meta-algorithm Epoch-Greedy which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to O(T^{2/3} S^{1/3}) where S is related to the representational complexity of supervised learning on the set of policies. [sent-43, score-0.343]

19 This T dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on T could be optimized). [sent-44, score-0.696]

20 For myself, bandit algorithms are (at best) motivational because they cannot be applied to real-world problems without altering them to take context into account. [sent-50, score-0.469]
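The contextual protocol in sentences 9–11 and the Epoch-Greedy idea in sentence 18 can be illustrated with a small sketch. The version below is only a rough illustration under simplifying assumptions: it uses a tabular per-(context, arm) estimate and a fixed exploration probability, so it is not the actual Epoch-Greedy algorithm (which learns over an arbitrary policy class and sets epoch lengths more carefully), and its tabular policy reintroduces exactly the dependence on the number of contexts that the post argues against. The toy world at the bottom is made up for the example.

    import random
    from collections import defaultdict

    def contextual_bandit_sketch(draw_context, reward, k, rounds=20000, explore_prob=0.1):
        """Explore/exploit loop for the contextual protocol: exploration rounds build
        importance-weighted reward estimates; exploitation rounds follow the learned policy."""
        sums = defaultdict(float)   # summed importance-weighted rewards per (context, arm)
        counts = defaultdict(int)   # exploration pulls per (context, arm)
        total_reward = 0.0
        for _ in range(rounds):
            x = draw_context()                      # the world announces context x
            if random.random() < explore_prob:
                a = random.randrange(k)             # exploration: uniform over the k arms
                r = reward(x, a)                    # the world reveals r_a only
                sums[(x, a)] += r * k               # importance weight 1 / p_a = k
                counts[(x, a)] += 1
            else:
                # exploitation: policy h(x) = arm with the best estimated value in context x
                a = max(range(k), key=lambda i:
                        sums[(x, i)] / counts[(x, i)] if counts[(x, i)] else 0.0)
                r = reward(x, a)
            total_reward += r
        return total_reward

    # Hypothetical toy world: 3 contexts, 3 ads, context-dependent click rates.
    ctr = {0: [0.02, 0.06, 0.01], 1: [0.07, 0.01, 0.02], 2: [0.01, 0.02, 0.05]}
    print(contextual_bandit_sketch(lambda: random.randrange(3),
                                   lambda x, a: 1.0 if random.random() < ctr[x][a] else 0.0,
                                   k=3))

Replacing the tabular estimate with a supervised learner over a policy class {h(x)} is what removes the dependence on the number of contexts, with the costs and regret bounds discussed in sentences 15–19.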


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('ads', 0.328), ('arm', 0.291), ('bandit', 0.269), ('ad', 0.243), ('bandits', 0.233), ('dependence', 0.185), ('policy', 0.182), ('clicked', 0.175), ('contextual', 0.157), ('setting', 0.151), ('context', 0.146), ('supervised', 0.146), ('number', 0.139), ('search', 0.125), ('policies', 0.124), ('information', 0.103), ('engines', 0.102), ('arms', 0.102), ('contexts', 0.102), ('rounds', 0.097), ('advertising', 0.093), ('reveals', 0.09), ('round', 0.087), ('drawbacks', 0.083), ('chooses', 0.083), ('chosen', 0.079), ('fundamental', 0.075), ('reward', 0.073), ('doesn', 0.069), ('contrast', 0.066), ('world', 0.065), ('independent', 0.065), ('critical', 0.061), ('different', 0.06), ('encode', 0.058), ('dividing', 0.058), ('displayed', 0.058), ('targeted', 0.058), ('composable', 0.058), ('matched', 0.058), ('removes', 0.058), ('regret', 0.057), ('worked', 0.057), ('everyone', 0.055), ('treat', 0.054), ('motivational', 0.054), ('composition', 0.054), ('modifications', 0.054), ('announces', 0.054), ('note', 0.052)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

2 0.27607861 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards for each action r_1 , …, r_k then announces the features x (but not the rewards). A policy chooses an action a . The world announces the reward r_a . The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5} , epoch-greedy (and the simpler epsilon greedy) have a dependence on k^{1/3} , and the offset tree has a dependence on k-1 . These results aren’t directly comparable because different things a

3 0.25203091 400 hunch net-2010-06-13-The Good News on Exploration and Learning

Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x . A reward r is observed, dependent on x and a . Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting were implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and but

4 0.24947293 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

5 0.21743451 293 hunch net-2008-03-23-Interactive Machine Learning

Introduction: A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects. What is Interactive Machine Learning? The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn. For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include: Online learning against an adversary ( Avrim’s Notes ). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts. Active Learning . In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses. Contextual Bandits . The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning

6 0.17539757 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

7 0.1605316 309 hunch net-2008-07-10-Interesting papers, ICML 2008

8 0.15269674 220 hunch net-2006-11-27-Continuizing Solutions

9 0.13143538 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.12741494 237 hunch net-2007-04-02-Contextual Scaling

11 0.12416722 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

12 0.11846599 156 hunch net-2006-02-11-Yahoo’s Learning Problems.

13 0.11565717 423 hunch net-2011-02-02-User preferences for search engines

14 0.11477216 410 hunch net-2010-09-17-New York Area Machine Learning Events

15 0.1129866 392 hunch net-2010-03-26-A Variance only Deviation Bound

16 0.11120576 14 hunch net-2005-02-07-The State of the Reduction

17 0.10437593 183 hunch net-2006-06-14-Explorations of Exploration

18 0.10295661 351 hunch net-2009-05-02-Wielding a New Abstraction

19 0.10061527 120 hunch net-2005-10-10-Predictive Search is Coming

20 0.10028421 360 hunch net-2009-06-15-In Active Learning, the question changes


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.249), (1, 0.104), (2, -0.033), (3, 0.02), (4, 0.078), (5, -0.066), (6, 0.078), (7, 0.002), (8, -0.043), (9, 0.066), (10, 0.148), (11, 0.035), (12, 0.03), (13, 0.16), (14, -0.168), (15, 0.118), (16, 0.032), (17, -0.099), (18, 0.112), (19, 0.092), (20, -0.072), (21, -0.031), (22, 0.016), (23, 0.027), (24, -0.099), (25, 0.183), (26, -0.081), (27, 0.039), (28, -0.058), (29, 0.044), (30, -0.096), (31, -0.007), (32, 0.145), (33, 0.026), (34, 0.005), (35, -0.021), (36, -0.039), (37, -0.007), (38, 0.008), (39, 0.081), (40, -0.03), (41, -0.023), (42, -0.065), (43, 0.018), (44, -0.027), (45, -0.017), (46, 0.063), (47, -0.026), (48, 0.046), (49, 0.008)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96269059 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

2 0.86333996 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards for each action r_1 , …, r_k then announces the features x (but not the rewards). A policy chooses an action a . The world announces the reward r_a . The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5} , epoch-greedy (and the simpler epsilon greedy) have a dependence on k^{1/3} , and the offset tree has a dependence on k-1 . These results aren’t directly comparable because different things a

3 0.8163166 400 hunch net-2010-06-13-The Good News on Exploration and Learning

Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x . A reward r is observed, dependent on x and a . Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting were implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and but

4 0.79641378 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

5 0.70282567 220 hunch net-2006-11-27-Continuizing Solutions

Introduction: This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful. Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples. ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set. Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially. Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that sim

6 0.68505013 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

7 0.60267127 293 hunch net-2008-03-23-Interactive Machine Learning

8 0.57084274 392 hunch net-2010-03-26-A Variance only Deviation Bound

9 0.54042345 309 hunch net-2008-07-10-Interesting papers, ICML 2008

10 0.52724737 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

11 0.51858616 169 hunch net-2006-04-05-What is state?

12 0.51741922 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

13 0.51383495 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

14 0.49871451 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

15 0.49130186 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

16 0.49086368 183 hunch net-2006-06-14-Explorations of Exploration

17 0.48856714 100 hunch net-2005-08-04-Why Reinforcement Learning is Important

18 0.46915606 351 hunch net-2009-05-02-Wielding a New Abstraction

19 0.46543205 148 hunch net-2006-01-13-Benchmarks for RL

20 0.46426648 276 hunch net-2007-12-10-Learning Track of International Planning Competition


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.014), (3, 0.012), (10, 0.01), (27, 0.249), (38, 0.023), (53, 0.056), (55, 0.064), (64, 0.01), (77, 0.331), (83, 0.014), (94, 0.101), (95, 0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98661309 436 hunch net-2011-06-22-Ultra LDA

Introduction: Shravan and Alex ‘s LDA code is released . On a single machine, I’m not sure how it currently compares to the online LDA in VW , but the ability to effectively scale across very many machines is surely interesting.

2 0.96475625 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith , Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is

3 0.95706385 405 hunch net-2010-08-21-Rob Schapire at NYC ML Meetup

Introduction: I’ve been wanting to attend the NYC ML Meetup for some time and hope to make it next week on the 25th . Rob Schapire is talking about “Playing Repeated Games”, which in my experience is far more relevant to machine learning than the title might indicate.

4 0.92116785 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

same-blog 5 0.90422022 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “ k -Armed Bandits”, often with various relevant embellishments. The k -Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

6 0.85810572 388 hunch net-2010-01-24-Specializations of the Master Problem

7 0.84349245 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

8 0.81666869 375 hunch net-2009-10-26-NIPS workshops

9 0.73856157 392 hunch net-2010-03-26-A Variance only Deviation Bound

10 0.73764122 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

11 0.71061069 259 hunch net-2007-08-19-Choice of Metrics

12 0.70606649 235 hunch net-2007-03-03-All Models of Learning have Flaws

13 0.69716644 118 hunch net-2005-10-07-On-line learning of regular decision rules

14 0.69520736 237 hunch net-2007-04-02-Contextual Scaling

15 0.69311851 258 hunch net-2007-08-12-Exponentiated Gradient

16 0.69145888 293 hunch net-2008-03-23-Interactive Machine Learning

17 0.69099498 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

18 0.68910897 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

19 0.68808532 220 hunch net-2006-11-27-Continuizing Solutions

20 0.6875487 43 hunch net-2005-03-18-Binomial Weighting