hunch_net hunch_net-2010 hunch_net-2010-388 knowledge-graph by maker-knowledge-mining

388 hunch net-2010-01-24-Specializations of the Master Problem


meta info for this blog

Source: html

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. The agent makes a choice a. The world announces a reward r. The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of
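
To make the protocol concrete, here is a minimal sketch of the observe/act/reward loop described above; the `world` and `agent` interfaces are hypothetical stand-ins assumed for this illustration, not code from the post.

```python
# Minimal sketch of the master problem's interaction protocol. `world` and
# `agent` are hypothetical interfaces assumed only for this illustration.

def run_interaction(world, agent, horizon):
    """Run the observe/act/reward loop for `horizon` steps; return total reward."""
    total_reward = 0.0
    for _ in range(horizon):
        x = world.observe()        # the world announces an observation x
        a = agent.act(x)           # the agent makes a choice a
        r = world.reward(a)        # the world announces a reward r
        agent.learn(x, a, r)       # the agent may adapt from the feedback
        total_reward += r
    return total_reward            # the agent's goal: maximize this sum
```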


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. [sent-2, score-0.557]

2 Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? [sent-11, score-0.577]

3 Online Learning: The online learning setting is perhaps the most satisfying specialization more general than standard batch learning at present, because it turns out to additionally provide tractable algorithms for many batch learning settings. [sent-13, score-0.797]

4 Standard online learning models specialize in two ways: You assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, a retrospectively available reward for each action. [sent-14, score-0.826]

5 The algorithm for an agent in this setting typically has a given name—gradient descent, weighted majority, Winnow, etc… The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. [sent-15, score-0.851]
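
As a concrete instance of such an online update rule, here is a minimal online gradient descent learner; the linear predictor, squared loss, and fixed learning rate are assumptions chosen for illustration, not a prescription from the post.

```python
import numpy as np

# Illustrative online gradient descent for the full-information online setting:
# a linear predictor updated after every example. Hyperparameters are arbitrary.

class OnlineGradientDescent:
    def __init__(self, dim, learning_rate=0.1):
        self.w = np.zeros(dim)
        self.lr = learning_rate

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        """One online step: predict, observe the outcome y, take a gradient step."""
        error = self.predict(x) - y       # gradient of 0.5*(w.x - y)^2 w.r.t. w.x
        self.w -= self.lr * error * x     # descend the squared loss
```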

6 An awesome discovery about this setting is that it’s possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. [sent-16, score-0.703]

7 The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, there is some finite budget from which an adversary deducts its ability. [sent-18, score-0.694]

8 The form of the final theorem is that you compete with the accumulated reward of a set of one-step policies h: X -> A, with a dependence of log(#policies) or weaker in regret, a measure of failure to compete. [sent-19, score-0.709]
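
A sketch of how such a theorem is typically achieved: an exponential-weights (Hedge-style) learner over a finite set of one-step policies, whose regret against the best policy grows roughly as sqrt(T log(#policies)) for rewards in [0,1]. The class below is an illustration under assumed interfaces (integer actions, a full reward vector revealed each step), not the post's own algorithm.

```python
import numpy as np

# Exponential-weights sketch: compete with a finite set of one-step policies
# h: X -> A under full-information feedback (rewards of all actions revealed).

class ExponentialWeights:
    def __init__(self, policies, eta=0.1):
        self.policies = policies              # list of callables h(x) -> action index
        self.log_w = np.zeros(len(policies))  # log-weights, initially uniform
        self.eta = eta

    def act(self, x):
        p = np.exp(self.log_w - self.log_w.max())
        p /= p.sum()
        i = np.random.choice(len(self.policies), p=p)   # follow a sampled policy
        return self.policies[i](x)

    def update(self, x, rewards):
        """`rewards[a]` is the retrospectively revealed reward of each action a."""
        for i, h in enumerate(self.policies):
            self.log_w[i] += self.eta * rewards[h(x)]   # credit each policy's own choice
```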

9 Provably computationally tractable special cases all have linear structure, either on rewards or policies. [sent-22, score-0.501]

10 Analysis in this basic setting started very specialized with Gittins’ Indices and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. [sent-26, score-0.566]
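
A simplified EXP3-style sketch, for reference: only the pulled arm's reward is observed, so an importance-weighted estimate substitutes for the full reward vector. The parameterization below is one common textbook form, written as an illustration rather than a faithful reproduction of the published algorithm.

```python
import numpy as np

# Simplified EXP3 for the k-armed adversarial bandit with rewards in [0, 1].
# Usage each round: a = exp3.act(); observe reward r; exp3.update(a, r).

class EXP3:
    def __init__(self, k, gamma=0.1):
        self.k = k
        self.gamma = gamma                 # exploration rate
        self.log_w = np.zeros(k)
        self.p = np.full(k, 1.0 / k)

    def act(self):
        w = np.exp(self.log_w - self.log_w.max())
        w /= w.sum()
        self.p = (1 - self.gamma) * w + self.gamma / self.k   # mix in uniform exploration
        return int(np.random.choice(self.k, p=self.p))

    def update(self, arm, reward):
        estimate = reward / self.p[arm]                 # unbiased importance-weighted reward
        self.log_w[arm] += self.gamma * estimate / self.k
```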

11 If there are k strategies available, the standard theorem states that you can compete with the set of all constant strategies up to regret k. [sent-27, score-0.728]

12 The most impressive theoretical discovery in this setting is that the dependence on T, the number of timesteps, is not substantially worse than supervised learning despite the need to explore. [sent-28, score-0.607]

13 However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice—it’s an example of optimal decision making given that you ignore almost all information. [sent-30, score-0.636]

14 Canonical algorithms here are EXP4 (computationally intractable, but information-theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree, providing a reduction to supervised binary classification. [sent-32, score-0.536]
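
For flavor, here is an epsilon-greedy contextual bandit sketch, a much-simplified cousin of Epoch-Greedy: explore uniformly with small probability and otherwise act greedily against a supervised regression "oracle" fit on the logged (context, action, reward) data. The oracle interface and refitting schedule are assumptions for illustration; the published algorithms differ in important details.

```python
import numpy as np

# Epsilon-greedy contextual bandit sketch built on a supervised oracle:
# any regressor with fit(X, y) / predict(X), e.g. a scikit-learn estimator.

class EpsilonGreedyContextualBandit:
    def __init__(self, k, oracle, epsilon=0.05):
        self.k = k                    # number of actions
        self.oracle = oracle
        self.epsilon = epsilon
        self.X, self.y = [], []       # logged (context ++ action) features and rewards

    def act(self, x, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        if not self.X or rng.random() < self.epsilon:
            return int(rng.integers(self.k))                          # explore uniformly
        scores = self.oracle.predict([np.append(x, a) for a in range(self.k)])
        return int(np.argmax(scores))                                 # exploit the oracle

    def observe(self, x, a, r):
        self.X.append(np.append(x, a))
        self.y.append(r)
        self.oracle.fit(self.X, self.y)   # refit (Epoch-Greedy would refit per epoch)
```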

15 MDP analysis: A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation x is a state s, which is a sufficient statistic for predicting all future observations. [sent-33, score-0.556]
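
When the MDP is small, known, and finite, the sufficient-statistic property makes exact planning straightforward; a standard value iteration routine is sketched below, with an assumed (S, A, S) transition array P and (S, A) reward array R.

```python
import numpy as np

# Value iteration for a known finite MDP: P[s, a, s'] = transition probability,
# R[s, a] = expected reward. Returns optimal values and a greedy policy.

def value_iteration(P, R, discount=0.95, tol=1e-8):
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + discount * (P @ V)        # Q[s, a] = R[s, a] + discount * E[V(s')]
        V_new = Q.max(axis=1)             # Bellman backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```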

16 The canonical setting for control theory is with a known MDP having linear transition dynamics. [sent-47, score-0.492]
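
The linear-dynamics case is typically handled with the linear-quadratic regulator; a finite-horizon Riccati recursion is sketched below, phrased as cost minimization (the mirror image of reward maximization). The matrix names follow the usual x' = A x + B u convention and are assumptions for illustration.

```python
import numpy as np

# Finite-horizon LQR: dynamics x' = A x + B u, per-step cost x^T Q x + u^T R u.
# Backward Riccati recursion producing time-indexed feedback gains u_t = -K_t x_t.

def lqr_gains(A, B, Q, R, horizon):
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal one-step gain
        P = Q + A.T @ P @ (A - B @ K)                       # updated cost-to-go matrix
        gains.append(K)
    return list(reversed(gains))     # gains[t] applies at time t from the start
```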

17 Oracle Advice Shortcuts: Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. [sent-50, score-0.909]

18 A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. [sent-51, score-0.493]

19 An alternative form of oracle is provided by access to a good policy at training time. [sent-54, score-0.489]
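
One way to exploit a good policy available at training time is a DAgger-flavored procedure: roll out the learner, ask the good policy what it would have done at the observations actually visited, and train a supervised learner on those labels. Everything below (world, good_policy, classifier) is an assumed interface sketched for illustration.

```python
# DAgger-style sketch: imitate a good policy on the states the learner itself visits.

def train_with_policy_oracle(world, good_policy, classifier, rounds=10, horizon=100):
    X, y = [], []
    for round_idx in range(rounds):
        x = world.reset()
        for _ in range(horizon):
            # first round: follow the good policy; afterwards: let the learner drive
            a = good_policy(x) if round_idx == 0 else classifier.act(x)
            X.append(x)
            y.append(good_policy(x))     # the good policy labels every visited observation
            x = world.step(a)
        classifier.fit(X, y)             # supervised learning on the aggregated dataset
    return classifier
```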

20 One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich, as detailed here. [sent-61, score-0.543]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('setting', 0.317), ('oracle', 0.236), ('dependence', 0.2), ('mdp', 0.174), ('policies', 0.159), ('tractable', 0.155), ('observation', 0.144), ('step', 0.144), ('agent', 0.128), ('observations', 0.127), ('compete', 0.124), ('policy', 0.117), ('strategies', 0.108), ('analysis', 0.105), ('canonical', 0.101), ('world', 0.101), ('rewards', 0.098), ('master', 0.091), ('algorithm', 0.091), ('specialize', 0.09), ('statistic', 0.09), ('discovery', 0.09), ('computationally', 0.089), ('regret', 0.088), ('batch', 0.086), ('reward', 0.084), ('state', 0.084), ('uncontrolled', 0.083), ('announces', 0.083), ('algorithms', 0.082), ('states', 0.081), ('adversarial', 0.081), ('action', 0.081), ('available', 0.078), ('constant', 0.077), ('refined', 0.075), ('delay', 0.075), ('assumed', 0.075), ('known', 0.074), ('set', 0.071), ('standard', 0.071), ('form', 0.071), ('thesis', 0.07), ('filters', 0.07), ('specialized', 0.067), ('sufficient', 0.066), ('good', 0.065), ('dynamic', 0.064), ('detailed', 0.064), ('given', 0.063)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. The agent makes a choice a. The world announces a reward r. The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

2 0.32546505 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards for each action r_1, …, r_k, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a

3 0.26508594 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

4 0.24947293 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

5 0.23045269 183 hunch net-2006-06-14-Explorations of Exploration

Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E^3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E^3 and

6 0.20627446 252 hunch net-2007-07-01-Watchword: Online Learning

7 0.20302184 220 hunch net-2006-11-27-Continuizing Solutions

8 0.20001486 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

9 0.19090711 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.18716422 400 hunch net-2010-06-13-The Good News on Exploration and Learning

11 0.18504468 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

12 0.18476298 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

13 0.17329448 169 hunch net-2006-04-05-What is state?

14 0.1640573 127 hunch net-2005-11-02-Progress in Active Learning

15 0.16220936 343 hunch net-2009-02-18-Decision by Vetocracy

16 0.16155382 8 hunch net-2005-02-01-NIPS: Online Bayes

17 0.15829334 100 hunch net-2005-08-04-Why Reinforcement Learning is Important

18 0.15633464 309 hunch net-2008-07-10-Interesting papers, ICML 2008

19 0.155986 351 hunch net-2009-05-02-Wielding a New Abstraction

20 0.15153983 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.339), (1, 0.195), (2, 0.027), (3, -0.011), (4, 0.146), (5, -0.102), (6, 0.075), (7, -0.024), (8, -0.034), (9, 0.141), (10, 0.189), (11, 0.0), (12, 0.053), (13, 0.109), (14, -0.073), (15, 0.109), (16, 0.05), (17, -0.13), (18, 0.071), (19, 0.113), (20, -0.14), (21, -0.11), (22, 0.028), (23, 0.001), (24, -0.125), (25, 0.114), (26, -0.163), (27, -0.055), (28, 0.027), (29, 0.015), (30, -0.077), (31, -0.029), (32, 0.096), (33, 0.059), (34, 0.046), (35, 0.005), (36, 0.001), (37, 0.043), (38, -0.002), (39, -0.069), (40, -0.03), (41, 0.008), (42, 0.006), (43, 0.024), (44, -0.019), (45, -0.017), (46, 0.079), (47, -0.01), (48, 0.004), (49, -0.027)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97242224 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. The agent makes a choice a. The world announces a reward r. The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

2 0.90617317 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards for each action r_1, …, r_k, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a

3 0.85148084 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

4 0.78330117 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

Introduction: There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require an exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly: The world reveals feature values (aka context information). A policy chooses an action. The world provides a reward. Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications. I’m g

5 0.7721402 220 hunch net-2006-11-27-Continuizing Solutions

Introduction: This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful. Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples. ERM: In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set. Experts: In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially. Policy Iteration: The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that sim

6 0.75631636 400 hunch net-2010-06-13-The Good News on Exploration and Learning

7 0.67380327 183 hunch net-2006-06-14-Explorations of Exploration

8 0.65041167 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

9 0.63770175 309 hunch net-2008-07-10-Interesting papers, ICML 2008

10 0.6067487 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

11 0.60047644 169 hunch net-2006-04-05-What is state?

12 0.59768391 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

13 0.56485981 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

14 0.56088817 148 hunch net-2006-01-13-Benchmarks for RL

15 0.56019783 28 hunch net-2005-02-25-Problem: Online Learning

16 0.5543983 100 hunch net-2005-08-04-Why Reinforcement Learning is Important

17 0.55213696 293 hunch net-2008-03-23-Interactive Machine Learning

18 0.55002713 101 hunch net-2005-08-08-Apprenticeship Reinforcement Learning for Control

19 0.54828745 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

20 0.54724652 392 hunch net-2010-03-26-A Variance only Deviation Bound


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.038), (9, 0.022), (27, 0.242), (38, 0.039), (49, 0.021), (53, 0.06), (55, 0.053), (56, 0.013), (63, 0.01), (64, 0.02), (77, 0.234), (84, 0.03), (94, 0.083), (95, 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98155677 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith , Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is

2 0.95014775 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

3 0.94401425 436 hunch net-2011-06-22-Ultra LDA

Introduction: Shravan and Alex’s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.

4 0.93773144 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising based content. This has become much more effective due to targeted advertising where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t

same-blog 5 0.92827213 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x. The agent makes a choice a. The world announces a reward r. The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

6 0.90124583 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

7 0.89883465 405 hunch net-2010-08-21-Rob Schapire at NYC ML Meetup

8 0.85046804 375 hunch net-2009-10-26-NIPS workshops

9 0.81112856 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

10 0.8078354 259 hunch net-2007-08-19-Choice of Metrics

11 0.80040449 235 hunch net-2007-03-03-All Models of Learning have Flaws

12 0.7924692 392 hunch net-2010-03-26-A Variance only Deviation Bound

13 0.79216623 183 hunch net-2006-06-14-Explorations of Exploration

14 0.78431195 258 hunch net-2007-08-12-Exponentiated Gradient

15 0.78369051 220 hunch net-2006-11-27-Continuizing Solutions

16 0.78344136 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

17 0.78312498 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

18 0.78221607 237 hunch net-2007-04-02-Contextual Scaling

19 0.77609891 351 hunch net-2009-05-02-Wielding a New Abstraction

20 0.77406412 12 hunch net-2005-02-03-Learning Theory, by assumption