nips nips2010 nips2010-152 knowledge-graph by maker-knowledge-mining

152 nips-2010-Learning from Logged Implicit Exploration Data


Source: pdf

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. [sent-11, score-0.332]

2 Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. [sent-13, score-0.323]

3 The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. [sent-14, score-0.784]

4 1 Introduction: Consider the advertisement display problem, where a search engine company chooses an ad to display that is intended to interest the user. [sent-17, score-0.378]

5 An instance of the contextual bandit problem is specified by a distribution D over tuples (x, r) where x ∈ X is an input and r ∈ [0, 1]^k is a vector of rewards [6]. [sent-26, score-0.355]

6 The algorithm chooses an action a ∈ A, possibly as a function of x and historical information. [sent-30, score-0.322]

7 The world announces the reward ra of action a, but not ra′ for a′ ≠ a. [sent-32, score-0.442]

8 It is critical to understand that this is not a standard supervised-learning problem, because the reward of other actions a′ ≠ a is not revealed. [sent-36, score-0.223]

9 The standard goal in this setting is to maximize the sum of rewards ra over the rounds of interaction. [sent-37, score-0.198]
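
A minimal sketch of the interaction protocol described in the preceding sentences, assuming a toy stand-in for the distribution D; all names here (draw_context_and_rewards, run_round) are hypothetical and not from the paper. The key point is that only the chosen action's reward is revealed.

```python
import numpy as np

# Sketch of the contextual bandit protocol: the world draws (x, r) ~ D,
# the learner picks an action a, and only r_a is revealed to the learner.
rng = np.random.default_rng(0)
K = 3  # number of actions

def draw_context_and_rewards():
    """Toy stand-in for D: a 5-dim context and a reward vector r in [0, 1]^K."""
    x = rng.normal(size=5)
    r = rng.uniform(0.0, 1.0, size=K)
    return x, r

def run_round(choose_action):
    x, r = draw_context_and_rewards()
    a = choose_action(x)      # the choice may depend on x and on history
    return x, a, r[a]         # only r_a is logged; r_{a'} for a' != a stays hidden

# Example: a logging policy that ignores the context entirely.
log = [run_round(lambda x: int(rng.integers(K))) for _ in range(5)]
print(log[0])
```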

10 In order to do this well, it is essential to use previously recorded events to form a good policy on the first round of interaction. [sent-38, score-0.757]

11 Formally, given a dataset of the form S = (x, a, ra)^* generated by the interaction of an uncontrolled logging policy, we want to construct a policy h maximizing (either exactly or approximately) V^h := E_{(x,r)∼D}[r_{h(x)}]. [sent-40, score-1.07]

12 We could learn a regressor s : X × A → [0, 1] which is trained to predict the reward, on observed events conditioned on the action a and other information x. [sent-43, score-0.393]

13 From this regressor, a policy is derived according to h(x) = argmaxa∈A s(x, a). [sent-44, score-0.571]

14 Suppose that there are two actions a and b, with action a occurring 10^6 times and action b occurring 10^2 times. [sent-47, score-0.411]

15 Since action b occurs only a 10^-4 fraction of the time, a learning algorithm forced to trade off between predicting the expected values of ra and rb overwhelmingly prefers to estimate ra well at the expense of accurate estimation for rb. [sent-48, score-0.615]

16 This problem is only worse when action b occurs zero times, as might commonly occur in exploration situations. [sent-50, score-0.266]
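
As a concrete illustration of the regress-then-argmax baseline critiqued in the preceding sentences, here is a minimal sketch; scikit-learn, the Ridge regressor, and the data layout are assumptions, not the authors' implementation. Nothing here reweights rarely logged actions, which is exactly the failure mode described above.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_naive_regressor(X, A, R, num_actions):
    # Train s(x, a) on (context, one-hot action) features against observed r_a.
    feats = np.hstack([X, np.eye(num_actions)[A]])
    return Ridge(alpha=1.0).fit(feats, R)

def naive_argmax_policy(model, x, num_actions):
    # h(x) = argmax_a s(x, a): score every action for the same context x.
    feats = np.hstack([np.tile(x, (num_actions, 1)), np.eye(num_actions)])
    return int(np.argmax(model.predict(feats)))
```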

17 Existing approaches to contextual bandits, such as EXP4 [1] or Epoch-Greedy [6], require either interaction to gather data or knowledge of the probability that the logging policy chose the action a. [sent-58, score-1.179]

18 It is possible to recover exploration information from action visitation frequency when a logging policy chooses actions independent of the input x (but possibly dependent on history) [5]. [sent-62, score-1.401]

19 This doesn’t fit our setting, where the logging policy is surely dependent on the input. [sent-63, score-0.863]

20 When conducting a survey, a question about income might be included, and then the proportion of responders at various income levels can be compared to census data to estimate a probability conditioned on income that someone chooses to partake in the survey. [sent-66, score-0.383]

21 This approach is problematic here, because the policy making decisions when logging the data may be deterministic rather than probabilistic. [sent-68, score-0.903]

22 In other words, accurately predicting the probability of the logging policy choosing an ad implies always predicting 0 or 1 which is not useful for our purposes. [sent-69, score-1.009]

23 Although the straightforward use of propensity scores does not work, the approach we take can be thought of as a more clever use of a propensity score, as discussed below. [sent-70, score-0.222]

24 For each event (x, a, ra), estimate the probability π̂(a|x) that the logging policy chooses action a, using regression. [sent-74, score-1.388]

25 For each event (x, a, ra), create a synthetic controlled contextual bandit event according to (x, a, ra, 1/max{π̂(a|x), τ}), where τ > 0 is some parameter. [sent-77, score-0.752]

26 Apply an offline contextual bandit algorithm to the set of synthetic contextual bandit events. [sent-81, score-0.6]
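
A minimal sketch of the three steps above, assuming scikit-learn and a simple data layout (X contexts, A logged actions, R observed rewards); the per-action importance-weighted regressors in step 3 are a simplified stand-in for the modified argmax regressor described in the next sentence, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDRegressor

# Assumes actions are labeled 0..K-1 and every action appears at least once
# in the log, so predict_proba columns line up with action indices.

def estimate_logging_policy(X, A):
    """Step 1: regress the logged action on the context to estimate pi_hat(a|x)."""
    return LogisticRegression(max_iter=1000).fit(X, A)

def synthesize_events(clf, X, A, R, tau=0.05):
    """Step 2: attach importance weights 1 / max(pi_hat(a|x), tau) to each event."""
    probs = clf.predict_proba(X)[np.arange(len(A)), A]
    weights = 1.0 / np.maximum(probs, tau)
    return list(zip(X, A, R, weights))

def offline_bandit_learn(events, num_actions, dim):
    """Step 3: importance-weighted reward regression, one regressor per action,
    as a simplified stand-in for the modified argmax regressor described next."""
    models = []
    for a in range(num_actions):
        Xa = np.array([x for x, aa, r, w in events if aa == a]).reshape(-1, dim)
        Ra = np.array([r for x, aa, r, w in events if aa == a])
        Wa = np.array([w for x, aa, r, w in events if aa == a])
        models.append(SGDRegressor(max_iter=1000).fit(Xa, Ra, sample_weight=Wa))
    return lambda x: int(np.argmax([m.predict(x.reshape(1, -1))[0] for m in models]))
```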

27 A variant of the argmax regressor is used, with two critical modifications: (a) we limit the scope of the argmax to those actions with positive probability; (b) we importance-weight events so that the training process emphasizes good estimation for each action equally. [sent-83, score-0.685]

28 It should be emphasized that the theoretical analysis in this paper applies to any algorithm for learning on contextual bandit events—we chose this one because it is a simple modification on existing (but fundamentally broken) approaches. [sent-84, score-0.3]

29 Relative to it, we use a different definition of probability which is not necessarily 0 or 1 when the logging policy is completely deterministic. [sent-86, score-0.863]

30 What does π̂(a|x) mean, given that the logging policy may be deterministically choosing an action (ad) a given features x? [sent-89, score-1.042]

31 The essential observation is that a policy which deterministically chooses action a on day 1 and then deterministically chooses action b on day 2 can be treated as randomizing between actions a and b with probability 0.5. [sent-90, score-1.346]

32 This holds when the number of events is the same each day and the events are IID. [sent-91, score-0.28]

33 Thus π̂(a|x) is an estimate of the expected frequency with which action a would be displayed given features x over the timespan of the logged events. [sent-92, score-0.377]

34 While this creates a bias in the estimation process, the form of the bias turns out to be mild and relatively reasonable: actions which are displayed with low frequency conditioned on x effectively have an underestimated value. [sent-102, score-0.284]

35 We close with a generalization from policy evaluation to policy selection with a sample complexity bound in section 3. [sent-106, score-1.209]

36 The learning algorithm is given a dataset of T samples, each of the form (x, a, ra) ∈ X × A × [0, 1], where (x, r) is drawn from D as described in Section 1, and the action a ∼ πt(·|x) is chosen according to the t-th policy. [sent-112, score-0.377]

37 We denote this random process by (x, a, ra ) ∼ (D, πt (·|x)). [sent-113, score-0.171]

38 Offline policy estimator: given a dataset of the form S = {(xt, at, rt,at)}_{t=1}^T (1), where for all t, xt ∈ X, at ∈ A, rt,at ∈ [0, 1], we form a predictor π̂ : X × A → [0, 1] and then use it with a threshold τ ∈ [0, 1] to form an offline estimator for the value of a policy h. [sent-116, score-1.367]

39 Formally, given a new policy h : X → A and a dataset S, define the estimator V̂^h_π̂(S) = (1/|S|) Σ_{(x,a,ra)∈S} ra I(h(x) = a) / max{π̂(a|x), τ} (2), where I(·) denotes the indicator function. [sent-117, score-0.778]
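
A minimal sketch of the estimator in Equation (2); the function names and the data layout (an iterable of logged (x, a, r) triples) are assumptions for illustration.

```python
def offline_value_estimate(events, h, pi_hat, tau):
    """events: iterable of logged (x, a, r) triples.
    h: policy under evaluation, maps x -> action.
    pi_hat: estimated logging policy, maps (a, x) -> probability."""
    total, n = 0.0, 0
    for x, a, r in events:
        n += 1
        if h(x) == a:
            # r_a * 1{h(x) = a} / max(pi_hat(a|x), tau)
            total += r / max(pi_hat(a, x), tau)
    return total / max(n, 1)
```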

40 The main idea is twofold: first, we have a policy estimation step, where we estimate the (unknown) logging policy (Subsection 3.1). [sent-122, score-1.48]

41 Second, we have a policy optimization step, where we utilize our estimated logging policy (Subsection 3.2). [sent-123, score-1.434]

42 The logging policy πt may be deterministic, implying that conventional approaches relying on randomization in the logging policy are not applicable. [sent-127, score-1.782]

43 We show next that this is ok when the world is IID and the policy varies over its actions. [sent-128, score-0.598]

44 A basic claim is that the estimator is equivalent, in expectation, to a stochastic policy defined by π(a|x) = E_{t∼UNIF(1,…,T)}[πt(a|x)] (3). [sent-130, score-0.731]

45 The stochastic policy π chooses an action uniformly at random over the T policies πt . [sent-134, score-1.016]

46 Our first result is that the expected value of our estimator is the same when the world chooses actions according to either π or to the sequence of policies πt . [sent-135, score-0.564]
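
A toy simulation, under assumed reward values and hypothetical names, illustrating the claim above: logging with a sequence of deterministic policies behaves, in expectation, like logging with the uniform mixture π of Equation (3).

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 4, 2
true_reward = np.array([0.2, 0.7])                 # E[r_a]; independent of x here
pi_t = [lambda x, a=t % K: a for t in range(T)]    # deterministic loggers 0,1,0,1
mix = np.array([sum(p(None) == a for p in pi_t) / T for a in range(K)])  # pi(a|x)

def estimate(h_action, actions, rewards, tau=0.05):
    # Equation (2) specialized to a constant policy h(x) = h_action.
    keep = actions == h_action
    return np.sum(rewards[keep] / np.maximum(mix[h_action], tau)) / len(actions)

acts = np.array([pi_t[i % T](None) for i in range(20000)])   # log with the sequence
rews = rng.binomial(1, true_reward[acts]).astype(float)
print(estimate(1, acts, rews))                               # approx 0.7 = V^h for h = a_1
```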

47 3.1 Policy Estimation: In this section we show that for a suitable choice of τ and π̂ our estimator is sufficiently accurate for evaluating new policies h. [sent-144, score-0.274]

48 We aggressively use the simplification of the previous section, which shows that we can think of the data as generated by a fixed stochastic policy π. [sent-145, score-0.61]

49 In the following theorem statement, I(·) denotes the indicator function, π(a|x) the probability that the logging policy chooses action a on input x, and V̂^h_π̂ our estimator as defined by Equation 2 based on parameter τ. [sent-151, score-1.237]

50 Let V^h(x) = E_{r∼D(·|x)}[r_{h(x)}] denote the expected value of executing policy h on input x. [sent-156, score-0.599]

51 Theorem 3.1 shows that the expected value of our estimate V̂^h_π̂ of a policy h is an approximation to a lower bound on the true value of the policy h, where the approximation is due to errors in the estimate π̂ and the lower bound is due to the threshold τ. [sent-165, score-1.312]

52 Thus, with a perfect predictor π̂ of π, the expected value of the estimator V̂^h_π̂ is a guaranteed lower bound on the true value of policy h. [sent-168, score-0.774]

53 However, as the left-hand-side of this statement suggests, it may be a very loose bound, especially if the action chosen by h often has a small probability of being chosen by π. [sent-169, score-0.222]

54 Consider an instance of the bandit problem with a single input x and two actions a1 , a2 . [sent-172, score-0.288]

55 Suppose that π(a1|x) = τ + ε for some positive ε and h(x) = a1 is the policy we are evaluating. [sent-173, score-0.571]

56 3.2 Policy Optimization: The previous section proves that we can effectively evaluate a policy h by observing a stochastic policy π, as long as the actions chosen by h have adequate support under π, specifically π(h(x)|x) ≥ τ for all inputs x. [sent-178, score-1.36]

57 However, we are often interested in choosing the best policy h from a set of policies H after observing logged data. [sent-179, score-0.841]

58 Furthermore, as described in Section 2, the logged data are generated from T fixed, possibly deterministic, policies π1, …, πT. [sent-180, score-0.307]

59 As in Section 3 we define the stochastic policy π by Equation 3, π(a|x) = E_{t∼UNIF(1,…,T)}[πt(a|x)]. [sent-184, score-0.61]

60 Let ĥ = argmax_{h∈H} V̂^h_π̂ be the hypothesis that maximizes the empirical value estimator defined in Equation 2. [sent-199, score-0.178]
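
A minimal sketch of this selection step, reusing the offline_value_estimate helper sketched after Equation (2); H, events, pi_hat, and tau are assumed to be given.

```python
def select_policy(H, events, pi_hat, tau):
    # Score every candidate hypothesis with the empirical value estimator,
    # then return the maximizer h_hat together with its estimated value.
    scores = [offline_value_estimate(events, h, pi_hat, tau) for h in H]
    best = max(range(len(H)), key=lambda i: scores[i])
    return H[best], scores[best]
```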

61 In other words, if H contains a very good policy that has little support under π, we will not be able to detect that by our estimator. [sent-204, score-0.571]

62 On the other hand, our estimation is safe in the sense that we will never drastically overestimate the value of any policy in H. [sent-205, score-0.639]

63 The first dataset consists of uniformly random exploration data, from which an unbiased estimate of any policy can be obtained. [sent-209, score-0.808]

64 The second dataset then demonstrates how policy optimization can be done from nonrandom offline data. [sent-211, score-0.646]

65 4.1 Experiment I: The first experiment involves news article recommendation in the “Today Module” on the Yahoo! front page. [sent-213, score-0.188]

66 For every user visit, this module displays a high-quality news article out of a small candidate pool, which is hand-picked by human editors. [sent-215, score-0.25]

67 This problem is modeled as a contextual bandit problem, where the context consists of both user and article information, the arms correspond to articles, and the reward of a displayed article is 1 if there is a click and 0 otherwise. [sent-218, score-0.739]

68 Therefore, the value of a policy is exactly its overall CTR. [sent-219, score-0.571]

69 7M events in the form of triples (x, a, r), where the context x contains user/article features, arm a was chosen uniformly at random from a dynamic candidate pool A, and r is a binary reward signal indicating whether the user clicked on a. [sent-224, score-0.376]

70 Furthermore, a straightforward application of Hoeffding's inequality guarantees that V̂^h_π̂ concentrates to V^h at the rate of O(1/√T) for any policy h, which is also verified empirically [9]. [sent-228, score-0.571]

71 To obtain non-random log data, we ran the LinUCB algorithm using the offline bandit simulation procedure, both from [8], on our random log data D0 and recorded events (x, a, r) for which LinUCB chose arm a for context x. [sent-231, score-0.349]

72 It is known that the set of recorded events has the same distribution as if we ran LinUCB on real user visits to Yahoo! [sent-234, score-0.247]

73 To define the policy h for evaluation, we used D0 to estimate each article’s overall CTR across all users, and then h was defined as selecting the article with highest estimated CTR. [sent-237, score-0.71]
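
A minimal sketch of this evaluation policy, assuming a simple data layout for the random log D0 and a per-visit candidate pool; the names are hypothetical.

```python
from collections import defaultdict

# Estimate each article's overall CTR from the uniformly random log D0,
# then always pick the candidate article with the highest estimate.
def highest_ctr_policy(d0_events):
    clicks, shows = defaultdict(float), defaultdict(int)
    for x, article, click in d0_events:          # D0 was logged uniformly at random
        shows[article] += 1
        clicks[article] += click
    ctr = {a: clicks[a] / shows[a] for a in shows}

    def h(candidate_pool):
        # Articles never seen in D0 get a default estimate of 0.
        return max(candidate_pool, key=lambda a: ctr.get(a, 0.0))

    return h
```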

74 Since the set A of articles changes over time (with news articles being added and old articles retiring), π(a|x) is very small due to the large number of articles over the two-week period, resulting in large variance. [sent-239, score-0.369]

75 For extremely small values of τ , however, there appears to be a consistent trend of over-estimating the policy value. [sent-247, score-0.571]

76 Note that the logging policy we used, π, violates one of the assumptions used to prove Lemma 3.1. [sent-249, score-0.863]

77 Namely, the exploration policy at timestep t must not depend on an earlier event. [sent-250, score-0.694]

78 The data are comprised of logs of events (x, a, y), where each event represents a visit by a user to a particular web page x, from a set of web pages X. [sent-256, score-0.45]

79 From a large set of advertisements A, the commercial system chooses a single ad. (Footnote 3: We could do so because we know A for every event in D0.) [sent-257, score-0.311]

80 Figure 2: Results of various algorithms on the ad display dataset. [sent-277, score-0.192]

81 It also chooses additional ads to display, but these were ignored in our test. [sent-286, score-0.227]

82 The output y is an indicator of whether the user clicked on the ad or not. [sent-287, score-0.255]

83 The test data contain 19 million events occurring after the events in the training data. [sent-290, score-0.28]

84 We trained a policy h to choose an ad, based on the current page, to maximize the probability of click. [sent-293, score-0.571]

85 For the purposes of learning, each ad and page was represented internally as a sparse high-dimensional feature vector. [sent-294, score-0.212]

86 Each ad contains, on average, 30 ad features and each page, approximately 50 page features. [sent-296, score-0.358]

87 The particular form of f was linear over features of its input (x, a) (see Footnote 4). The particular policy that was optimized had an argmax form: h(x) = argmax_{a∈C(X)} {f(x, a)}, with a crucial distinction from previous approaches in how f(x, a) was trained. [sent-297, score-0.628]

88 The training samples were of the form (x, a, y), where y = 1 if the ad a was clicked after being shown on page x or y = 0 otherwise. [sent-299, score-0.26]

89 As shown in the analysis of Section 3, this bias typically underestimates the true value of the policy h. [sent-304, score-0.616]

90 (Footnote 4) Technically, the feature vector that the regressor uses is the Cartesian product of the page and ad vectors. [sent-311, score-0.296]

91 The “Random” policy is the policy that chooses randomly from the set of feasible ads: Random(x) = a ∼ UNIF(C(X)), where UNIF(·) denotes the uniform distribution. [sent-319, score-1.252]

92 The “Naive” policy corresponds to the theoretically flawed supervised learning approach detailed in the introduction. [sent-320, score-0.571]

93 The evaluation of this policy is quite expensive, requiring one evaluation per ad per example, so the size of the test set is reduced to 8373 examples with a click, which reduces the significance of the results. [sent-321, score-0.801]

94 We bias the results towards the naive policy by choosing the chronologically first events in the test set. [sent-322, score-0.823]

95 Nevertheless, the naive policy receives 0 reward, which is significantly less than all other approaches. [sent-325, score-0.638]

96 A possible fear with the evaluation here is that the naive policy is always finding good ads that simply weren’t explored. [sent-326, score-0.797]

97 Indeed, the estimates for both the learned policy and the random policy improve when we decrease τ from 0. [sent-334, score-1.168]

98 5 Conclusion: We stated, justified, and evaluated, theoretically and empirically, the first method for solving the warm-start problem for exploration from logged data with controlled bias and estimation. [sent-342, score-0.333]

99 For example, in reinforcement learning, the standard approach to offline policy evaluation is based on importance weighted samples [3, 11]. [sent-348, score-0.651]

100 Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. [sent-383, score-0.23]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.571), ('logging', 0.292), ('ine', 0.197), ('ra', 0.171), ('bandit', 0.163), ('policies', 0.153), ('ad', 0.146), ('action', 0.143), ('events', 0.14), ('contextual', 0.137), ('yahoo', 0.132), ('actions', 0.125), ('exploration', 0.123), ('estimator', 0.121), ('ads', 0.117), ('logged', 0.117), ('reg', 0.111), ('ctr', 0.111), ('evaluator', 0.111), ('propensity', 0.111), ('chooses', 0.11), ('article', 0.093), ('regressor', 0.084), ('unif', 0.079), ('articles', 0.075), ('news', 0.069), ('naive', 0.067), ('income', 0.067), ('lihong', 0.067), ('linucb', 0.066), ('page', 0.066), ('click', 0.063), ('user', 0.061), ('langford', 0.059), ('argmax', 0.057), ('reward', 0.057), ('randomization', 0.056), ('event', 0.055), ('web', 0.05), ('warm', 0.048), ('clicked', 0.048), ('estimate', 0.046), ('recorded', 0.046), ('display', 0.046), ('bias', 0.045), ('announces', 0.044), ('nctr', 0.044), ('strehl', 0.044), ('pool', 0.043), ('displayed', 0.043), ('evaluation', 0.042), ('critical', 0.041), ('deterministic', 0.04), ('stochastic', 0.039), ('ex', 0.039), ('overestimate', 0.039), ('nonrandom', 0.039), ('xt', 0.039), ('importance', 0.038), ('front', 0.037), ('possibly', 0.037), ('day', 0.036), ('lemma', 0.036), ('dataset', 0.036), ('bandits', 0.036), ('lambert', 0.036), ('rh', 0.036), ('deterministically', 0.036), ('regret', 0.034), ('vh', 0.033), ('companies', 0.033), ('advertising', 0.033), ('unbiased', 0.032), ('historical', 0.032), ('argmaxa', 0.032), ('chu', 0.03), ('engine', 0.03), ('safe', 0.029), ('arms', 0.029), ('john', 0.029), ('predictor', 0.029), ('visit', 0.028), ('compete', 0.028), ('rb', 0.028), ('tuples', 0.028), ('robert', 0.028), ('expected', 0.028), ('chosen', 0.027), ('rewards', 0.027), ('world', 0.027), ('adequate', 0.027), ('module', 0.027), ('draws', 0.027), ('recommendation', 0.026), ('learned', 0.026), ('conditioned', 0.026), ('statement', 0.025), ('bound', 0.025), ('wei', 0.025), ('period', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

2 0.42001817 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

3 0.35453713 208 nips-2010-Policy gradients in linearly-solvable MDPs

Author: Emanuel Todorov

Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.

4 0.35265294 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum

Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1

5 0.35165253 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi

Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1

6 0.35109085 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

7 0.31986234 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

8 0.26481473 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

9 0.26205698 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

10 0.21888399 43 nips-2010-Bootstrapping Apprenticeship Learning

11 0.21852535 134 nips-2010-LSTD with Random Projections

12 0.17520756 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

13 0.17239675 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

14 0.17231023 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

15 0.16755262 183 nips-2010-Non-Stochastic Bandit Slate Problems

16 0.16603726 203 nips-2010-Parametric Bandits: The Generalized Linear Case

17 0.16299625 269 nips-2010-Throttling Poisson Processes

18 0.15927033 4 nips-2010-A Computational Decision Theory for Interactive Assistants

19 0.15152903 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

20 0.13965389 168 nips-2010-Monte-Carlo Planning in Large POMDPs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.268), (1, -0.556), (2, -0.069), (3, -0.074), (4, 0.006), (5, -0.029), (6, 0.026), (7, -0.039), (8, 0.004), (9, -0.114), (10, -0.001), (11, 0.012), (12, -0.025), (13, 0.021), (14, 0.0), (15, 0.009), (16, 0.067), (17, -0.037), (18, -0.026), (19, -0.022), (20, -0.002), (21, -0.027), (22, 0.021), (23, 0.051), (24, 0.001), (25, -0.03), (26, 0.065), (27, -0.017), (28, -0.019), (29, 0.087), (30, 0.059), (31, -0.031), (32, 0.019), (33, -0.039), (34, 0.041), (35, 0.002), (36, -0.023), (37, -0.063), (38, -0.048), (39, 0.06), (40, -0.009), (41, -0.032), (42, -0.077), (43, 0.086), (44, 0.031), (45, 0.028), (46, 0.033), (47, 0.054), (48, 0.042), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97361892 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

2 0.91169685 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

3 0.90319216 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

Author: Umar Syed, Robert E. Schapire

Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1

4 0.87794173 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi

Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1

5 0.8437233 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr

Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1

6 0.83528167 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

7 0.81748331 208 nips-2010-Policy gradients in linearly-solvable MDPs

8 0.7849077 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

9 0.69881409 43 nips-2010-Bootstrapping Apprenticeship Learning

10 0.6936413 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

11 0.67649353 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

12 0.64722121 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

13 0.60696971 4 nips-2010-A Computational Decision Theory for Interactive Assistants

14 0.6027559 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

15 0.57603705 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

16 0.56541222 134 nips-2010-LSTD with Random Projections

17 0.56313044 64 nips-2010-Distributionally Robust Markov Decision Processes

18 0.53301257 269 nips-2010-Throttling Poisson Processes

19 0.52267635 212 nips-2010-Predictive State Temporal Difference Learning

20 0.51527321 183 nips-2010-Non-Stochastic Bandit Slate Problems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.042), (17, 0.012), (27, 0.071), (30, 0.067), (39, 0.017), (45, 0.217), (50, 0.036), (52, 0.04), (60, 0.033), (67, 0.249), (77, 0.057), (78, 0.03), (90, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87787127 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions

Author: Sridhar Mahadevan, Bo Liu

Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. 1

same-paper 2 0.81537926 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

3 0.70741701 63 nips-2010-Distributed Dual Averaging In Networks

Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi

Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1

4 0.70734596 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma

Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1

5 0.70587814 282 nips-2010-Variable margin losses for classifier design

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1

6 0.70558351 155 nips-2010-Learning the context of a category

7 0.70514309 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

8 0.70305985 182 nips-2010-New Adaptive Algorithms for Online Classification

9 0.70233661 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

10 0.70203704 1 nips-2010-(RF)^2 -- Random Forest Random Field

11 0.70179588 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

12 0.70170361 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

13 0.70162559 243 nips-2010-Smoothness, Low Noise and Fast Rates

14 0.7015757 290 nips-2010-t-logistic regression

15 0.70085883 203 nips-2010-Parametric Bandits: The Generalized Linear Case

16 0.70079029 158 nips-2010-Learning via Gaussian Herding

17 0.70064652 280 nips-2010-Unsupervised Kernel Dimension Reduction

18 0.70047331 148 nips-2010-Learning Networks of Stochastic Differential Equations

19 0.7002179 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

20 0.69913214 172 nips-2010-Multi-Stage Dantzig Selector