nips nips2010 nips2010-201 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. [sent-5, score-0.291]
2 These bounds hold regardless of the correctness of the prior distribution. [sent-6, score-0.212]
3 We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. [sent-7, score-0.21]
4 Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. [sent-8, score-0.26]
5 1 Introduction Bayesian methods in machine learning, although elegant and concrete, have often been criticized not only for their computational cost, but also for their strong assumptions on the correctness of the prior distribution. [sent-9, score-0.159]
6 Both PAC and Bayesian methods have been proposed for reinforcement learning (RL) [2, 3, 4, 5, 6, 7, 8], where an agent is learning to interact with an environment to maximize some objective function. [sent-14, score-0.217]
7 Many of these methods aim to solve the so-called exploration–exploitation problem by balancing the amount of time spent on gathering information about the dynamics of the environment and the time spent acting optimally according to the currently built model. [sent-15, score-0.156]
8 We argue here that a more adaptive method can be PAC and at the same time more data efficient if an informative prior is taken into account. [sent-20, score-0.19]
9 This is achieved by removing the assumption of the correctness of the prior and, instead, measuring the consistency of the prior over the training data. [sent-24, score-0.271]
10 We derive two PAC-Bayesian bounds on the approximation error in the value function of stochastic policies for reinforcement learning on observable and discrete state spaces. [sent-28, score-0.364]
11 One is a bound on model-based RL where a prior distribution is given on the space of possible models. [sent-29, score-0.196]
12 In both cases, the bound depends both on an empirical estimate and a measure of distance between the stochastic policy and the one imposed by the prior distribution. [sent-31, score-0.545]
13 We present empirical results where model-selection is performed based on these bounds, and show that PAC-Bayesian bounds follow Bayesian policies when the prior is informative and mimic the PAC policies when the prior is not consistent with the data. [sent-32, score-0.545]
14 A Markov Decision Process (MDP) M = (S, A, T, R) is defined by a set of states S, a set of actions A, a transition function T(s, a, s') defined as: $T(s, a, s') = p(s_{t+1} = s' \mid s_t = s, a_t = a), \;\; \forall s, s' \in S,\, a \in A$, (1) and a (possibly stochastic) reward function $R(s, a): S \times A \to [R_{\min}, R_{\max}]$. [sent-35, score-0.232]
15 A reinforcement learning agent chooses an action and receives a reward. [sent-37, score-0.239]
16 The environment will then change to a new state according to the transition probabilities. [sent-38, score-0.246]
17 A policy is a (possibly stochastic) function from states to actions. [sent-39, score-0.186]
18 The value of a state–action pair (s, a) for policy π, denoted by $Q^\pi(s, a)$, is the expected discounted sum of rewards ($\sum_t \gamma^t r_t$) if the agent acts according to that policy after taking action a in the first step. [sent-40, score-0.59]
19 It satisfies the Bellman equation $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, Q^\pi(s', \pi(s'))$. (2) The optimal policy is the policy that maximizes the value function. [sent-42, score-0.42]
20 The optimal value of a state–action pair, denoted by $Q^*(s, a)$, satisfies the Bellman optimality equation [14]: $Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q^*(s', a')$. (3) [sent-43, score-0.192]
21 There are many methods developed to find the optimal policy for a given MDP when the transition and reward functions are known. [sent-44, score-0.407]
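To make the Bellman optimality equation (3) concrete, here is a minimal value-iteration sketch for a finite MDP with known T and R, one of the standard solution methods alluded to above. The array layout and the stopping tolerance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-8):
    """Solve Q*(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q*(s',a').

    T: transition probabilities, shape (S, A, S).
    R: expected rewards, shape (S, A).
    Returns the optimal Q-values and the greedy (optimal) policy.
    """
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    while True:
        # Bellman optimality backup, vectorized over all (s, a) pairs.
        Q_new = R + gamma * (T @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```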
22 3 Model-Based PAC-Bayesian Bound In model-based RL, one aims to estimate the transition and reward functions and then act optimally according to the estimated models. [sent-50, score-0.253]
23 This section provides a bound that suggests an adaptive method to choose a stochastic estimate between these two extremes, which is both data-efficient and has guaranteed performance. [sent-53, score-0.156]
24 Assuming that the reward model is known (we make this assumption throughout this section), one can build empirical models of the transition dynamics by gathering sample transitions, denoted by U, and taking the empirical average. [sent-54, score-0.486]
25 Let this empirical average model be $\hat{T}(s, a, s') = n_{s,a,s'} / n_{s,a}$, where $n_{s,a,s'}$ and $n_{s,a}$ are the number of corresponding transitions and samples. [sent-55, score-0.153]
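A minimal sketch of this frequentist estimate built from a batch of sampled transitions; the (s, a, s') tuple format and the uniform fallback for unvisited state–action pairs are assumptions made for illustration.

```python
import numpy as np

def empirical_transition_model(samples, S, A):
    """Build T_hat(s, a, s') = n_{s,a,s'} / n_{s,a} from (s, a, s') triples."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1.0
    n_sa = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution; this
    # fallback is an arbitrary choice for the sketch, not the paper's.
    T_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1.0), 1.0 / S)
    return T_hat, counts
```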
26 The empirical value function, denoted by $\hat{Q}$, is defined to be the value function on an MDP with the empirical transition model. [sent-57, score-0.384]
27 As one observes more and more sample trajectories on the MDP, the empirical model becomes increasingly accurate, and so does the empirical value function. [sent-58, score-0.234]
28 As a consequence of the above lemma, one can act near-optimally in the part of the MDP for which we have gathered enough samples to have a good empirical estimate of the transition model. [sent-63, score-0.271]
29 The downside of these methods is that without further assumptions on the model, it will take a large number of sample transitions to get a good empirical estimate of the transition model. [sent-65, score-0.361]
30 The Bayesian approach to modeling the transition dynamics, on the other hand, starts with a prior distribution over the transition probabilities and then conditions this prior on the data to get a posterior distribution. [sent-66, score-0.595]
31 This is usually done by assuming independent Dirichlet distributions over the transition probabilities, with some initial count vector α, and then adding the observed counts to this initial vector to get the conjugate posterior [6]. [sent-67, score-0.312]
32 The initial α-vector encodes the prior knowledge on the transition probabilities, and larger initial values bias the posterior further towards the initial belief. [sent-68, score-0.396]
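A sketch of the conjugate update just described: for each (s, a) pair, the posterior pseudo-counts are the prior α-vector plus the observed counts, and a full transition model can be sampled from the resulting Dirichlet distributions. Function and argument names are illustrative.

```python
import numpy as np

def sample_dirichlet_posterior(alpha, counts, seed=0):
    """Sample a transition model from independent Dirichlet posteriors.

    alpha:  prior pseudo-counts, shape (S, A, S); larger values mean a stronger prior.
    counts: observed transition counts n_{s,a,s'}, shape (S, A, S).
    Assumes alpha is strictly positive so every Dirichlet is well defined.
    """
    rng = np.random.default_rng(seed)
    posterior_alpha = alpha + counts  # conjugate update: add observed counts to the alpha-vector
    S, A, _ = counts.shape
    T_sample = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            T_sample[s, a] = rng.dirichlet(posterior_alpha[s, a])
    return T_sample
```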
33 If a strong prior is close to the true values, the Bayesian posterior will be more accurate than the empirical point estimate. [sent-69, score-0.293]
34 A good posterior distribution might be somewhere between the empirical point estimate and the Bayesian posterior. [sent-72, score-0.204]
35 The following theorem is the first PAC-Bayesian bound on the estimation error in the value function when we build a stochastic policy1 based on some arbitrary posterior distribution Mq . [sent-73, score-0.279]
36 Let $\pi^*_T$ be the optimal policy with respect to the MDP with transition model T, and let $\Delta_T = \lVert Q^{\pi^*_T} - \hat{Q}^{\pi^*_T} \rVert_\infty$. [sent-75, score-0.334]
37 For any prior distribution Mp on the transition model, any posterior Mq, and any i.i.d. [sent-76, score-0.326]
38 sampling distribution U, with probability no less than 1 − δ over the sampling of U ∼ U: $\forall M_q:\;\; \mathbb{E}_{T \sim M_q}[\Delta_T] \le \sqrt{\dfrac{D(M_q \,\|\, M_p) - \ln\delta + |S|\ln 2 + \ln|S| + \ln n_{\min}}{(n_{\min} - 1)\, k^2 / 2}}$, (6) where $n_{\min} = \min_{s,a} n_{s,a}$ and $D(\cdot\,\|\,\cdot)$ denotes the KL divergence. [sent-79, score-1.324]
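A small helper that evaluates the right-hand side of (6) for a candidate posterior; it is used again in the model-selection sketches below. The KL divergence and the constant k from Lemma 1 are treated as inputs, so this is only a sketch of how the bound would be evaluated, not code from the paper.

```python
import numpy as np

def model_based_pac_bayes_bound(kl_q_p, n_min, n_states, k, delta=0.05):
    """Right-hand side of Eq. (6): an upper bound on E_{T ~ Mq}[Delta_T].

    kl_q_p:   D(M_q || M_p), KL divergence of the posterior from the prior.
    n_min:    min_{s,a} n_{s,a}, the smallest sample count over state-action pairs.
    n_states: |S|.
    k:        the constant from Lemma 1 relating model error to value error.
    """
    numerator = kl_q_p - np.log(delta) + n_states * np.log(2) + np.log(n_states) + np.log(n_min)
    denominator = (n_min - 1) * k ** 2 / 2.0
    return np.sqrt(numerator / denominator)
```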
39 The above theorem (proved in the Appendix) provides a lower bound on the expectation of the true value function when the policy is taken to be optimal according to the model sampled from the posterior: $\mathbb{E}\big[Q^{\pi^*_T}\big] \ge \mathbb{E}\big[\hat{Q}^{\pi^*_T}\big] - \tilde{O}\Big(\sqrt{D(M_q \| M_p)/n_{\min}}\Big)$. (7) [sent-82, score-0.376]
40 This lower bound suggests a stochastic model-selection method in which one searches in the space of posteriors to maximize the bound. [sent-83, score-0.199]
41 One is the PAC part of the bound, which suggests selecting models with high empirical value functions for their optimal policy. [sent-85, score-0.223]
42 1 This is a more general form of stochastic policy than is usually seen in the RL literature. [sent-87, score-0.235]
43 A complete policy is sampled from an imposed distribution, correlating the selection of actions on different states. [sent-88, score-0.247]
44 Generally, this will result in a bound on the value of a stochastic policy. [sent-90, score-0.157]
45 However, if the optimal policy is the same for all of the possible samples from the posterior, then we will get a bound for that particular deterministic policy. [sent-91, score-0.327]
46 We define the support of policy π, denoted by Tπ , to be the set of transition models for which the optimal policy is π. [sent-92, score-0.55]
47 Putting all the posterior probability on Tπ will result in a tighter bound for the value of the policy π. [sent-93, score-0.403]
48 The tightest bound occurs when Mq is a scaled version of Mp summing to 1 over Tπ, that is, when we have: $M_q(T) = \begin{cases} M_p(T)/M_p(T_\pi) & T \in T_\pi \\ 0 & T \notin T_\pi \end{cases}$ (8) In that case, the KL divergence is $D(M_q \| M_p) = -\ln M_p(T_\pi)$, and the bound will be: $\mathbb{E}\big[Q^{\pi^*_T}\big] \ge \mathbb{E}\big[\hat{Q}^{\pi^*_T}\big] - \tilde{O}\Big(\sqrt{-\ln M_p(T_\pi)/n_{\min}}\Big)$. (9) [sent-94, score-0.334]
49 Intuitively, we will get tighter bounds for policies that have larger empirical values and higher prior probabilities supporting them. [sent-95, score-0.368]
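Computing M_p(T_π) exactly is generally hard, but a simple Monte-Carlo sketch shows how (9) could be used in practice: sample transition models from the prior, count how often their optimal policy equals π, and plug the estimated support mass into the bound. `value_iteration` refers to the earlier sketch; the sample size and the clipping away from zero are illustrative choices.

```python
import numpy as np

def estimate_support_mass(sample_prior_model, R, pi, gamma=0.95, n_samples=500):
    """Monte-Carlo estimate of M_p(T_pi): the prior probability that pi is optimal."""
    hits = 0
    for _ in range(n_samples):
        T = sample_prior_model()                      # one transition model drawn from the prior
        _, greedy_policy = value_iteration(T, R, gamma)
        hits += int(np.array_equal(greedy_policy, pi))
    return max(hits, 1) / n_samples                   # clip away from 0 so -log stays finite

# The penalty in (9) is then roughly sqrt(-np.log(estimate_support_mass(...)) / n_min).
```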
50 Therefore, we define a notion of margin for transition functions and policies and use it to get tractable bounds. [sent-97, score-0.254]
51 The margin of a transition function T, denoted by $\theta_T$, is the maximum distance we can move away from T such that the optimal policy does not change: $\lVert T' - T \rVert_1 \le \theta_T \;\Rightarrow\; \pi^*_{T'} = \pi^*_T$. (10) [sent-98, score-0.42]
52 The margin defines a hypercube around T for which the optimal policy does not change. [sent-99, score-0.275]
53 In cases where the support set of a policy is difficult to find, one can use this hypercube to get a reasonable bound for the true value function of the corresponding policy. [sent-100, score-0.355]
54 In that case, we would define the posterior to be the scaled prior defined only on the margin hypercube. [sent-101, score-0.239]
55 To find the margin of any given T, if we know the value of the second-best policy, we can calculate its regret with respect to T (this is the smallest regret, $\eta_{\min}$). [sent-104, score-0.303]
56 Using Lemma 1, we can conclude that if $\lVert T' - T \rVert_1 \le k\eta_{\min}/2$, then the values of the best and second-best policies can change by at most $\eta_{\min}/2$, and thus the optimal policy will not change. [sent-105, score-0.294]
57 One can then define the posterior on the transitions inside the margin to get a bound for the value function. [sent-107, score-0.33]
58 4 Model-Free PAC-Bayes Bound In this section we introduce a PAC-Bayesian bound for model-free reinforcement learning on discrete state spaces. [sent-108, score-0.262]
59 This time we assume that we are given a prior distribution on the space of value functions, rather than on transition models. [sent-109, score-0.26]
60 This prior encodes an initial belief about the optimal value function for a given RL domain. [sent-110, score-0.183]
61 This could be useful, for example, in the context of transfer learning, where one has learned a value function in one environment and then uses that as the prior belief on a similar domain. [sent-111, score-0.195]
62 When we only have access to a sample set U collected on the RL domain, we can define the empirical Bellman optimality operator $\hat{B}$ to be: $\hat{B}Q(s, a) = \frac{1}{n_{s,a}} \sum_{(s, a, s', r) \in U} \big( r + \gamma \max_{a'} Q(s', a') \big)$. (11) Note that $\mathbb{E}[\hat{B}Q] = BQ$. [sent-114, score-0.175]
63 Using this assumption, one can use Hoeffding’s inequality to bound the difference between the empirical and true Bellman operators: $\Pr\{|\hat{B}Q(s, a) - BQ(s, a)| > \epsilon\} \le e^{-2 n_{s,a} \epsilon^2 / c^2}$. [sent-116, score-0.205]
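A sketch of the empirical Bellman optimality operator in (11) for a tabular Q; the (s, a, s', r) batch format and the decision to leave Q unchanged at unvisited pairs are assumptions of the sketch.

```python
import numpy as np

def empirical_bellman_operator(Q, samples, gamma=0.95):
    """Apply B_hat: for each (s, a), average r + gamma * max_a' Q(s', a') over its samples."""
    S, A = Q.shape
    totals = np.zeros((S, A))
    n_sa = np.zeros((S, A))
    for s, a, s_next, r in samples:
        totals[s, a] += r + gamma * Q[s_next].max()
        n_sa[s, a] += 1.0
    # Where no samples exist, keep the old Q value (an arbitrary choice here).
    return np.where(n_sa > 0, totals / np.maximum(n_sa, 1.0), Q)
```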
64 Q-learning [14] and its variants with function approximation [17], and also batch methods such as LSTD [18], often aim to minimize the empirical (projected) TD error. [sent-118, score-0.156]
65 The following theorem (proved in the Appendix) is the first PAC-Bayesian bound for model-free batch RL on discrete state spaces: Theorem 3. [sent-121, score-0.221]
66 For all prior distributions Jp and posteriors Jq over the space of value functions, with probability no less than 1 − δ over the sampling of U ∼ U: $\forall J_q:\;\; \mathbb{E}_{Q \sim J_q}[\Delta_Q] \le \sqrt{\dfrac{D(J_q \,\|\, J_p) - \ln\delta + \ln|S| + \ln|A| + \ln n_{\min}}{2\,(n_{\min} - 1)\,(1-\gamma)^2 / c^2}}$. [sent-123, score-1.03]
67 The PAC side of the bound guides this model-selection method to look for posteriors with smaller empirical TD error. [sent-126, score-0.241]
68 The Bayesian part, on the other hand, penalizes the selection of posteriors that are far from the prior distribution. [sent-127, score-0.206]
69 The last state has a reward of 1 and all other states have reward 0. [sent-134, score-0.209]
70 One is a stochastic “forward” operation which moves us to the next state in the chain with probability 0. [sent-136, score-0.162]
71 The second type is a stochastic “reset” which moves the system to the first state in the chain with probability 0. [sent-138, score-0.162]
72 In this domain, each state has two actions that perform a stochastic reset and one action that performs a stochastic forward move. [sent-140, score-0.309]
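A sketch of the chain domain used in this experiment. The forward/reset success probabilities are truncated in the extracted text above, so they are left as free parameters, and the failure behavior (a failed forward acts as a reset and vice versa, as in the standard chain task) as well as the chain length are assumptions, not values from the paper.

```python
import numpy as np

def make_chain_mdp(n_states=5, p_forward=0.8, p_reset=0.8):
    """Chain MDP with one stochastic forward action (0) and two stochastic reset actions (1, 2).

    Reward is 1 in the last state and 0 elsewhere (attached to the state, for all actions).
    """
    S, A = n_states, 3
    T = np.zeros((S, A, S))
    for s in range(S):
        nxt = min(s + 1, S - 1)
        T[s, 0, nxt] += p_forward          # forward succeeds
        T[s, 0, 0] += 1.0 - p_forward      # forward fails: reset to the first state
        for a in (1, 2):
            T[s, a, 0] += p_reset          # reset succeeds
            T[s, a, nxt] += 1.0 - p_reset  # reset fails: move forward instead
    R = np.zeros((S, A))
    R[S - 1, :] = 1.0
    return T, R
```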
73 When there are only a small number of sample transitions for each state–action pair, there is a high chance that the frequentist estimate confuses a reset action with a forward action. [sent-143, score-0.258]
74 We define our good prior to have α-vectors proportional to the true transition probabilities. [sent-147, score-0.236]
75 A misleading prior is one whose α-vector is proportional to a transition model in which the forward and reset actions are switched. [sent-148, score-0.319]
76 The empirical method uses the optimal policy with respect to the empirical models. [sent-151, score-0.392]
77 The Bayesian method samples a transition model from the Bayesian Dirichlet posteriors (obtained by adding the observed counts to the prior α-vectors) and then uses the optimal policy with respect to the sampled model. [sent-152, score-0.557]
78 The PAC-Bayesian method instead selects a posterior distribution by maximizing the bound; it then samples from that distribution and uses the optimal policy with respect to the sampled model. [sent-156, score-0.236]
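The excerpt does not spell out how the PAC-Bayesian method searches over posteriors, so the sketch below simply scores a finite list of candidate posteriors, each summarized by an estimate of its expected empirical value and its KL divergence from the prior, and keeps the one that maximizes the lower bound of (7). It reuses `model_based_pac_bayes_bound` from the earlier sketch; all names and the dictionary layout are illustrative.

```python
def pac_bayesian_model_selection(candidates, n_min, n_states, k, delta=0.05):
    """Pick the candidate posterior maximizing (empirical value) - (bound penalty), as in Eq. (7).

    Each candidate is a dict with:
      'empirical_value': estimate of E_{T ~ Mq}[Q_hat^{pi*_T}], e.g. averaged over sampled models
      'kl':              D(M_q || M_p)
      'sample_model':    callable returning one transition model T ~ M_q
    """
    def penalty(kl):
        return model_based_pac_bayes_bound(kl, n_min, n_states, k, delta)

    best = max(candidates, key=lambda c: c['empirical_value'] - penalty(c['kl']))
    return best['sample_model']()  # act with the optimal policy of this sampled model
```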
79 Figure 1 (left) compares the maximum regret of these methods for different sample sizes when the prior is informative. [sent-158, score-0.261]
80 This time, the regret rate of the PAC-Bayesian method follows that of the empirical method. [sent-164, score-0.212]
81 Figure 1 (right) shows how the PAC-Bayesian method switches between following the empirical estimate and the Bayesian posterior as the prior gradually changes from being misleading to informative (four sample transitions per state–action pair). [sent-165, score-0.674]
82 An agent moves around in a grid world of size 5×9 containing puddles with reward −1, an absorbing goal state with reward +1, and reward 0 for the remaining states. [sent-198, score-0.375]
83 We first learn the true value function of a known prior map of the world (Figure 2, left). [sent-204, score-0.174]
84 We expect the prior to be informative and useful in this case. [sent-207, score-0.19]
85 We sample from this posterior and act according to its greedy policy. [sent-232, score-0.151]
86 05) and compare it with the performance of the empirical policy and a semi-Bayesian policy that acts according to a sampled value from the Bayesian posterior. [sent-236, score-0.533]
87 Again, it can be seen that the PAC-Bayesian method makes use of the prior (with higher values of λ) when the prior is informative, and otherwise follows the empirical estimate (smaller values of λ). [sent-238, score-0.338]
88 6 Discussion This paper introduces the first set of PAC-Bayesian bounds for the batch RL problem in finite state spaces. [sent-240, score-0.176]
89 Our empirical results show that PAC-Bayesian model-selection uses prior distributions when they are informative and useful, and ignores them when they are misleading. [sent-242, score-0.26]
90 For the model-based bound, we expect the running time of searching in the space of parametrized posteriors to increase rapidly with the size of the state space. [sent-243, score-0.15]
91 A more scalable version would sample models around the posteriors, solve each model, and then use importance sampling to estimate the value of the bound for each possible posterior. [sent-244, score-0.182]
92 such that $Q^{(n)}$, $P^{(n)}$ and $\Delta^{(n)}$ satisfy the condition of the lemma and $\mathbb{E}_Q[\Delta] = \lim_{n\to\infty} \sum_{i=1}^{n} Q_i^{(n)} \Delta_i^{(n)}$, $\; D(Q \,\|\, P) = \lim_{n\to\infty} \sum_{i=1}^{n} Q_i^{(n)} \ln \frac{Q_i^{(n)}}{P_i^{(n)}}$. (16) [sent-262, score-0.163]
93 We will then take the limit of the conclusion of the lemma to get a bound for the continuous case [21]. [sent-263, score-0.197]
94 With probability no less than 1 − δ over the sampling: $\mathbb{E}_{T \sim M_p}\big[e^{\frac{1}{2}(n_{\min}-1)k^2\Delta_T^2}\big] \le \frac{1}{\delta}\,|S|\,2^{|S|}\, n_{\min}$. [sent-266]
95 To prove Lemma 5, it suffices to prove the following, then swap the expectations and apply Markov’s inequality: $\mathbb{E}_{T \sim M_p}\,\mathbb{E}_{U \sim \mathcal{U}}\big[e^{\frac{1}{2}(n_{\min}-1)k^2\Delta_T^2}\big] \le |S|\,2^{|S|}\, n_{\min}$. [sent-269]
96 The tail probability of $\Delta_T$ then follows by a union bound over states: $\Pr\{\Delta_T > \epsilon\} \le \sum_{s} 2^{|S|}\, e^{-\frac{1}{2} n_{s,a_s} (k\epsilon)^2} \le |S|\,2^{|S|}\, e^{-\frac{1}{2} n_{\min} (k\epsilon)^2}$. (19) [sent-274]
97 We choose to maximize $\mathbb{E}_{U \sim \mathcal{U}}\big[e^{\frac{1}{2}(n_{\min}-1)k^2\Delta_T^2}\big]$, subject to $\Pr\{\Delta_T \ge \epsilon\} \le |S|\,2^{|S|}\, e^{-\frac{1}{2} n_{\min} (k\epsilon)^2}$. [sent-277]
98 The maximizing density for $\Delta_T$ is: $f(\Delta) = |S|\,2^{|S|}\, k^2 n_{\min}\, \Delta\, e^{-\frac{1}{2} n_{\min} k^2 \Delta^2}$. (21) [sent-281]
99 We thus get: $\mathbb{E}_{U \sim \mathcal{U}}\big[e^{\frac{1}{2}(n_{\min}-1)k^2\Delta_T^2}\big] \le \int_0^\infty e^{\frac{1}{2}(n_{\min}-1)k^2\Delta^2} f(\Delta)\, d\Delta = \int_0^\infty |S|\,2^{|S|}\, k^2 n_{\min}\, \Delta\, e^{-\frac{1}{2} k^2 \Delta^2}\, d\Delta \le |S|\,2^{|S|}\, n_{\min}$. (22) [sent-282]
100 A Bayesian sampling approach to exploration in reinforcement learning. [sent-424, score-0.175]
wordName wordTfidf (topN-words)
[('bq', 0.588), ('nmin', 0.473), ('pac', 0.207), ('policy', 0.186), ('mq', 0.173), ('rl', 0.138), ('bayesian', 0.127), ('transition', 0.124), ('regret', 0.121), ('mp', 0.121), ('reinforcement', 0.115), ('prior', 0.112), ('jq', 0.104), ('bellman', 0.093), ('empirical', 0.091), ('posterior', 0.09), ('bound', 0.084), ('ln', 0.083), ('action', 0.081), ('lemma', 0.08), ('reward', 0.073), ('pr', 0.073), ('eu', 0.072), ('eq', 0.072), ('mdp', 0.068), ('posteriors', 0.066), ('state', 0.063), ('transitions', 0.062), ('policies', 0.06), ('environment', 0.059), ('informative', 0.057), ('qi', 0.055), ('bounds', 0.053), ('eqn', 0.052), ('puddle', 0.052), ('stochastic', 0.049), ('td', 0.048), ('misleading', 0.048), ('jp', 0.048), ('correctness', 0.047), ('agent', 0.043), ('batch', 0.042), ('margin', 0.037), ('exploration', 0.037), ('actions', 0.035), ('mcgill', 0.035), ('cmax', 0.035), ('get', 0.033), ('act', 0.033), ('priors', 0.033), ('optimality', 0.033), ('reset', 0.032), ('theorem', 0.032), ('cmin', 0.032), ('frequentist', 0.032), ('moves', 0.031), ('denoted', 0.03), ('inequality', 0.03), ('mcallester', 0.029), ('exploitation', 0.028), ('hypercube', 0.028), ('gathering', 0.028), ('sample', 0.028), ('penalizes', 0.028), ('forms', 0.027), ('sampled', 0.026), ('hoeffding', 0.026), ('nity', 0.026), ('contraction', 0.024), ('spent', 0.024), ('optimal', 0.024), ('value', 0.024), ('margins', 0.023), ('derivations', 0.023), ('operator', 0.023), ('sampling', 0.023), ('optimistic', 0.023), ('dirichlet', 0.023), ('initial', 0.023), ('estimate', 0.023), ('montreal', 0.022), ('loose', 0.022), ('adaptively', 0.022), ('dynamics', 0.021), ('expect', 0.021), ('pi', 0.021), ('argue', 0.021), ('concludes', 0.021), ('appendix', 0.02), ('domain', 0.02), ('acts', 0.02), ('pair', 0.02), ('counts', 0.019), ('map', 0.019), ('world', 0.019), ('tighter', 0.019), ('move', 0.019), ('gradually', 0.019), ('chain', 0.019), ('introduces', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
2 0.20230143 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
3 0.18984991 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
4 0.18023185 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
5 0.17520756 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
6 0.16087356 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
7 0.15602547 208 nips-2010-Policy gradients in linearly-solvable MDPs
8 0.15108481 134 nips-2010-LSTD with Random Projections
9 0.14910035 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
10 0.14587985 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
11 0.13908429 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
12 0.12939584 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.12411017 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
14 0.11803778 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
15 0.10757613 43 nips-2010-Bootstrapping Apprenticeship Learning
16 0.10722224 4 nips-2010-A Computational Decision Theory for Interactive Assistants
17 0.10593498 229 nips-2010-Reward Design via Online Gradient Ascent
18 0.096661448 192 nips-2010-Online Classification with Specificity Constraints
19 0.094440967 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
20 0.093793571 222 nips-2010-Random Walk Approach to Regret Minimization
topicId topicWeight
[(0, 0.212), (1, -0.298), (2, 0.018), (3, 0.008), (4, -0.024), (5, 0.026), (6, 0.001), (7, 0.003), (8, -0.041), (9, 0.02), (10, -0.005), (11, -0.017), (12, -0.063), (13, -0.013), (14, -0.045), (15, 0.013), (16, 0.009), (17, 0.039), (18, 0.068), (19, 0.04), (20, -0.005), (21, -0.035), (22, -0.037), (23, 0.017), (24, -0.004), (25, 0.068), (26, -0.025), (27, -0.067), (28, 0.027), (29, 0.057), (30, -0.007), (31, -0.069), (32, -0.041), (33, -0.07), (34, 0.06), (35, 0.005), (36, -0.025), (37, 0.029), (38, -0.048), (39, -0.035), (40, 0.001), (41, -0.053), (42, 0.008), (43, -0.002), (44, 0.041), (45, 0.04), (46, -0.01), (47, 0.024), (48, 0.122), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.93153399 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
2 0.76829749 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
3 0.74589825 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
4 0.71421289 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
5 0.70558423 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
6 0.69578785 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
7 0.6940459 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
8 0.6823436 183 nips-2010-Non-Stochastic Bandit Slate Problems
9 0.66503781 4 nips-2010-A Computational Decision Theory for Interactive Assistants
10 0.65173608 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
11 0.64915055 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
12 0.63752967 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
13 0.62098396 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
14 0.61130905 64 nips-2010-Distributionally Robust Markov Decision Processes
15 0.59822643 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
16 0.58889812 134 nips-2010-LSTD with Random Projections
17 0.57964551 208 nips-2010-Policy gradients in linearly-solvable MDPs
18 0.57851326 43 nips-2010-Bootstrapping Apprenticeship Learning
19 0.54572773 229 nips-2010-Reward Design via Online Gradient Ascent
20 0.5340873 168 nips-2010-Monte-Carlo Planning in Large POMDPs
topicId topicWeight
[(13, 0.036), (24, 0.23), (27, 0.058), (30, 0.079), (39, 0.023), (45, 0.222), (50, 0.069), (52, 0.029), (60, 0.052), (77, 0.041), (78, 0.016), (90, 0.037)]
simIndex simValue paperId paperTitle
1 0.8981446 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
2 0.83994371 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
same-paper 3 0.82135153 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
4 0.79704773 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
5 0.74593383 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
6 0.74496889 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
7 0.74396676 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
8 0.74293298 243 nips-2010-Smoothness, Low Noise and Fast Rates
9 0.74235266 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
10 0.74170524 155 nips-2010-Learning the context of a category
11 0.74081886 287 nips-2010-Worst-Case Linear Discriminant Analysis
12 0.74041361 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
13 0.74019498 215 nips-2010-Probabilistic Deterministic Infinite Automata
14 0.74012858 148 nips-2010-Learning Networks of Stochastic Differential Equations
15 0.73989016 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
16 0.73936945 280 nips-2010-Unsupervised Kernel Dimension Reduction
17 0.7370255 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
18 0.73693192 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
19 0.73679197 282 nips-2010-Variable margin losses for classifier design
20 0.73543411 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs