nips nips2013 nips2013-1 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. [sent-4, score-0.404]
2 We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). [sent-5, score-0.289]
3 PSRL then follows the policy that is optimal for this sample during the episode. [sent-8, score-0.169]
4 The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. [sent-9, score-0.239]
5 We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. [sent-10, score-0.232]
6 This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. [sent-11, score-0.221]
7 We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. [sent-12, score-0.246]
8 1 Introduction We consider the classical reinforcement learning problem of an agent interacting with its environment while trying to maximize total reward accumulated over time [1, 2]. [sent-13, score-0.455]
9 The agent’s environment is modeled as a Markov decision process (MDP), but the agent is uncertain about the true dynamics of the MDP. [sent-14, score-0.25]
10 As the agent interacts with its environment, it observes the outcomes that result from previous states and actions, and learns about the system dynamics. [sent-15, score-0.214]
11 This leads to a fundamental tradeoff: by exploring poorly-understood states and actions the agent can learn to improve future performance, but it may attain better short-run performance by exploiting its existing knowledge. [sent-16, score-0.25]
12 To offset this, the majority of provably efficient learning algorithms use a principle known as optimism in the face of uncertainty [3] to encourage exploration. [sent-18, score-0.175]
13 In such an algorithm, each state and action is afforded an optimism bonus such that its value to the agent is modeled to be as high as is statistically plausible. [sent-19, score-0.454]
14 The agent will then choose a policy that is optimal under this “optimistic” model of the environment. [sent-20, score-0.339]
15 This incentivizes exploration since poorly-understood states and actions will receive a higher optimism bonus. [sent-21, score-0.263]
16 As the agent resolves its uncertainty, the effect of optimism is reduced and the agent’s behavior approaches optimality. [sent-22, score-0.31]
17 In fact, almost all reinforcement learning algorithms with polynomial bounds on sample complexity employ optimism to guide exploration. [sent-24, score-0.381]
18 At the start of each episode, the agent chooses a new policy, which it follows for the duration of the episode. [sent-27, score-0.192]
19 Posterior sampling for reinforcement learning (PSRL) selects this policy through two simple steps. [sent-28, score-0.393]
20 First, a single instance of the environment is sampled from the posterior distribution at the start of an episode. [sent-29, score-0.198]
21 Then, PSRL solves for and executes the policy that is optimal under the sampled environment over the episode. [sent-30, score-0.266]
22 PSRL randomly selects policies according to the probability they are optimal; exploration is guided by the variance of sampled policies as opposed to optimism. [sent-31, score-0.185]
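The two-step procedure just described can be summarized in a short sketch. The loop below is illustrative only: `posterior`, `env`, and `solve_finite_horizon` are hypothetical stand-ins for a posterior over MDPs, a simulator, and a finite-horizon planner, not objects defined in the paper.

```python
def run_psrl(posterior, env, solve_finite_horizon, num_episodes, tau):
    """Sketch of the PSRL loop: sample one MDP per episode, act greedily for it."""
    for k in range(num_episodes):
        sampled_mdp = posterior.sample()                 # one draw from the posterior over MDPs
        policy = solve_finite_horizon(sampled_mdp, tau)  # policy that is optimal for the sample
        state = env.reset()
        for i in range(tau):
            action = policy[i][state]                    # period-dependent action choice
            next_state, reward = env.step(action)
            posterior.update(state, action, reward, next_state)  # fold the observation in
            state = next_state
```

Note that no optimism bonus appears anywhere; the randomness of the posterior sample is the only source of exploration.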
23 Despite its long history, posterior sampling was largely neglected by the multi-armed bandit literature until empirical studies [10, 11] demonstrated that the algorithm could produce state of the art performance. [sent-34, score-0.178]
24 Our results suggest this method has great potential in reinforcement learning as well. [sent-36, score-0.166]
25 PSRL was originally introduced in the context of reinforcement learning by Strens [16] under the name “Bayesian Dynamic Programming”, where it appeared primarily as a heuristic method. [sent-37, score-0.166]
26 BOSS [18] introduces a more complicated version of PSRL that samples many MDPs, instead of just one, and then combines them into an optimistic environment to guide exploration. [sent-40, score-0.196]
27 BEB [17] adds an exploration bonus to states and actions according to how infrequently they have been visited. [sent-41, score-0.148]
28 We show it is not always necessary to introduce optimism via a complicated construction, and that the simple algorithm originally proposed by Strens [16] satisfies strong bounds itself. [sent-42, score-0.199]
29 Our work is motivated by several advantages of posterior sampling relative to optimistic algorithms. [sent-43, score-0.224]
30 Second, the presence of an explicit prior allows an agent to incorporate known environment structure in a natural way. [sent-45, score-0.266]
31 In any optimistic algorithm, performance is greatly influenced by the manner in which optimism is implemented. [sent-48, score-0.241]
32 We demonstrate through a computational study in Section 6 that PSRL outperforms the optimistic algorithm UCRL2 [4], a competitor with similar regret bounds, on some example MDPs. [sent-52, score-0.39]
33 We assume S, A, and τ are deterministic so the agent need not learn the state and action spaces or the time horizon. [sent-58, score-0.279]
34 A deterministic policy µ is a function mapping each state s ∈ S and period i = 1, . . . , τ to an action a ∈ A. [sent-59, score-0.204]
35 A policy µ is said to be optimal for MDP M if V^M_{µ,i}(s) = max_{µ'} V^M_{µ',i}(s) for all s ∈ S and i = 1, . . . , τ. [sent-67, score-0.169]
36 We will associate with each MDP M a policy µ^M that is optimal for M. [sent-71, score-0.169]
37 The reinforcement learning agent interacts with the MDP over episodes that begin at times tk = (k − 1)τ + 1, k = 1, 2, . [sent-72, score-0.529]
38 At each time t, the agent selects an action a_t, observes a scalar reward r_t, and then transitions to s_{t+1}. [sent-76, score-0.369]
39 If an agent follows a policy µ then when in state s at time t during episode k, it selects an action a_t = µ(s, t − t_k). [sent-77, score-0.67]
40 A reinforcement learning algorithm is a deterministic sequence {π_k | k = 1, 2, . . .} of functions, each mapping the history H_{t_k} to a probability distribution over policies. [sent-82, score-0.181]
41 At the start of the kth episode, the algorithm samples a policy µk from the distribution πk (Htk ). [sent-86, score-0.212]
42 The algorithm then selects actions a_t = µ_k(s_t, t − t_k) at times t during the kth episode. [sent-87, score-0.225]
43 We define the regret incurred by a reinforcement learning algorithm π up to time T to be Regret(T, π) := Σ_{k=1}^{T/τ} ∆_k, where ∆_k denotes regret over the kth episode, defined with respect to the MDP M* by ∆_k = Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M*}_{µ_k,1}(s)), with µ* = µ^{M*} and µ_k ∼ π_k(H_{t_k}). [sent-88, score-0.696]
44 Note that regret is not deterministic since it can depend on the random MDP M ∗ , the algorithm’s internal random sampling and, through the history Htk , on previous random transitions and random rewards. [sent-89, score-0.376]
45 We will assess and compare algorithm performance in terms of regret and its expectation. [sent-90, score-0.246]
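As a concrete reading of this definition, the per-episode regret could be evaluated as follows if an oracle had access to the true MDP's value functions (the agent never observes these; the helpers below are hypothetical evaluation utilities keyed by state via plain dictionaries, not part of the algorithm).

```python
def episode_regret(rho, v_opt, v_k):
    """Delta_k = sum_s rho(s) * (V^{M*}_{mu*,1}(s) - V^{M*}_{mu_k,1}(s))."""
    return sum(rho[s] * (v_opt[s] - v_k[s]) for s in rho)

def total_regret(per_episode_regrets):
    """Regret(T, pi) = sum of Delta_k over the T/tau episodes."""
    return sum(per_episode_regrets)
```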
46 3 Posterior sampling for reinforcement learning The use of posterior sampling for reinforcement learning (PSRL) was first proposed by Strens [16]. [sent-91, score-0.499]
47 PSRL begins with a prior distribution over MDPs with states S, actions A and horizon τ . [sent-92, score-0.156]
48 At the start of each kth episode, PSRL samples an MDP Mk from the posterior distribution conditioned on the history Htk available at that time. [sent-93, score-0.171]
49 PSRL then computes and follows the policy µ_k = µ^{M_k} over episode k. [sent-94, score-0.273]
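A concrete (and deliberately simplified) sketch of this sampling step is given below, under independent Dirichlet priors on the transition probabilities and Gaussian priors on the mean rewards; these conjugate priors and the fixed noise variance are modeling assumptions made for the example, not a prescription from the paper.

```python
import numpy as np

def sample_and_solve(transition_counts, reward_sum, reward_count, tau,
                     prior_alpha=1.0, prior_var=1.0, noise_var=1.0):
    """Sample one MDP from the posterior and return the optimal tau-horizon policy for it."""
    S, A, _ = transition_counts.shape

    # Posterior draw of transition probabilities: Dirichlet(alpha + counts) per (s, a).
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = np.random.dirichlet(prior_alpha + transition_counts[s, a])

    # Posterior draw of mean rewards under a N(0, prior_var) prior with known noise variance.
    post_var = 1.0 / (1.0 / prior_var + reward_count / noise_var)
    post_mean = post_var * (reward_sum / noise_var)
    R = np.random.normal(post_mean, np.sqrt(post_var))

    # Backward induction over the horizon tau for the sampled MDP.
    V = np.zeros(S)
    policy = np.zeros((tau, S), dtype=int)
    for i in reversed(range(tau)):
        Q = R + P @ V                 # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        policy[i] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

Here `transition_counts` has shape (S, A, S) and `reward_sum`, `reward_count` have shape (S, A); keeping these sufficient statistics up to date is the only bookkeeping needed between episodes.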
50 We believe that a posterior sampling approach offers some inherent advantages. [sent-102, score-0.14]
51 In addition, even if strong confidence bounds for V^{M*}_{µ,1}(s) were known, solving for the best optimistic policy may be computationally intractable. [sent-104, score-0.318]
52 Uncertainty about each policy is quantified in a statistically efficient way through the posterior distribution. [sent-108, score-0.256]
53 As such, we believe PSRL will be simpler to implement, computationally cheaper and statistically more efficient than existing optimistic methods. [sent-110, score-0.165]
54 3.1 Main results The following result establishes regret bounds for PSRL. [sent-112, score-0.289]
55 To accommodate this generality, the result bounds expected regret under the prior distribution (sometimes called Bayes risk or Bayesian regret). [sent-115, score-0.338]
56 We feel this is a natural measure of performance, but should emphasize that it is more common in the literature to bound regret under a worst-case MDP instance. [sent-116, score-0.246]
57 As shown in the appendix, this also bounds the frequentist regret for any MDP with non-zero probability. [sent-121, score-0.328]
58 Here UCRL2 gives regret bounds Õ(DS√(AT)) where D = max_{s'≠s} min_π E[T(s'|M, π, s)] and T(s'|M, π, s) is the first time step where s' is reached from s under the policy π. [sent-123, score-0.441]
59 In many practical applications we may be interested in episodic learning tasks where the constants D and Ψ could be improved to take advantage of the episode length τ. [sent-126, score-0.175]
60 Simple modifications to both UCRL2 and REGAL will produce regret bounds of Õ(τS√(AT)), just as PSRL. [sent-127, score-0.289]
61 We introduce σ(Htk ) as the σ-algebra generated by the history up to tk . [sent-131, score-0.134]
62 Readers unfamiliar with measure theory can think of this as “all information known just before the start of period t_k.” [sent-132, score-0.144]
63 Recall, we have defined ∆_k = Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M*}_{µ_k,1}(s)) to be the regret over period k. [sent-138, score-0.266]
64 A significant hurdle in analyzing this equation is its dependence on the optimal policy µ∗ , which we do not observe. [sent-139, score-0.169]
65 For many reinforcement learning algorithms, there is no clean way to relate the unknown optimal policy to the states and actions the agent actually observes. [sent-140, score-0.585]
66 First, define ∆̃_k = Σ_{s∈S} ρ(s)(V^{M_k}_{µ_k,1}(s) − V^{M*}_{µ_k,1}(s)) (3) as the difference in expected value of the policy µ_k under the sampled MDP M_k, which is known, and its performance under the true MDP M*, which is observed by the agent. [sent-142, score-0.202]
67 This result bounds the agent’s regret in episode k by the difference between the agent’s estimate V^{M_k}_{µ_k,1}(s_{t_k}) of the expected reward in M_k from the policy it chooses, and the expected reward V^{M*}_{µ_k,1}(s_{t_k}) in M*. [sent-148, score-0.585]
68 If the agent has a poor estimate of the MDP M ∗ , we expect it to learn as the performance of following µk under M ∗ differs from its expectation under Mk . [sent-149, score-0.17]
69 In the next section, we formalize these ideas and give a precise bound on the regret of posterior sampling. [sent-151, score-0.325]
70 5 Analysis An essential tool in our analysis will be the dynamic programming, or Bellman, operator T^M_µ, which for any MDP M = (S, A, R^M, P^M, τ, ρ), stationary policy µ : S → A and value function V : S → R, is defined by T^M_µ V(s) := R^M_{µ(s)}(s) + Σ_{s'∈S} P^M_{µ(s)}(s'|s) V(s'). [sent-152, score-0.173]
71 This operation returns the expected value of state s where we follow the policy µ under the laws of M, for one time step. [sent-153, score-0.206]
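For a tabular MDP this operator is a single vectorized backup. The sketch below assumes numpy arrays R of shape (S, A), P of shape (S, A, S), and an integer action vector mu of length S; these array conventions are ours, not the paper's.

```python
import numpy as np

def bellman_operator(R, P, mu, V):
    """Return (T^M_mu V)(s) = R(s, mu(s)) + sum_{s'} P(s'|s, mu(s)) * V(s') for every s."""
    states = np.arange(len(mu))
    return R[states, mu] + P[states, mu] @ V
```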
72 For any MDP M = (S, A, R^M, P^M, τ, ρ) and policy µ : S × {1, . . . , τ} → A, the value functions satisfy V^M_{µ,i} = T^M_{µ(·,i)} V^M_{µ,i+1} for i = 1, . . . , τ, with V^M_{µ,τ+1} := 0. [sent-156, score-0.152]
73 ∆̃_k = Σ_{i=1}^{τ} (T^{M_k}_{µ_k(·,i)} − T^{M*}_{µ_k(·,i)}) V^{M_k}_{µ_k,i+1}(s_{t_k+i}) + Σ_{i=1}^{τ} d_{t_k+i}, where d_{t_k+i} := Σ_{s'∈S} {P^{M*}_{µ_k(·,i)}(s'|s_{t_k+i}) (V^{M_k}_{µ_k,i+1} − V^{M*}_{µ_k,i+1})(s')} − (V^{M_k}_{µ_k,i+1} − V^{M*}_{µ_k,i+1})(s_{t_k+i}). [sent-168, score-0.128]
74 Crucially, (6) depends only on the Bellman error under the observed policy µ_k and the states s_1, . . . [sent-171, score-0.178]
75 We go on to show the posterior distribution of Mk concentrates around M ∗ as these actions are sampled, and so this term tends to zero. [sent-174, score-0.152]
76 In state s_t under policy µ_k, the expected value of (V^{M_k}_{µ_k,i+1} − V^{M*}_{µ_k,i+1})(s_{t_k+i}) is exactly Σ_{s'∈S} {P^{M*}_{µ_k(·,i)}(s'|s_{t_k+i})(V^{M_k}_{µ_k,i+1} − V^{M*}_{µ_k,i+1})(s')}. [sent-176, score-0.277]
77 5.2 Introducing confidence sets The last section reduced the algorithm’s regret to its expected Bellman error. [sent-179, score-0.263]
78 Finally, define N_{t_k}(s, a) = Σ_{t=1}^{t_k−1} 1{(s_t, a_t) = (s, a)} to be the number of times (s, a) was sampled prior to time t_k. [sent-183, score-0.156]
79 Now, using that ∆_k ≤ τ we can decompose regret as follows: Σ_{k=1}^{m} ∆̃_k ≤ Σ_{k=1}^{m} ∆̃_k 1{M_k, M* ∈ ℳ_k} + τ Σ_{k=1}^{m} [1{M_k ∉ ℳ_k} + 1{M* ∉ ℳ_k}] (7). Now, since M_k is σ(H_{t_k})-measurable, by Lemma 1, E[1{M_k ∉ ℳ_k} | H_{t_k}] = E[1{M* ∉ ℳ_k} | H_{t_k}]. [sent-186, score-0.246]
80 6 Simulation results We compare performance of PSRL to UCRL2 [4]: an optimistic algorithm with similar regret bounds. [sent-191, score-0.347]
81 We provide results both in the episodic case, where the state is reset every τ = 20 steps, and in the setting without episodic reset. [sent-193, score-0.167]
82 The agent begins at the far left state and at every time step has the choice to swim left or right. [sent-196, score-0.255]
83 The agent receives a small reward for reaching the leftmost state, but the optimal policy is to attempt to swim right and receive a much larger reward. [sent-198, score-0.426]
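A RiverSwim-style chain can be built in a few lines. The probabilities and rewards below are placeholders chosen only to reproduce the qualitative structure described above (swimming left always succeeds, swimming right usually fails, a small reward on the left, a much larger reward on the right); they are not the exact specification used in the experiments.

```python
import numpy as np

def make_riverswim(num_states=6, p_right=0.35, p_stay=0.6):
    """Illustrative RiverSwim-like chain MDP with transition tensor P and reward matrix R."""
    LEFT, RIGHT = 0, 1
    S = num_states
    P = np.zeros((S, 2, S))
    R = np.zeros((S, 2))
    for s in range(S):
        P[s, LEFT, max(s - 1, 0)] = 1.0                       # drifting left always succeeds
        P[s, RIGHT, min(s + 1, S - 1)] += p_right             # swimming right sometimes succeeds
        P[s, RIGHT, s] += p_stay                              # ... often stays put
        P[s, RIGHT, max(s - 1, 0)] += 1.0 - p_right - p_stay  # ... occasionally drifts back
    R[0, LEFT] = 0.005       # small reward at the leftmost state
    R[S - 1, RIGHT] = 1.0    # much larger reward at the rightmost state
    return P, R
```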
84 In both environments we perform 20 Monte Carlo simulations and compute the total regret over 10,000 time steps. [sent-202, score-0.273]
85 In Figure 2, we show regret through time across 50 Monte Carlo simulations to 100,000 time steps in the RiverSwim environment: PSRL’s outperformance is quite extreme. [sent-206, score-0.246]
86 Learning in MDPs without episodic resets The majority of practical problems in reinforcement learning can be mapped to repeated episodic interactions for some length τ. [sent-221, score-0.274]
87 Even in cases where there is no actual reset of episodes, one can show that PSRL’s regret is bounded against all policies which work over horizon τ or less [6]. [sent-222, score-0.325]
88 Instead of computing a new policy after a fixed number of periods, they begin a new episode when the total number of visits to any state-action pair has doubled. [sent-225, score-0.273]
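A minimal sketch of this doubling criterion, with illustrative variable names, is the following check run after every step in the non-episodic setting.

```python
import numpy as np

def should_recompute_policy(visit_counts, counts_at_last_update):
    """True once some state-action pair has at least doubled its visit count."""
    baseline = np.maximum(counts_at_last_update, 1)  # avoid a zero baseline for unvisited pairs
    return bool(np.any(visit_counts >= 2 * baseline))
```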
89 Using optimism with KL-divergence instead of L1 balls has also shown improved performance over UCRL2 [22], but its regret remains orders of magnitude larger than PSRL’s on RiverSwim. [sent-227, score-0.386]
90 Figure 2: Simulated regret on the ∞-horizon RiverSwim environment. (a) PSRL outperforms UCRL2 by large margins. (b) PSRL learns quickly despite a misspecified prior. [sent-228, score-0.278]
91 7 Conclusion We establish posterior sampling for reinforcement learning not just as a heuristic, but as a provably efficient learning algorithm. [sent-229, score-0.289]
92 We present Õ(τS√(AT)) Bayesian regret bounds, which are some of the first for an algorithm not motivated by optimism and are close to the state of the art for any reinforcement learning algorithm. [sent-230, score-0.607]
93 Compared to feasible optimistic algorithms we believe that PSRL is often more efficient statistically, simpler to implement and computationally cheaper. [sent-233, score-0.14]
94 We believe there is a strong case for the wider adoption of algorithms based upon posterior sampling in both theory and practice. [sent-235, score-0.14]
95 REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. [sent-269, score-0.166]
96 R-max: a general polynomial time algorithm for near-optimal reinforcement learning. [sent-276, score-0.183]
97 Optimism in reinforcement learning based on Kullback-Leibler divergence. [sent-377, score-0.166]
98 A Relating Bayesian to frequentist regret Let M be any family of MDPs with non-zero probability under the prior. [sent-380, score-0.285]
99 Then, for any ε > 0 and α > 1/2: P(Regret(T, π^{PS}_τ)/T^α > ε | M* ∈ M) → 0. This provides regret bounds even if M* is not distributed according to f. [sent-381, score-0.289]
100 As long as the true MDP is not impossible under the prior, we will have asymptotic frequentist regret close to the theoretical lower bound of O(√T) in its T-dependence. [sent-382, score-0.328]
wordName wordTfidf (topN-words)
[('psrl', 0.563), ('mk', 0.334), ('htk', 0.274), ('stk', 0.262), ('regret', 0.246), ('mdp', 0.237), ('agent', 0.17), ('reinforcement', 0.166), ('ntk', 0.156), ('policy', 0.152), ('optimism', 0.14), ('episode', 0.121), ('riverswim', 0.113), ('tk', 0.102), ('optimistic', 0.101), ('sat', 0.086), ('posterior', 0.079), ('episodes', 0.073), ('st', 0.071), ('bellman', 0.068), ('regal', 0.065), ('environment', 0.064), ('dtk', 0.064), ('ra', 0.059), ('action', 0.057), ('atk', 0.057), ('reward', 0.055), ('actions', 0.054), ('episodic', 0.054), ('thompson', 0.05), ('mdps', 0.048), ('sampling', 0.044), ('bounds', 0.043), ('exploration', 0.043), ('strens', 0.043), ('stanford', 0.041), ('dence', 0.04), ('nt', 0.04), ('frequentist', 0.039), ('transitions', 0.039), ('kth', 0.038), ('russo', 0.037), ('state', 0.037), ('sampled', 0.033), ('history', 0.032), ('osband', 0.032), ('swim', 0.032), ('prior', 0.032), ('selects', 0.031), ('di', 0.031), ('pa', 0.03), ('policies', 0.029), ('horizon', 0.028), ('environments', 0.027), ('sj', 0.026), ('kolter', 0.026), ('swimming', 0.026), ('states', 0.026), ('statistically', 0.025), ('bonus', 0.025), ('erence', 0.023), ('lemma', 0.023), ('bayesian', 0.022), ('reset', 0.022), ('computationally', 0.022), ('start', 0.022), ('agrawal', 0.021), ('dynamic', 0.021), ('guided', 0.02), ('period', 0.02), ('guarantees', 0.019), ('concentrates', 0.019), ('interacts', 0.018), ('encourage', 0.018), ('art', 0.018), ('bounding', 0.018), ('programming', 0.017), ('polynomial', 0.017), ('ps', 0.017), ('realized', 0.017), ('believe', 0.017), ('uncertainty', 0.017), ('optimal', 0.017), ('expected', 0.017), ('ca', 0.017), ('con', 0.017), ('rt', 0.017), ('decision', 0.016), ('arxiv', 0.016), ('complicated', 0.016), ('begins', 0.016), ('measurable', 0.015), ('deterministic', 0.015), ('rm', 0.015), ('conceptually', 0.015), ('guide', 0.015), ('min', 0.014), ('brafman', 0.014), ('burnetas', 0.014), ('burt', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
2 0.24746622 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
3 0.23943385 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari
Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1
5 0.18761253 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
6 0.17039888 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
7 0.16467176 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
8 0.16466576 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
9 0.15992409 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
10 0.15716341 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
11 0.15122584 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
12 0.1490777 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
13 0.12804979 347 nips-2013-Variational Planning for Graph-based MDPs
14 0.12312374 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
15 0.11994146 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
16 0.11840027 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
17 0.11738504 137 nips-2013-High-Dimensional Gaussian Process Bandits
18 0.11475965 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
19 0.11269473 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
20 0.11168395 348 nips-2013-Variational Policy Search via Trajectory Optimization
topicId topicWeight
[(0, 0.196), (1, -0.333), (2, -0.01), (3, 0.004), (4, -0.021), (5, -0.049), (6, 0.092), (7, -0.042), (8, -0.041), (9, -0.011), (10, 0.03), (11, 0.09), (12, 0.01), (13, -0.009), (14, 0.012), (15, 0.043), (16, 0.066), (17, -0.029), (18, 0.055), (19, -0.066), (20, -0.067), (21, 0.03), (22, 0.006), (23, 0.023), (24, 0.039), (25, 0.038), (26, -0.101), (27, -0.117), (28, 0.018), (29, 0.046), (30, -0.012), (31, -0.016), (32, 0.006), (33, 0.058), (34, -0.154), (35, 0.001), (36, -0.032), (37, -0.022), (38, 0.034), (39, 0.031), (40, -0.005), (41, -0.067), (42, -0.073), (43, 0.079), (44, 0.037), (45, -0.018), (46, 0.067), (47, 0.026), (48, -0.021), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.95759714 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
2 0.84336245 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet
Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.
3 0.83787149 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
4 0.8108151 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
5 0.70747864 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
6 0.68764555 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
7 0.66710621 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
8 0.64294434 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
9 0.64038861 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
10 0.63101596 347 nips-2013-Variational Planning for Graph-based MDPs
12 0.58634955 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
13 0.56319433 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
14 0.55934918 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.55044705 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
16 0.50661796 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
17 0.50557113 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
18 0.49113601 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
19 0.48466945 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
20 0.47582209 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
topicId topicWeight
[(3, 0.271), (16, 0.023), (33, 0.125), (34, 0.092), (41, 0.046), (49, 0.03), (56, 0.163), (70, 0.019), (85, 0.07), (89, 0.029), (93, 0.025), (95, 0.012)]
simIndex simValue paperId paperTitle
1 0.85720289 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
Author: Jinwoo Shin, Andrew E. Gelfand, Misha Chertkov
Abstract: Max-product ‘belief propagation’ (BP) is a popular distributed heuristic for finding the Maximum A Posteriori (MAP) assignment in a joint probability distribution represented by a Graphical Model (GM). It was recently shown that BP converges to the correct MAP assignment for a class of loopy GMs with the following common feature: the Linear Programming (LP) relaxation to the MAP problem is tight (has no integrality gap). Unfortunately, tightness of the LP relaxation does not, in general, guarantee convergence and correctness of the BP algorithm. The failure of BP in such cases motivates reverse engineering a solution – namely, given a tight LP, can we design a ‘good’ BP algorithm. In this paper, we design a BP algorithm for the Maximum Weight Matching (MWM) problem over general graphs. We prove that the algorithm converges to the correct optimum if the respective LP relaxation, which may include inequalities associated with non-intersecting odd-sized cycles, is tight. The most significant part of our approach is the introduction of a novel graph transformation designed to force convergence of BP. Our theoretical result suggests an efficient BP-based heuristic for the MWM problem, which consists of making sequential, “cutting plane”, modifications to the underlying GM. Our experiments show that this heuristic performs as well as traditional cutting-plane algorithms using LP solvers on MWM problems. 1
same-paper 2 0.79193801 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
3 0.78283501 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
Author: Isik B. Fidaner, Taylan Cemgil
Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.
4 0.7187115 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
5 0.65884554 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan
Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1
6 0.65626293 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
7 0.65506476 224 nips-2013-On the Sample Complexity of Subspace Learning
8 0.65480858 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
9 0.65255409 249 nips-2013-Polar Operators for Structured Sparse Estimation
10 0.65169078 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
11 0.65016329 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
12 0.64973611 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
13 0.64889616 89 nips-2013-Dimension-Free Exponentiated Gradient
14 0.64855838 184 nips-2013-Marginals-to-Models Reducibility
15 0.64795959 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
16 0.64775008 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
17 0.64755195 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
18 0.64746416 230 nips-2013-Online Learning with Costly Features and Labels
19 0.64696759 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
20 0.64693111 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions