nips nips2013 nips2013-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. [sent-8, score-0.65]
2 We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. [sent-9, score-0.684]
3 The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). [sent-10, score-1.23]
4 In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. [sent-11, score-0.375]
5 Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. [sent-14, score-0.704]
6 A new generation of algorithms based on look-ahead tree search techniques has brought a breakthrough in practical performance on planning problems with large state spaces. [sent-17, score-0.55]
7 Techniques based on planning trees such as Monte Carlo tree search [4, 13], and in particular the UCT algorithm (UCB applied to Trees, see [12]), have made it possible to tackle large-scale problems such as the game of Go [7]. [sent-18, score-0.646]
8 These methods exploit the fact that, in order to decide on an action at a given state, it is not necessary to build an estimate of the value function everywhere. [sent-19, score-0.194]
9 We propose a new algorithm for planning in Markov Decision Problems (MDPs). [sent-21, score-0.358]
10 We assume that a limited budget of calls to a randomized simulator for the MDP (the generative model in [11]) is available for exploring the consequences of actions before making a decision. [sent-22, score-0.455]
11 The intuition behind our algorithm is to achieve a high exploration depth in the look-ahead trees by planning in fixed realizations of the MDP, and to achieve the necessary exploration width by aggregating a forest of planning trees (forming an approximation of the MDP from many realizations). [sent-23, score-1.411]
12 Each of the trees is developed around the state for which a decision has to be made, according to the principle of optimism in the face of uncertainty [13] combined with a safety principle. [sent-24, score-0.405]
13 We provide a finite-sample analysis depending on the budget, split into the number of trees and the number of node expansions in each tree. [sent-25, score-0.191]
14 We show that our algorithm is consistent and that it identifies the optimal action when given a sufficiently large budget. [sent-26, score-0.214]
15 We provide numerical results on the inverted pendulum benchmark in section 6. [sent-32, score-0.266]
16 See [13] for a detailed review of the optimistic principle applied to planning and optimization. [sent-36, score-0.679]
17 A Bayesian adaptation of OP-MDP has also been proposed [6] for planning in the context where the MDP is unknown. [sent-38, score-0.358]
18 Our contribution is also related to [5], where random ensembles of state-action independent disturbance scenarios are built, the planning problem is solved for each scenario, and a decision is made based on majority voting. [sent-39, score-0.47]
19 3 Formalization. Let (S, A, p, r, γ) be a Markov decision process (MDP), where the sets S and A respectively denote the state space and the finite action space, with |A| > 1, of the MDP. [sent-41, score-0.337]
20 When an action a ∈ A is selected in state s ∈ S of the MDP, it transitions to a successor state s′ ∈ S(s, a) with probability p(s′|s, a). [sent-42, score-0.564]
21 We further assume that every successor state set S(s, a) is finite and that its cardinality is bounded by K ∈ N. [sent-43, score-0.273]
22 While the transition probabilities may be unknown, it is assumed that a randomized simulator is available, which, given a state-action pair (s, a), outputs a successor state s′ ∼ p(·|s, a). [sent-45, score-0.525]
23 In this paper we consider the problem of planning under a budget constraint: only a limited number of samples may be drawn using the simulator. [sent-47, score-0.548]
24 Define the value function of the policy π in a state s as the discounted sum of expected rewards: $v^\pi : S \to \mathbb{R},\ v^\pi : s \mapsto \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t), s_{t+1}) \mid s_0 = s\right]$ (1), where the constant γ ∈ (0, 1) is called the discount factor. [sent-50, score-0.245]
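As a concrete illustration of equation (1), the following Python sketch estimates v^π(s) by Monte Carlo rollouts with a truncated horizon; the callables policy, simulator and reward, together with their signatures, are hypothetical stand-ins for a policy π, the randomized simulator and the reward function r, and are not taken from the paper.

def estimate_value(s0, policy, simulator, reward, gamma, horizon=200, n_rollouts=1000):
    """Monte Carlo estimate of the discounted value v^pi(s0) of equation (1)."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite sum; the tail is O(gamma^horizon)
            a = policy(s)                     # a_t = pi(s_t)
            s_next = simulator(s, a)          # draw s_{t+1} ~ p(.|s_t, a_t)
            ret += discount * reward(s, a, s_next)
            discount *= gamma
            s = s_next
        total += ret
    return total / n_rollouts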
25 We refer to the planning trees used here as single successor state trees (S3-trees), in order to distinguish them from other planning trees used for the same problem (e.g. [sent-58, score-1.469]
26 the OP-MDP tree, where all possible successor states are considered). [sent-60, score-0.245]
27 Every node of an S3-tree represents a state s ∈ S, and has at most one child node per state-action a, representing a successor state s′ ∈ S. [sent-61, score-0.421]
28 The successor state is drawn using the simulator during the construction of the S3-tree. [sent-62, score-0.416]
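A minimal data structure for S3-tree nodes, used by the later sketches; the field names and the convention of storing the incoming-edge reward on the child node are illustrative choices, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class S3Node:
    """Node of a single successor state tree (S3-tree)."""
    state: object                                   # the state s in S represented by this node
    depth: int = 0                                  # d(x): length of the path from the root
    reward: float = 0.0                             # reward received on the edge leading into this node
    parent: object = None
    children: dict = field(default_factory=dict)    # at most one child node per action

    def is_leaf(self):
        return not self.children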
29 The planning tree construction, using the SOP algorithm (for “Safe Optimistic Planning”), is described in section 4. [sent-63, score-0.466]
30 The ASOP algorithm, which integrates building the forest and deciding on an action by aggregating the information in the forest, is described in section 4. [sent-65, score-0.366]
31 4.1 Safe optimistic planning in S3-trees: the SOP algorithm. SOP is an algorithm for sequentially constructing an S3-tree. [sent-68, score-0.648]
32 It can be seen as a variant of the OPD algorithm [9] for planning in deterministic MDPs. [sent-69, score-0.409]
33 SOP expands up to two leaves of the planning tree per iteration. [sent-70, score-0.495]
34 The first leaf (the optimistic one) is a maximizer of an upper bound (called b-value) on the value function of the (deterministic) realization of the MDP explored in the S3-tree. [sent-71, score-0.487]
35 The b-value of a node x is defined as $b(x) := \sum_{i=0}^{d(x)-1} \gamma^i r_i + \frac{\gamma^{d(x)}}{1-\gamma}$ (2), where $(r_i)$ is the sequence of rewards obtained along the path to x, and d(x) is the depth of the node (the length of the path from the root to x). [sent-72, score-0.258]
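A sketch of the b-value (2) on the S3Node structure above, assuming (as in the analysis) that rewards lie in [0, 1], so that γ^{d(x)}/(1−γ) upper-bounds the discounted reward obtainable below x.

def b_value(node, gamma):
    """Upper bound (b-value) of equation (2) for a node x of an S3-tree."""
    rewards = []
    x = node
    while x.parent is not None:          # collect r_{d(x)-1}, ..., r_0 while walking back to the root
        rewards.append(x.reward)
        x = x.parent
    value, discount = 0.0, 1.0
    for r in reversed(rewards):          # sum_{i=0}^{d(x)-1} gamma^i r_i
        value += discount * r
        discount *= gamma
    return value + gamma ** node.depth / (1.0 - gamma)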
36 Only expanding the optimistic leaf would not be enough to make ASOP consistent; this is shown in the appendix. [sent-73, score-0.397]
37 Therefore, a second leaf (the safe one), defined as the shallowest leaf in the current tree, is also expanded in each iteration. [sent-74, score-0.389]
38 Algorithm 1: SOP. Data: the initial state s0 ∈ S and a budget n ∈ N. Result: a planning tree T. Let T denote a tree consisting only of a leaf, representing s0. [sent-76, score-0.85]
39 While c < n do: form a subset L of the leaves of T, containing a leaf of minimal depth and a leaf of maximal b-value (computed according to (2); the two leaves can be identical). [sent-78, score-0.252]
40 Foreach a ∈ A do: if c < n, then use the simulator to draw a successor state s′ ∼ p(·|s, a). [sent-80, score-0.462]
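A Python sketch of Algorithm 1 (SOP), built on the S3Node and b_value sketches above; the handling of the call counter and the expansion order are a best-effort reading of the fragmentary pseudocode, and the actions, simulator and reward arguments are assumed inputs rather than names from the paper.

def sop(s0, budget, actions, simulator, reward, gamma):
    """Sketch of Algorithm 1 (SOP): repeatedly expand the safe (shallowest) leaf and the
    optimistic (maximal b-value) leaf until the simulator-call budget is spent."""
    root = S3Node(state=s0)
    calls = 0
    while calls < budget:
        leaves = collect_leaves(root)
        safe = min(leaves, key=lambda x: x.depth)                   # leaf of minimal depth
        optimistic = max(leaves, key=lambda x: b_value(x, gamma))   # leaf of maximal b-value (2)
        for leaf in ([safe] if safe is optimistic else [safe, optimistic]):
            for a in actions:
                if calls >= budget:
                    return root
                s_next = simulator(leaf.state, a)                   # draw s' ~ p(.|s, a)
                calls += 1
                leaf.children[a] = S3Node(state=s_next, depth=leaf.depth + 1,
                                          reward=reward(leaf.state, a, s_next), parent=leaf)
    return root

def collect_leaves(node):
    """All leaves of an S3-tree (helper for the sketch above)."""
    if node.is_leaf():
        return [node]
    return [leaf for child in node.children.values() for leaf in collect_leaves(child)]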
41 ASOP then outputs the action $\hat{\pi}(s_0) \in \operatorname{argmax}_{a \in A} \hat{Q}^*(s_0, a)$. [sent-89, score-0.213]
42 The optimal policy of the empirical MDP has the property that the empirical lower bound of its value, computed from the information collected by planning in the individual realizations, is maximal over the set of all policies. [sent-90, score-0.51]
43 Algorithm 2: ActionValue. Data: a forest F and an action a, with each tree in F representing the same state s. Result: an empirical lower bound for the value of a in s. Let E denote the edges representing action a at any of the root nodes of F. [sent-92, score-0.765]
44 If E = ∅ then return 0; else let F′ be the set of trees pointed to by the edges in E. [sent-93, score-0.182]
45 Foreach i ∈ I do: denote the set of trees in F′ which represent s_i by F_i. [sent-95, score-0.224]
46 Return $\sum_{i \in I} \hat{p}_i \big(r(s, a, s_i) + \gamma\, \hat{\nu}_i\big)$. Algorithm 3: ASOP. Data: the initial state s0, a per-tree budget b ∈ N and the forest size m ∈ N. Result: an action to take. For i = 1, …, m: build the tree Ti by running SOP from s0 with budget b; then return $\operatorname{argmax}_{a \in A}$ ActionValue({T1, …, Tm}, a). [sent-98, score-0.626]
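A Python sketch of Algorithms 2 and 3, building on the sop sketch above; it assumes states are hashable so that trees can be grouped by successor state, reads the reward r(s, a, s_i) from the stored edge reward, and takes the quantity ν̂_i to be the recursively aggregated value of the sub-forest F_i. This is one reading of the fragmentary pseudocode, not a verbatim transcription.

def action_value(forest, a, actions, gamma):
    """Sketch of Algorithm 2 (ActionValue): empirical lower bound on the value of action a
    in the state represented by the roots of `forest` (a list of S3Node roots for the same state)."""
    edges = [tree.children[a] for tree in forest if a in tree.children]   # edges for action a at the roots
    if not edges:
        return 0.0
    groups = {}                                       # group the successor subtrees by successor state s_i
    for child in edges:
        groups.setdefault(child.state, []).append(child)
    value = 0.0
    for subtrees in groups.values():
        p_i = len(subtrees) / len(edges)              # empirical transition frequency p_i
        r_i = subtrees[0].reward                      # reward r(s, a, s_i) stored on the sampled edge
        nu_i = max(action_value(subtrees, b, actions, gamma) for b in actions)   # assumed recursion for nu_i
        value += p_i * (r_i + gamma * nu_i)
    return value

def asop(s0, per_tree_budget, m, actions, simulator, reward, gamma):
    """Sketch of Algorithm 3 (ASOP): build a forest of m S3-trees with SOP and aggregate it."""
    forest = [sop(s0, per_tree_budget, actions, simulator, reward, gamma) for _ in range(m)]
    return max(actions, key=lambda a: action_value(forest, a, actions, gamma))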
47 5 Finite-sample analysis. In this section, we provide a finite-sample analysis of ASOP in terms of the number of planning trees m and per-tree budget n. [sent-105, score-0.708]
48 An immediate consequence of this analysis is that ASOP is consistent: the action returned by ASOP converges to the optimal action when both n and m tend to infinity. [sent-106, score-0.433]
49 First, let us use the “safe” part of SOP to show that each S3-tree is fully explored up to a certain depth d when given a sufficiently large per-tree budget n. [sent-108, score-0.299]
50 For any d ∈ N, once a budget of $n \ge 2|A|\,\frac{|A|^d - 1}{|A| - 1}$ has been spent by SOP on an S3-tree, the state-actions of all nodes up to and including those at depth d have all been sampled exactly once. [sent-110, score-0.272]
51 We complete the proof by noticing that SOP spends at least half of its budget on the shallowest leaves. [sent-114, score-0.25]
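A small helper implementing the budget bound of Lemma 1 as reconstructed above (the exact form of the bound is a best-effort reading of the garbled source): twice the geometric sum |A| + |A|^2 + … + |A|^d of simulator calls needed to sample every state-action down to depth d, doubled because only about half of the budget goes to the safe expansions.

def lemma1_budget(num_actions, d):
    """Per-tree budget guaranteeing full exploration up to depth d (reconstructed Lemma 1 bound)."""
    # 2 * |A| * (|A|^d - 1) / (|A| - 1) = 2 * (|A| + |A|^2 + ... + |A|^d)
    return 2 * num_actions * (num_actions ** d - 1) // (num_actions - 1)

# Example: with |A| = 2 actions and depth d = 3, the bound gives 2 * 2 * 7 / 1 = 28 simulator calls.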
52 Let $v^\pi_\omega$ and $v^\pi_{\omega,n}$ denote the value functions for a policy π in the infinite, completely explored S3-tree defined by a random realization ω and in the finite S3-tree constructed by SOP for a budget of n in the same realization ω, respectively. [sent-115, score-0.491]
53 From Lemma 1 we deduce that if the per-tree budget is at least $n \ge 2\,\frac{|A|}{|A|-1}\,[\epsilon(1-\gamma)]^{-\frac{\log|A|}{\log(1/\gamma)}}$, we obtain $|v^\pi_\omega(s_0) - v^\pi_{\omega,n}(s_0)| \le \epsilon$ (3). [sent-116, score-0.239]
54 ASOP aggregates the trees and computes the optimal policy π̂ of the resulting empirical MDP, whose transition probabilities are defined by the frequencies (over the m S3-trees) of transitions from state-actions to successor states. [sent-118, score-0.639]
55 If the number m of S3-trees and the per-tree budget n are large, we therefore expect the optimal policy π̂ of the empirical MDP to be close to the optimal policy π∗ of the true MDP. [sent-120, score-0.454]
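The frequency-based aggregation at the root can be sketched directly on the forest of S3Node roots from above; this is only an illustration of the preceding sentence, and it again assumes hashable states.

from collections import Counter

def empirical_transitions(forest, a):
    """Empirical transition distribution p_hat(.|s0, a) at the root state, estimated from the
    frequencies of the sampled successors over the m S3-trees."""
    counts = Counter(tree.children[a].state for tree in forest if a in tree.children)
    total = sum(counts.values())
    return {s_next: count / total for s_next, count in counts.items()} if total else {}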
56 For any δ ∈ (0, 1) and ε ∈ (0, 1), if the number of S3-trees is at least $m \ge \frac{8}{\epsilon^2 (1-\gamma)^2} \left( \frac{K \log|A|}{K-1} \left( \frac{\epsilon(1-\gamma)}{4} \right)^{-\frac{\log K}{\log(1/\gamma)}} + \log(4/\delta) \right)$ (5) and the per-tree budget is at least $n \ge 2\,\frac{|A|}{|A|-1} \left( \frac{\epsilon(1-\gamma)}{4} \right)^{-\frac{\log|A|}{\log(1/\gamma)}}$ (6), then $P(R_{m,n}(s_0) < \epsilon) \ge 1 - \delta$. [sent-123, score-0.262]
57 Analogously to (3), we know from Lemma 1 that, given our choice of n, SOP constructs trees which are completely explored up to depth $d := \frac{\log(\epsilon(1-\gamma)/4)}{\log \gamma}$, fulfilling $\frac{\gamma^{d+1}}{1-\gamma} \le \frac{\epsilon}{4}$. [sent-131, score-0.293]
58 Since the trees are complete up to level d and the rewards are non-negative, we deduce that we have $0 \le v^\pi_{\omega_i,n} - \nu^\pi_{\omega_i,d} \le \frac{\epsilon}{4}$ for each i and each policy π; thus the same will be true for the averages: $0 \le \hat{v}^\pi_{m,n} - \hat{\nu}^\pi_{m,d} \le \frac{\epsilon}{4}$ for all π. [sent-134, score-0.354]
59 The number of distinct policies is $|A| \cdot |A|^{K} \cdot |A|^{K^2} \cdots |A|^{K^d}$ (at each level l, there are at most $K^l$ states that can be reached by following a policy at previous levels, so there are $|A|^{K^l}$ different choices that policies can make at level l). [sent-138, score-0.22]
60 The action returned by ASOP is $\hat{\pi}(s_0)$, where $\hat{\pi} := \operatorname{argmax}_\pi \hat{v}^\pi_{m,n}$. [sent-140, score-0.219]
61 The total budget (nm) required to return an ε-optimal action with high probability is thus of order $\epsilon^{-2-\frac{\log(K|A|)}{\log(1/\gamma)}}$. [sent-142, score-0.406]
62 Notice that this rate is poorer (by an $\epsilon^{-2}$ factor) than the rate obtained for uniform planning in [2]; this is a direct consequence of the fact that we are only drawing samples, whereas a full model of the transition probabilities is assumed in [2]. [sent-143, score-0.444]
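A tiny helper evaluating the order of the total budget with the exponent reconstructed above (constants omitted); with a positive optimality gap Δ, replacing ε by Δ gives the budget order after which the simple regret vanishes, as noted below.

import math

def total_budget_order(epsilon, gamma, K, num_actions):
    """Order (up to constant factors) of the total budget n*m for an epsilon-optimal action."""
    exponent = 2.0 + math.log(K * num_actions) / math.log(1.0 / gamma)
    return epsilon ** (-exponent)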
63 Since there is a finite number of actions, by denoting $\Delta > 0$ the optimality gap between the best and the second-best action values, we have that the optimal arm is identified with high probability [sent-145, score-0.234]
64 (i.e. the simple regret is 0) after a total budget of order $\Delta^{-2-\frac{\log(K|A|)}{\log(1/\gamma)}}$. [sent-147, score-0.21]
65 The optimistic part of the algorithm allows a deep exploration of the MDP. [sent-149, score-0.328]
66 Notice that we do not use the optimistic properties of the algorithm in the analysis. [sent-153, score-0.29]
67 An analysis of the benefit of the optimistic part of the algorithm, similar to the analyses carried out in [9, 2], would be much more involved and is deferred to future work. [sent-157, score-0.29]
68 However, the impact of the optimistic part of the algorithm is essential in practice, as shown in the numerical results. [sent-158, score-0.313]
69 We use the (noisy) inverted pendulum benchmark problem from [2], which consists of swinging up and stabilizing a weight attached to an actuated link that rotates in a vertical plane. [sent-160, score-0.243]
70 Since the available power is too low to push the pendulum up in a single rotation from the initial state, the pendulum has to be swung back and forth to gather energy, prior to being pushed up and stabilized. [sent-161, score-0.284]
71 The inverted pendulum is described by the state variables (α, α̇) ∈ [−π, π] × [−15, 15] and the differential equation $\ddot{\alpha} = \big(mgl \sin(\alpha) - b\dot{\alpha} - K(K\dot{\alpha} + u)/R\big)/J$, where J = 1. [sent-162, score-0.273]
72 The goal is to stabilize the pendulum in the unstable equilibrium s∗ = (0, 0) (pointing up, at rest) when starting from state (−π, 0) (pointing down, at rest). [sent-176, score-0.206]
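A hedged sketch of a one-step simulator for this benchmark, integrating the stated dynamics with a simple Euler scheme; the physical constants are passed in rather than hard-coded because the source truncates them, and the angle wrapping, velocity clipping and Gaussian noise are illustrative assumptions, not specifics taken from [2].

import math
import random

def pendulum_step(state, u, params, dt=0.01, noise_std=0.0):
    """One (noisy) Euler step of the pendulum dynamics quoted in the text.
    `params` must supply the constants m, g, l, b, K, R and J (not fully given in the source)."""
    alpha, alpha_dot = state
    m, g, l, b, K, R, J = (params[k] for k in ("m", "g", "l", "b", "K", "R", "J"))
    alpha_ddot = (m * g * l * math.sin(alpha) - b * alpha_dot - K * (K * alpha_dot + u) / R) / J
    alpha_dot = max(-15.0, min(15.0, alpha_dot + dt * alpha_ddot))           # keep angular velocity in [-15, 15]
    alpha = ((alpha + dt * alpha_dot + math.pi) % (2 * math.pi)) - math.pi   # wrap the angle to [-pi, pi]
    if noise_std > 0.0:
        alpha_dot += random.gauss(0.0, noise_std)                            # illustrative noise model
    return (alpha, alpha_dot)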
73 Figure 1: Comparison of ASOP to OP-MDP, UCT, and FSSS on the inverted pendulum benchmark problem, showing the sum of discounted rewards for simulations of 50 time steps (x-axis: calls to the simulator per step). [sent-188, score-0.494]
74 In the cases of ASOP, UCT, and FSSS, the budget is in terms of calls to the simulator. [sent-190, score-0.228]
75 Instead, every possible successor state is incorporated into the planning tree, together with its precise probability mass, and each of these states is counted against the budget. [sent-192, score-0.667]
76 Figure 2 shows the impact of optimistic planning on the performance of our aggregation method. [sent-203, score-0.683]
77 For forest sizes of both one and three, optimistic planning leads to considerably increased performance. [sent-204, score-0.786]
78 This is due to the greater planning depth in the lookahead tree when using optimistic exploration. [sent-205, score-0.815]
79 7 Conclusion. We introduced ASOP, a novel algorithm for solving online planning problems using a (randomized) simulator for the MDP, under a budget constraint. [sent-208, score-0.691]
80 The algorithm works by constructing a forest of single successor state trees, each corresponding to a random realization of the MDP transitions. [sent-209, score-0.471]
81 Each tree is constructed using a combination of safe and optimistic planning. [sent-210, score-0.572]
82 An empirical MDP is defined, based on the forest, and the first action of the optimal policy of this empirical MDP is returned. [sent-211, score-0.326]
83 in [13]) by using the optimistic principle to focus rapidly on the most promising area(s) of the search space. [sent-214, score-0.341]
84 It can also find a reasonable solution in unstructured problems, since some of the budget is allocated for uniform exploration. [sent-215, score-0.19]
85 ASOP shows good performance on the inverted pendulum benchmark. [sent-216, score-0.209]
86 Finally, our algorithm is also appealing in that the numerically heavy part of constructing the planning trees, in which the simulator is used, can be performed in a distributed way. [sent-217, score-0.521]
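Because the m trees are built independently, the simulator-heavy phase parallelizes trivially; a minimal sketch using Python's multiprocessing and the sop sketch above, assuming the simulator and reward callables are picklable.

from multiprocessing import Pool

def build_forest_parallel(s0, per_tree_budget, m, actions, simulator, reward, gamma, workers=4):
    """Construct the m S3-trees of the forest in parallel; each tree only needs its own stream
    of simulator calls, so no coordination between the workers is required."""
    args = [(s0, per_tree_budget, actions, simulator, reward, gamma)] * m
    with Pool(workers) as pool:
        return pool.starmap(sop, args)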
87 Figure 2: Comparison of different planning strategies (on the same problem as in figure 1; x-axis: calls to the simulator per step). [sent-220, score-0.501]
88 The “Safe” strategy is to use uniform planning in the individual trees; the “Optimistic” strategy is to use OPD. [sent-221, score-0.358]
89 Appendix: Counterexample to consistency when using purely optimistic planning in S3-trees. Consider the MDP in figure 3 with k zero-reward transitions in the middle branch, where γ ∈ (0, 1) and k ∈ N are chosen such that $\frac{1}{2} > \gamma^k > \frac{1}{3}$. [sent-229, score-0.75]
90 The trees are constructed iteratively, and every iteration consists of exploring a leaf of maximal b-value, where exploring a leaf means introducing a single successor state per action at the selected leaf. [sent-233, score-0.882]
91 There are two possible outcomes when sampling the action a, which occur with probabilities $\frac{1}{3}$ and $\frac{2}{3}$, respectively. Outcome I: The upper branch of action a is sampled. [sent-235, score-0.473]
92 In this case, the contribution to the forest is an arbitrarily long reward-1 path for action a, and a finite reward-$\frac{1}{2}$ path for action b. [sent-236, score-0.724]
93 Outcome II: The lower branch of action a is sampled. [sent-237, score-0.249]
94 Because $\frac{\gamma^k}{1-\gamma} < \frac{1}{2}\cdot\frac{1}{1-\gamma}$, the lower branch will be explored only up to k times, as its b-value is then lower than the value (and therefore any b-value) of action b. [sent-238, score-0.299]
95 The contribution of this case to the forest is a finite reward-0 path for action a and an arbitrarily long (depending on the budget) reward-$\frac{1}{2}$ path for action b. [sent-239, score-0.724]
96 For an increasing exploration budget per tree and an increasing number of trees, the approximate action values of actions a and b obtained by aggregation converge to $\frac{1}{3}\cdot\frac{1}{1-\gamma}$ and $\frac{1}{2}\cdot\frac{1}{1-\gamma}$, respectively. [sent-240, score-0.759]
97 Therefore, the decision rule will select action b for a sufficiently large budget, even though a is the optimal action. [sent-241, score-0.293]
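A small numeric check of this conclusion, using the limits stated above; the closed-form true value of action a used here (reward-1 branch with probability 1/3, and k zero rewards followed by reward 1 with probability 2/3) is an assumption consistent with the description of the MDP, not a formula quoted from the paper.

def counterexample_values(gamma, k):
    """Compare the limits of the purely optimistic aggregation with the (assumed) true action values."""
    assert 0.5 > gamma ** k > 1.0 / 3.0, "gamma and k must satisfy 1/2 > gamma^k > 1/3"
    agg_a = (1.0 / 3.0) / (1.0 - gamma)              # limit of the aggregated estimate for action a
    agg_b = (1.0 / 2.0) / (1.0 - gamma)              # limit of the aggregated estimate for action b
    true_a = (1.0 / 3.0 + (2.0 / 3.0) * gamma ** k) / (1.0 - gamma)
    true_b = (1.0 / 2.0) / (1.0 - gamma)
    return {"aggregated": (agg_a, agg_b), "true": (true_a, true_b)}

# e.g. gamma = 0.9 and k = 7 give gamma^k of about 0.478: the aggregation prefers b (since 1/3 < 1/2),
# although the true value of a exceeds that of b (since 1/3 + (2/3) * gamma^k > 1/2).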
98 Lazy planning under uncertainty by optimizing decisions on an ensemble of incomplete disturbance trees. [sent-288, score-0.391]
99 A sparse sampling algorithm for near-optimal planning in large Markov decision processes. [sent-328, score-0.437]
100 From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. [sent-337, score-0.321]
wordName wordTfidf (topN-words)
[('asop', 0.461), ('planning', 0.358), ('optimistic', 0.29), ('mdp', 0.238), ('sop', 0.23), ('successor', 0.209), ('action', 0.194), ('budget', 0.19), ('fsss', 0.16), ('trees', 0.16), ('safe', 0.155), ('simulator', 0.143), ('pendulum', 0.142), ('uct', 0.138), ('forest', 0.138), ('policy', 0.112), ('tree', 0.108), ('leaf', 0.087), ('decision', 0.079), ('reward', 0.069), ('realizations', 0.068), ('inverted', 0.067), ('state', 0.064), ('actionvalue', 0.06), ('fonteneau', 0.06), ('shallowest', 0.06), ('mdps', 0.06), ('realization', 0.06), ('depth', 0.059), ('rewards', 0.057), ('transition', 0.056), ('branch', 0.055), ('reinforcement', 0.054), ('busoniu', 0.053), ('deterministic', 0.051), ('discounted', 0.051), ('explored', 0.05), ('foreach', 0.046), ('nord', 0.044), ('inria', 0.041), ('actions', 0.04), ('optimism', 0.04), ('lille', 0.04), ('opd', 0.04), ('exploration', 0.038), ('calls', 0.038), ('europe', 0.037), ('states', 0.036), ('policies', 0.036), ('rapha', 0.035), ('aggregation', 0.035), ('benchmark', 0.034), ('aggregating', 0.034), ('disturbance', 0.033), ('kg', 0.033), ('transitions', 0.033), ('node', 0.031), ('principle', 0.031), ('safety', 0.031), ('humanoid', 0.031), ('path', 0.03), ('probabilities', 0.03), ('leaves', 0.029), ('voltage', 0.028), ('argmaxa', 0.027), ('acting', 0.026), ('bandit', 0.025), ('deduce', 0.025), ('returned', 0.025), ('remark', 0.024), ('log', 0.024), ('fi', 0.024), ('markov', 0.024), ('randomized', 0.023), ('numerical', 0.023), ('munos', 0.023), ('nodes', 0.023), ('segments', 0.023), ('representing', 0.022), ('pointing', 0.022), ('return', 0.022), ('exploring', 0.021), ('aggregated', 0.021), ('expanding', 0.02), ('truncation', 0.02), ('st', 0.02), ('maximal', 0.02), ('stochastic', 0.02), ('nm', 0.02), ('regret', 0.02), ('ri', 0.02), ('search', 0.02), ('numerically', 0.02), ('optimal', 0.02), ('argmax', 0.019), ('frequencies', 0.019), ('constructed', 0.019), ('discount', 0.018), ('optimally', 0.018), ('si', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
2 0.28906184 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Author: Aijun Bai, Feng Wu, Xiaoping Chen
Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1
3 0.21917239 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
4 0.21031451 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
5 0.17428015 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
6 0.17212467 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
7 0.16662909 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
8 0.16466576 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
9 0.16283727 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
10 0.15633819 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
11 0.14353095 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
12 0.13952589 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
13 0.13721502 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
14 0.12865008 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.12813495 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
16 0.12697838 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
17 0.12514289 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
18 0.11024979 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
19 0.1059875 257 nips-2013-Projected Natural Actor-Critic
20 0.10500304 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
topicId topicWeight
[(0, 0.188), (1, -0.288), (2, -0.113), (3, 0.07), (4, 0.022), (5, -0.016), (6, 0.063), (7, -0.064), (8, -0.066), (9, 0.052), (10, 0.035), (11, 0.014), (12, 0.115), (13, -0.021), (14, -0.009), (15, 0.103), (16, 0.064), (17, -0.002), (18, 0.048), (19, -0.107), (20, 0.026), (21, 0.159), (22, -0.035), (23, 0.01), (24, 0.07), (25, 0.028), (26, -0.038), (27, -0.125), (28, 0.01), (29, -0.009), (30, 0.028), (31, -0.011), (32, -0.079), (33, 0.081), (34, 0.06), (35, 0.008), (36, -0.034), (37, 0.1), (38, 0.012), (39, -0.096), (40, -0.035), (41, -0.002), (42, -0.054), (43, -0.003), (44, -0.034), (45, -0.004), (46, -0.071), (47, -0.02), (48, -0.072), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.97023118 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
2 0.85583055 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Author: Aijun Bai, Feng Wu, Xiaoping Chen
Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1
3 0.74240959 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
4 0.722426 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do
Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, thus real world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem, and present experimental results with Probabilistic-FF planner. 1
5 0.70995075 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
6 0.67797166 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
7 0.67016339 347 nips-2013-Variational Planning for Graph-based MDPs
8 0.66394264 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
9 0.6460337 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
10 0.64351428 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
11 0.62498939 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
12 0.5855993 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
13 0.54638785 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
14 0.53873098 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.52774543 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
16 0.49063841 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
17 0.48637724 340 nips-2013-Understanding variable importances in forests of randomized trees
18 0.48184016 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
19 0.48095903 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
20 0.45899665 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
topicId topicWeight
[(2, 0.013), (16, 0.035), (33, 0.091), (34, 0.116), (36, 0.015), (41, 0.04), (49, 0.035), (56, 0.174), (70, 0.032), (85, 0.06), (88, 0.019), (89, 0.021), (93, 0.049), (94, 0.214)]
simIndex simValue paperId paperTitle
same-paper 1 0.83219528 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
2 0.77433968 325 nips-2013-The Pareto Regret Frontier
Author: Wouter M. Koolen
Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1
3 0.76568097 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
Author: Ryota Tomioka, Taiji Suzuki
Abstract: We study a new class of structured Schatten norms for tensors that includes two recently proposed norms (“overlapped” and “latent”) for convex-optimizationbased tensor decomposition. We analyze the performance of “latent” approach for tensor decomposition, which was empirically found to perform better than the “overlapped” approach in some settings. We show theoretically that this is indeed the case. In particular, when the unknown true tensor is low-rank in a specific unknown mode, this approach performs as well as knowing the mode with the smallest rank. Along the way, we show a novel duality result for structured Schatten norms, which is also interesting in the general context of structured sparsity. We confirm through numerical simulations that our theory can precisely predict the scaling behaviour of the mean squared error. 1
4 0.74395764 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
5 0.74204373 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
Author: Adel Javanmard, Andrea Montanari
Abstract: In the high-dimensional regression model a response variable is linearly related to p covariates, but the sample size n is smaller than p. We assume that only a small subset of covariates is ‘active’ (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ( 1 -regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so called ‘irrepresentability’ condition. In this paper we study the ‘Gauss-Lasso’ selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate ‘generalized irrepresentability condition’ (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set. 1
6 0.73103464 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
7 0.73059309 249 nips-2013-Polar Operators for Structured Sparse Estimation
8 0.72586304 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
9 0.72458953 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
10 0.72400308 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
11 0.72299868 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
12 0.72266209 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
13 0.72123212 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
14 0.72103882 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
15 0.71925235 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
16 0.7190572 63 nips-2013-Cluster Trees on Manifolds
17 0.7188198 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
18 0.71881402 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
19 0.71773261 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
20 0.71701723 252 nips-2013-Predictive PAC Learning and Process Decompositions