nips nips2011 nips2011-174 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhan Wei Lim, David Hsu, Wee Sun Lee
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Monte Carlo Value Iteration with Macro-Actions Zhan Wei Lim David Hsu Wee Sun Lee Department of Computer Science, National University of Singapore Singapore, 117417, Singapore Abstract POMDP planning faces two major computational challenges: large state spaces and long planning horizons. [sent-1, score-0.603]
2 The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. [sent-2, score-0.465]
3 1 Introduction The partially observable Markov decision process (POMDP) provides a principled and general framework for planning with imperfect state information. [sent-7, score-0.398]
4 In POMDP planning, we represent an agent’s possible states probabilistically as a belief and systematically reason over the space of all beliefs in order to derive a policy that is robust under uncertainty. [sent-8, score-0.497]
5 A complex planning task involves a large number of states, resulting in a high-dimensional belief space. [sent-11, score-0.377]
6 In applications such as robot motion planning, an agent often takes many actions before reaching the goal, resulting in a long planning horizon. [sent-13, score-0.381]
7 The complexity of the planning task grows very fast with the horizon. [sent-14, score-0.234]
8 It can tackle POMDPs with very large discrete state spaces or continuous state spaces. [sent-18, score-0.231]
9 The main idea of MCVI is to sample both an agent’s state space and the corresponding belief space simultaneously, thus avoiding the prohibitive computational cost of unnecessarily processing these spaces in their entirety. [sent-19, score-0.278]
10 It uses Monte Carlo sampling in conjunction with dynamic programming to compute a policy represented as a finite state controller. [sent-20, score-0.407]
11 However, the performance of MCVI degrades as the planning horizon increases. [sent-22, score-0.276]
12 Unfortunately, the theoretical properties of MCVI, such as the approximation error bounds [2], do not carry over to Macro-MCVI automatically if arbitrary mappings from beliefs to actions are allowed as macro-actions. [sent-25, score-0.198]
13 A major advantage of the new algorithm is its ability to abstract away the lengths of macro-actions in planning and reduce the effect of long planning horizons. [sent-27, score-0.468]
14 Macro-MCVI can also be used to construct a hierarchy of macro-actions for planning large spaces. [sent-30, score-0.234]
15 2 Related Works Macro-actions have long been used to speed up planning and learning algorithms for MDPs (see, e. [sent-32, score-0.234]
16 Similarly, they have been used in offline policy computation for POMDPs [16, 8]. [sent-35, score-0.311]
17 The earlier work uses finite state controllers for policy representation and policy iteration for policy computation, but it has not yet been shown to work on large state spaces. [sent-40, score-1.194]
18 Expectation-maximization (EM) can be used to train finite state controllers [17] and potentially handle large state spaces, but it often gets stuck in local optima. [sent-41, score-0.229]
19 A very powerful representation for a macro-action would be to allow it to be an arbitrary mapping from belief to action that will run until some termination condition is met. [sent-46, score-0.258]
20 Consider two macro-actions: one always selects the poorer primitive action, while the other selects the better primitive action for some beliefs. [sent-50, score-0.302]
21 The reward function also forms the optimal value function of the process and need not even be continuous as the macro-action can be an arbitrary mapping from belief to action. [sent-53, score-0.28]
22 Our POSMDP is formally defined as a tuple (S, A, O, T, R, γ), where S is a state space, A is a macro-action space, O is a macro-observation space, T is a joint transition and observation function, R is a reward function, and γ ∈ (0, 1) is a discount factor. [sent-61, score-0.256]
23 If we apply a macro-action a with start state si , T = p(sj , o, k|si , a) encodes the joint conditional probability of the end state sj , macro-observation o, and the number of time steps k that it takes for a to reach sj from si . [sent-62, score-0.49]
24 The reward function R gives the discounted cumulative reward for a macro-action a that starts at state s: R(s, a) = \sum_{t=0}^{\infty} \gamma^{t} E(r_t \mid s, a), where E(r_t \mid s, a) is the expected reward at step t. [sent-64, score-0.42]
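For concreteness, R(s, a) can be estimated by pure simulation, without any explicit probabilistic model of the macro-action. The sketch below is only illustrative: the generative simulator sim(state, primitive_action) -> (next_state, observation, reward) and the macro-action object with reset, act, and done methods are assumed interfaces, not the paper's.

```python
def estimate_macro_reward(sim, s, macro, gamma=0.95, n_trials=100, max_steps=500):
    """Monte Carlo estimate of R(s, a) = sum_t gamma^t E(r_t | s, a) for a
    macro-action `macro` started in state s, using only a generative simulator."""
    total = 0.0
    for _ in range(n_trials):
        state, obs, discount, t = s, None, 1.0, 0
        macro.reset()
        while not macro.done(obs) and t < max_steps:
            state, obs, r = sim(state, macro.act(obs))  # one primitive step
            total += discount * r
            discount *= gamma
            t += 1
    return total / n_trials
```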
25 Assuming that the number of states is n, a reweighted belief (like a belief) is a vector of n non-negative numbers that sums to one. [sent-67, score-0.22]
26 By assuming that the POSMDP process will stop with probability 1−γ at each time step, we can interpret the reweighted belief as the conditional probability of a state given that the process has not stopped. [sent-68, score-0.316]
27 This gives an interpretation of the reweighted belief in terms of the discount factor. [sent-69, score-0.22]
28 Given a reweighted belief, we compute the next reweighted belief given macro-action a and observation o, b' = τ(b, a, o), as follows: b'(s) = \frac{\sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s, o, k \mid s_i, a)\, b(s_i)}{\sum_{j} \sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s_j, o, k \mid s_i, a)\, b(s_i)}. (1) [sent-70, score-0.42]
29 We will simply refer to the reweighted belief as a belief from here on. [sent-71, score-0.363]
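As a sanity check on Eq. (1), a direct (non-sampled) implementation is sketched below for the case where p(s_j, o, k | s_i, a) has been tabulated up to a finite horizon K; the truncation of the infinite sum over k and the array layout are illustrative assumptions, since MCVI itself would estimate these quantities by sampling.

```python
import numpy as np

def update_belief(b, p, o, gamma):
    """Reweighted belief update b' = tau(b, a, o) of Eq. (1).

    b: belief vector of length n.
    p: array of shape (K, n_obs, n, n), where p[k-1, o, j, i] approximates
       p(s_j, o, k | s_i, a) for k = 1..K (a finite truncation, for illustration).
    """
    K = p.shape[0]
    discounts = gamma ** np.arange(K)                  # gamma^(k-1) for k = 1..K
    numer = np.einsum('k,kji,i->j', discounts, p[:, o], b)
    return numer / numer.sum()                         # normalize over all states s_j
```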
30 A policy π is a mapping from a belief to a macro-action. [sent-75, score-0.454]
31 Let R(b, a) = \sum_{s} b(s) R(s, a). The value of a policy π can then be defined recursively as V_π(b) = R(b, π(b)) + γ \sum_{o} p(o \mid b, π(b)) V_π(τ(b, π(b), o)). [sent-76, score-0.311]
32 Note that the policy operates on the belief and may not know the number of steps taken by the macro-actions. [sent-78, score-0.454]
33 We call a policy an m-step policy if the number of times macro-actions are applied is m. [sent-86, score-0.622]
34 Theorem 2 The value function for an m-step policy is piecewise linear and convex and can be represented as V_m(b) = \max_{\alpha \in \Gamma_m} \sum_{s \in S} \alpha(s) b(s), (3) where Γ_m is a finite collection of α-vectors. [sent-88, score-0.34]
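Equation (3) says that the value at a belief is the largest inner product with any α-vector in Γ_m. A minimal evaluation routine, assuming Γ_m is stored as a matrix (an illustrative representation only), is:

```python
import numpy as np

def value_at_belief(b, Gamma_m):
    """V_m(b) = max_{alpha in Gamma_m} sum_s alpha(s) b(s), as in Eq. (3).

    Gamma_m: (num_vectors, num_states) array of alpha-vectors.
    Returns the value and the index of the maximizing alpha-vector."""
    scores = Gamma_m @ np.asarray(b, dtype=float)
    best = int(np.argmax(scores))
    return float(scores[best]), best
```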
35 2 Macro-action Construction We would like to construct macro-actions from primitive actions of a POMDP in order to use temporal abstraction to help solve difficult POMDP problems. [sent-91, score-0.226]
36 A partially observable Markov decision process (POMDP) is defined by a finite state space S, a finite action space A, a reward function R(s, a), an observation space O, and a discount factor γ ∈ (0, 1). [sent-92, score-0.407]
37 In our POSMDP, the probability function p(sj , o, k|si , a) for a macro-action must be independent of the history given the current state si ; hence the selection of primitive actions and termination conditions within the macro-action cannot depend on the belief. [sent-93, score-0.45]
38 Due to partial observability, it is often not possible to allow the primitive action and the termination condition to be functions of the initial state. [sent-95, score-0.21]
39 Instead of α-vectors, MCVI uses an alternative policy representation called a policy graph G. [sent-113, score-0.68]
40 A policy graph is a directed graph with labeled nodes and edges. [sent-114, score-0.427]
41 To execute a policy πG , it is treated as a finite state controller whose states are the nodes of G. [sent-116, score-0.478]
42 Given an initial belief b, a starting node v of G is selected and its associated macro-action av is performed. [sent-117, score-0.184]
43 Let πG,v denote a policy represented by G, when the controller always starts in node v of G. [sent-120, score-0.423]
44 We define the value αv (s) to be the expected total reward of executing πG,v with initial state s. [sent-121, score-0.233]
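Since a policy graph executes like a finite state controller, α_v(s) can be estimated by averaging Monte Carlo rollouts such as the sketch below. The graph encoding (node -> macro-action plus observation-labeled edges) and the simulator interface are illustrative assumptions, not the paper's data structures; the simulator is assumed to return the reward already discounted within the macro-action together with the number k of primitive steps taken, so future rewards are further discounted by gamma^k.

```python
def rollout(sim_macro, G, v, s, gamma=0.95, n_macro_steps=100):
    """One rollout of pi_{G,v} from state s; averaging many calls estimates alpha_v(s).

    G: dict mapping node -> (macro_action, {observation: next_node}).
    sim_macro(state, macro_action) -> (next_state, observation,
        discounted_reward_within_macro, n_primitive_steps).
    """
    total, discount, state = 0.0, 1.0, s
    for _ in range(n_macro_steps):
        macro, edges = G[v]
        state, obs, r, k = sim_macro(state, macro)
        total += discount * r
        discount *= gamma ** k          # discount by the primitive steps elapsed
        v = edges.get(obs, v)           # follow the edge labeled by the observation
    return total
```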
45 1 MC-Backup One way to approximate the value function is to repeatedly run the backup operator H starting from an arbitrary value function until it is close to convergence. [sent-125, score-0.18]
46 Value iteration can be carried out on policy graphs as well, as it provides an implicit representation of a value function. [sent-127, score-0.372]
47 Let VG be the value function for a policy graph G. [sent-128, score-0.398]
48 (5) s∈S It is possible to then evaluate the right-hand side of (5) via sampling and monte carlo simulation at a ˆ belief b. [sent-130, score-0.238]
49 The outcome is a new policy graph G with value function Hb VG . [sent-131, score-0.398]
50 There are |A||G|^{|O|} possible ways to generate a new policy graph G' that has one new node compared to the old policy graph. [sent-133, score-0.369]
51 Algorithm 1 computes an estimate of the best new policy graph at b using only N |A||G| samples. [sent-134, score-0.369]
52 4 Algorithm 1 MC-Backup of a policy graph G at a belief b ∈ B with N samples. [sent-137, score-0.512]
53 3: for each action a ∈ A do 4: for i = 1 to N do 5: Sample a state si with probability b(si ). [sent-140, score-0.25]
54 Generate a new state si , observation oi , and discounted reward R (si , a) by sampling from p(sj , o, k|si , a). [sent-142, score-0.354]
55 8: for each node v ∈ G do 9: Set V to be the expected total reward of simulating the policy represented by G, with initial controller state v and initial state si . [sent-144, score-0.821]
56 17: Create a new policy graph G' by adding a new node u to G. [sent-151, score-0.41]
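Putting the steps of Algorithm 1 together, a simplified sketch of MC-BACKUP is given below. Every interface here (belief sampling, the macro-action simulator, node evaluation, graph construction) is a hypothetical stand-in, and the gamma^k discounting follows the rollout sketch above rather than quoting the paper's exact bookkeeping.

```python
from collections import defaultdict

def mc_backup(G, b, N, actions, sim_macro, eval_node, gamma=0.95):
    """Simplified sketch of MC-BACKUP(G, b, N) from Algorithm 1.

    Assumed interfaces: b.sample() draws a state s_i with probability b(s_i);
    sim_macro(s, a) -> (next_state, observation, discounted_reward, n_steps);
    eval_node(G, v, s) estimates the total reward of running pi_{G,v} from s;
    G.nodes() lists controller nodes; G.add_node(action, edges) returns the
    new policy graph G' with one extra node."""
    best_value, best_action, best_edges = float('-inf'), None, None
    for a in actions:
        immediate = 0.0
        cont = defaultdict(lambda: defaultdict(float))   # cont[o][v]: continuation value
        for _ in range(N):
            s = b.sample()
            s_next, o, r, k = sim_macro(s, a)
            immediate += r
            for v in G.nodes():
                cont[o][v] += (gamma ** k) * eval_node(G, v, s_next)
        # keep, for each observation, the best existing node to continue from
        total = immediate + sum(max(vals.values()) for vals in cont.values())
        if total / N > best_value:
            best_value = total / N
            best_action = a
            best_edges = {o: max(vals, key=vals.get) for o, vals in cont.items()}
    return G.add_node(best_action, best_edges), best_value
```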
57 Theorem 3 Given a policy graph G and a point b ∈ B, MC-BACKUP(G, b, N) produces an improved policy graph such that |\hat{H}_b V_G(b) − H V_G(b)| ≤ \frac{2 R_{\max}}{1−\gamma} \sqrt{\frac{2\,(|O| \ln |G| + \ln(2|A|) + \ln(1/\tau))}{N}}, with probability at least 1 − τ. [sent-155, score-0.77]
58 In contrast to the standard VI backup operator H, which performs backup at every point in B, the operator \hat{H}_B applies MC-BACKUP(Gm, b, N) on a policy graph Gm at every point in B. [sent-160, score-0.613]
59 HB then produces a new policy graph Gm+1 by adding the new policy graph nodes to the previous policy graph Gm . [sent-162, score-1.107]
60 Let V0 be the value function for some initial policy graph and Vm+1 = \hat{H}_B Vm. [sent-164, score-0.398]
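The point-based iteration Vm+1 = \hat{H}_B Vm then amounts to backing up the current graph at every sampled belief and keeping all the new nodes, roughly as sketched below; the node-level return value assumed for mc_backup_node is an illustrative simplification.

```python
def point_based_backup(G_m, B, N, mc_backup_node):
    """One application of the operator H_B: MC-backup the policy graph G_m at
    every belief in the finite sampled set B, then add all new nodes to obtain
    G_{m+1}. mc_backup_node(G, b, N) is assumed to return (action, edges) for
    the single new controller node backed up at b."""
    new_nodes = [mc_backup_node(G_m, b, N) for b in B]
    G_next = G_m
    for action, edges in new_nodes:
        G_next = G_next.add_node(action, edges)
    return G_next
```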
61 Figure 1: (a) Underwater Navigation: A reduced map with an 11 × 12 grid is shown, with "S" marking the possible initial positions, "D" the destinations, "R" the rocks, and "O" the locations where the robot can localize completely. [sent-173, score-0.453]
62 (b) Collaborative search and capture: Two robotic agents catching 12 escaped crocodiles in a 21 × 21 grid. [sent-174, score-0.274]
63 (c) Vehicular ad-hoc networking: A UAV maintains an ad-hoc network over four ground vehicles in a 10 × 10 grid, with "B" marking the base and "D" the destinations. [sent-175, score-0.178]
64 Underwater Navigation: The underwater navigation task was introduced in [9]. [sent-188, score-0.211]
65 In this task, an autonomous underwater vehicle (AUV) navigates in an environment modeled as a 51 × 52 grid map. [sent-189, score-0.394]
66 The AUV needs to move from the left border to the right border while avoiding the rocks scattered near its destination. [sent-190, score-0.175]
67 The AUV has six actions: move north, move south, move east, move north-east, move south-east or stay in the same location. [sent-191, score-0.352]
68 We also define an additional macro-action that navigates to the nearest goal location if the AUV position is known, and simply stays in the same location if the AUV position is not known. [sent-197, score-0.173]
69 To enable proper behaviour of the last macro-action, we augment the state space with a fully observable state variable that indicates the current AUV location. [sent-198, score-0.26]
70 This gives a simple example where the original state space is augmented with a fully observable state variable to allow more sophisticated macro-action behaviour. [sent-200, score-0.331]
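To make this concrete, the sketch below encodes the "navigate to the nearest goal, otherwise stay" behaviour, keying both the action choice and the termination test on the fully observable position variable. The grid coordinates, action names, and the greedy step helper are illustrative assumptions, not the paper's implementation.

```python
def step_towards(pos, goal):
    """Greedy primitive move using the AUV's action set (illustrative helper)."""
    dr, dc = goal[0] - pos[0], goal[1] - pos[1]
    if dc > 0 and dr > 0:
        return 'move south-east'
    if dc > 0 and dr < 0:
        return 'move north-east'
    if dc > 0:
        return 'move east'
    if dr > 0:
        return 'move south'
    if dr < 0:
        return 'move north'
    return 'stay'

class GoToNearestGoal:
    """Macro-action: navigate to the nearest goal if the AUV position is known,
    otherwise stay in place. `known_pos` is the fully observable state variable
    (None while the AUV has not localized)."""

    def __init__(self, goals):
        self.goals = list(goals)

    def act(self, known_pos):
        if known_pos is None:
            return 'stay'
        goal = min(self.goals,
                   key=lambda g: abs(g[0] - known_pos[0]) + abs(g[1] - known_pos[1]))
        return step_towards(known_pos, goal)

    def done(self, known_pos):
        return known_pos in self.goals
```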
71 Collaborative Search and Capture: In this problem, a group of crocodiles has escaped from its enclosure into the environment, and two robotic agents have to collaborate to hunt down and capture the crocodiles (see Figure 1). [sent-202, score-0.322]
72 Both agents are centrally controlled and each agent can make a one step move in one of the four directions (north, south, east and west) or stay still at each time instance. [sent-203, score-0.259]
73 At every time instance, each crocodile moves to a location furthest from the agent that is nearest to it with a probability 1 − p (p = 0. [sent-205, score-0.172]
74 The agents do not know the exact location of the crocodiles, but each agent knows the number of crocodiles in the top left, top right, bottom left and bottom right quadrants around itself from the noise made by the crocodiles. [sent-209, score-0.252]
75 Vehicular Ad-hoc Network: In a post disaster search and rescue scenario, a group of rescue vehicles are deployed for operation work in an area where communication infrastructure has been destroyed. [sent-214, score-0.256]
76 The UAV needs to visit each vehicle as often as possible to pick up and deliver data packets [13]. [sent-217, score-0.215]
77 In this task, 4 rescue vehicles and 1 UAV navigate in a terrain modeled as a 10 × 10 grid map. [sent-218, score-0.253]
78 There are obstacles on the terrain that are impassable to the ground vehicles but passable to the UAV. [sent-219, score-0.213]
79 Upon reaching its destination, the vehicle may roam around the environment randomly while carrying out its mission. [sent-222, score-0.22]
80 The UAV knows its own location on the map and can observe the location of a vehicle if they are in the same grid square. [sent-223, score-0.336]
81 To elicit a policy with low network latency, there is a penalty at each time step for each vehicle. [sent-224, score-0.311]
82 The penalty is −0.1 × the number of time steps since that vehicle was last visited. [sent-225, score-0.215]
83 There is a reward of 10 for each time a vehicle is visited by the UAV. [sent-226, score-0.292]
84 The state space consists of the vehicles' locations, the UAV's location in the grid map, and the number of time steps since each vehicle was last seen (for computing the reward). [sent-227, score-0.372]
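The reward structure just described can be written down directly; in the sketch below the constants (+10 per visit, −0.1 per waiting time step per vehicle) come from the text, while the per-vehicle counters and the function signature are illustrative assumptions.

```python
def vanet_step_reward(visited_now, steps_since_visit):
    """Per-time-step reward for the vehicular ad-hoc network task.

    visited_now: booleans, one per vehicle, True if the UAV shares that
        vehicle's grid square at this step.
    steps_since_visit: ints, time steps since each vehicle was last seen."""
    return 10.0 * sum(visited_now) - 0.1 * sum(steps_since_visit)
```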
85 We abstract the movements of the UAV to search for and visit a single vehicle as macro-actions. [sent-228, score-0.302]
86 There are two kinds of search macro-actions for each vehicle: search for the vehicle along its predefined path, and search for the vehicle after it has started to roam randomly. [sent-229, score-0.669]
87 Each macro-action is in turn hierarchically constructed by solving the simplified POMDP task of searching for a single vehicle on the same map using basic actions and some simple macro-actions that move along the paths. [sent-231, score-0.355]
88 We also compared with a state-of-the-art off-line POMDP solver, SARSOP [9], on the underwater navigation task. [sent-235, score-0.211]
89 We are not aware of other efficient off-line macro-action POMDP solvers that have been demonstrated on very large state space problems. [sent-244, score-0.197]
90 The underwater navigation task consists of two phases: the localization phase and the navigate-to-goal phase. [sent-255, score-0.211]
91 Macro-MCVI’s policy takes one macro-action, “moving northeast until reaching the border”, to localize and another macro-action, “navigating to the goal”, to reach the goal. [sent-256, score-0.346]
92 Online-Macro does well, as the planning horizon is short with the use of macro-actions. [sent-258, score-0.276]
93 We first used MacroMCVI to solve for the policy that finds a single vehicle. [sent-293, score-0.311]
94 We then used the single-vehicle policy as a macro-action and solved for the higher-level policy that plans over the macro-actions. [sent-295, score-0.622]
95 Although it took substantial computation time, MacroMCVI generated a reasonable policy in the end. [sent-296, score-0.311]
96 To confirm that the policy computed by Macro-MCVI at the higher level of the hierarchy is also effective, we manually crafted a greedy policy over the single-vehicle macro-actions. [sent-299, score-0.622]
97 This greedy policy always searches for the vehicle that has not been visited for the longest duration. [sent-300, score-0.495]
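Under the same per-vehicle counters assumed above, this hand-crafted baseline reduces to picking the stalest vehicle:

```python
def greedy_macro_choice(steps_since_visit):
    """Greedy baseline: search for the vehicle that has not been visited for the
    longest duration. Returns the vehicle index, i.e., which per-vehicle search
    macro-action to invoke (the counter representation is an assumption)."""
    return max(range(len(steps_since_visit)), key=lambda i: steps_since_visit[i])
```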
98 The experimental results indicate that the higher-level policy computed by Macro-MCVI is more effective than the greedy policy. [sent-301, score-0.311]
99 Motion planning under uncertainty for robotic tasks with long time horizons. [sent-374, score-0.288]
100 SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. [sent-383, score-0.377]
wordName wordTfidf (topN-words)
[('mcvi', 0.481), ('policy', 0.311), ('pomdp', 0.247), ('planning', 0.234), ('posmdp', 0.196), ('vehicle', 0.184), ('pomdps', 0.171), ('uav', 0.16), ('belief', 0.143), ('auv', 0.143), ('puma', 0.143), ('sarsop', 0.125), ('underwater', 0.125), ('backup', 0.122), ('reward', 0.108), ('si', 0.098), ('vm', 0.098), ('state', 0.096), ('primitive', 0.095), ('crocodiles', 0.094), ('navigation', 0.086), ('vg', 0.078), ('marking', 0.078), ('reweighted', 0.077), ('macroaction', 0.071), ('macromcvi', 0.071), ('rescue', 0.071), ('vehicular', 0.071), ('controller', 0.071), ('observable', 0.068), ('vehicles', 0.068), ('move', 0.063), ('east', 0.061), ('location', 0.06), ('termination', 0.059), ('hb', 0.059), ('graph', 0.058), ('crocodile', 0.058), ('south', 0.057), ('action', 0.056), ('actions', 0.055), ('robotic', 0.054), ('agent', 0.054), ('navigates', 0.053), ('hierarchically', 0.053), ('observation', 0.052), ('sj', 0.051), ('va', 0.051), ('north', 0.05), ('carlo', 0.048), ('abstraction', 0.048), ('history', 0.047), ('monte', 0.047), ('beacon', 0.047), ('robotics', 0.047), ('search', 0.046), ('agents', 0.044), ('gm', 0.044), ('hsu', 0.043), ('beliefs', 0.043), ('ra', 0.043), ('horizon', 0.042), ('node', 0.041), ('hv', 0.041), ('macro', 0.041), ('pineau', 0.041), ('spaces', 0.039), ('border', 0.038), ('singapore', 0.038), ('robot', 0.038), ('stay', 0.037), ('controllers', 0.037), ('destinations', 0.036), ('escaped', 0.036), ('hvg', 0.036), ('posmdps', 0.036), ('roam', 0.036), ('rocks', 0.036), ('unmanned', 0.036), ('localize', 0.035), ('west', 0.034), ('collaborative', 0.033), ('ln', 0.032), ('iteration', 0.032), ('online', 0.032), ('grid', 0.032), ('aircraft', 0.031), ('kurniawati', 0.031), ('visit', 0.031), ('started', 0.031), ('mdps', 0.03), ('solvers', 0.03), ('value', 0.029), ('terrain', 0.029), ('avoidance', 0.029), ('destination', 0.029), ('temporal', 0.028), ('particle', 0.028), ('retain', 0.028), ('partially', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Wei Lim, David Hsu, Wee Sun Lee
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
2 0.26604065 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.2243984 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
4 0.19379093 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
5 0.19040303 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
6 0.18916936 215 nips-2011-Policy Gradient Coagent Networks
7 0.16045538 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
8 0.13160369 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
9 0.12380784 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
10 0.12205539 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
11 0.11164048 41 nips-2011-Autonomous Learning of Action Models for Planning
12 0.10791271 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
13 0.10388498 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
14 0.10364238 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
15 0.095090784 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
16 0.090663642 283 nips-2011-The Fixed Points of Off-Policy TD
17 0.089018233 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
18 0.088134684 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
19 0.074331366 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
20 0.068208426 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
topicId topicWeight
[(0, 0.178), (1, -0.183), (2, 0.084), (3, 0.254), (4, -0.256), (5, 0.037), (6, -0.045), (7, 0.007), (8, -0.026), (9, -0.049), (10, 0.011), (11, 0.005), (12, -0.102), (13, -0.054), (14, -0.089), (15, -0.149), (16, 0.165), (17, 0.136), (18, 0.045), (19, -0.015), (20, 0.047), (21, -0.056), (22, 0.036), (23, 0.027), (24, 0.066), (25, 0.018), (26, -0.042), (27, 0.059), (28, 0.026), (29, -0.05), (30, 0.006), (31, -0.004), (32, -0.008), (33, -0.001), (34, 0.066), (35, 0.032), (36, 0.039), (37, 0.01), (38, -0.046), (39, -0.023), (40, 0.048), (41, 0.046), (42, -0.032), (43, -0.019), (44, -0.008), (45, 0.037), (46, -0.037), (47, 0.002), (48, 0.059), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.95356113 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Wei Lim, David Hsu, Wee Sun Lee
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
2 0.91264766 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
3 0.80739242 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
4 0.78397864 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
5 0.7541343 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
6 0.72762704 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
7 0.70514393 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
8 0.70290095 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
9 0.67947233 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
10 0.65476424 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
11 0.63382995 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
12 0.57268083 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
13 0.52320951 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
14 0.51948977 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
15 0.49531814 256 nips-2011-Solving Decision Problems with Limited Information
16 0.47810709 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
17 0.45990318 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
18 0.44274187 41 nips-2011-Autonomous Learning of Action Models for Planning
19 0.41860369 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
20 0.40873492 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
topicId topicWeight
[(0, 0.04), (4, 0.041), (20, 0.029), (26, 0.015), (31, 0.125), (33, 0.012), (37, 0.342), (43, 0.042), (45, 0.064), (57, 0.042), (74, 0.058), (83, 0.034), (99, 0.061)]
simIndex simValue paperId paperTitle
1 0.83243722 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
same-paper 2 0.74910688 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
3 0.52266467 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
4 0.49516419 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
5 0.45344415 75 nips-2011-Dynamical segmentation of single trials from population neural data
Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani
Abstract: Simultaneous recordings of many neurons embedded within a recurrentlyconnected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1
6 0.45096326 249 nips-2011-Sequence learning with hidden units in spiking neural networks
7 0.45034769 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.45029229 229 nips-2011-Query-Aware MCMC
9 0.4484854 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
10 0.44689739 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
11 0.44472638 102 nips-2011-Generalised Coupled Tensor Factorisation
12 0.44389999 301 nips-2011-Variational Gaussian Process Dynamical Systems
13 0.44378948 221 nips-2011-Priors over Recurrent Continuous Time Processes
14 0.44355214 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
15 0.44322184 66 nips-2011-Crowdclustering
16 0.44194514 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
17 0.44084984 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.4404088 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
20 0.43951333 180 nips-2011-Multiple Instance Filtering