nips nips2003 nips2003-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. [sent-5, score-0.428]
2 We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. [sent-7, score-0.482]
3 We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently. [sent-8, score-0.776]
4 1 Introduction A popular approach to artificial intelligence is to model an agent and its interaction with its environment through actions, perceptions, and rewards [10]. [sent-9, score-0.202]
5 Intelligent agents should choose actions after every perception, such that their long-term reward is maximized. [sent-10, score-0.264]
6 Unfortunately, solving POMDPs is an intractable problem, mainly because exact solutions rely on computing a policy over the entire belief space [6, 3], which is a simplex of dimension equal to the number of states in the underlying Markov decision process (MDP). [sent-12, score-0.318]
7 Recently researchers have proposed algorithms that take advantage of the fact that for most POMDP problems, a large proportion of the belief space is not experienced [7, 9]. [sent-13, score-0.462]
8 In this paper we explore the same idea, but in combination with the notion of temporally extended actions (macro-actions). [sent-14, score-0.178]
9 We propose and investigate a new model-based reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. [sent-15, score-0.482]
10 We apply our algorithm to large scale robot navigation and demonstrate the various advantages of macro-actions in POMDPs. [sent-16, score-0.261]
11 Our experimental results show that with macro-actions an agent experiences a significantly smaller part of the belief space than with simple primitive actions. [sent-17, score-0.849]
12 In addition, learning is faster because an agent can look further into the future and propagate values of belief points faster. [sent-18, score-0.598]
13 And finally, well-designed macros, such as macros that can easily take an agent from a high-entropy belief state to a low-entropy belief state, allow the agent to gather information more efficiently. [sent-19, score-1.308]
14 2 POMDP Planning with Macros We now describe our algorithm for finding an approximately optimal plan for a known POMDP with macro actions. [sent-22, score-0.391]
15 It works by using a dynamically-created finite-grid approximation to the belief space, and then using model-based reinforcement learning to compute a value function at the grid points. [sent-23, score-0.7]
16 Our algorithm takes as input a POMDP model, a resolution r, and a set of macro-actions (described as policies or finite state automata). [sent-24, score-0.451]
17 The output is a set of grid-points (in belief space) and their associated action-values, which via interpolation specify an action-value function over the entire belief space, and therefore a complete policy for the POMDP. [sent-25, score-0.979]
18 Dynamic Grid Approximation A standard method of finding approximate solutions to POMDPs is to discretize the belief space by covering it with a uniformly-spaced grid (otherwise called a regular grid, as shown in Figure 1), and then solve an MDP that takes those grid points as states [1]. [sent-26, score-1.424]
19 Unfortunately, the number of grid points required rises exponentially in the number of dimensions in the belief space, which corresponds to the number of states in the original space. [sent-27, score-0.762]
20 Recent studies have shown that in many cases, an agent actually travels through a very small subpart of its entire belief space. [sent-28, score-0.554]
21 Roy and Gordon [9] find a low-dimensional subspace of the original belief space, then discretize that uniformly to get an MDP approximation to the original POMDP. [sent-29, score-0.436]
22 Figure 1: The figure depicts various regular discretizations of a 3-dimensional belief simplex. [sent-35, score-0.482]
23 The belief-space is the surface of the triangle, while grid points are the intersections of the lines drawn within the triangles. [sent-36, score-0.296]
24 Using resolutions that are powers of 2 allows finer discretizations to include the points of coarser discretizations. [sent-37, score-0.368]
25 In our work, we allocate grid points from a uniformly-spaced grid dynamically by simulating trajectories of the agent through the belief space. [sent-38, score-1.09]
26 At each belief state experienced, we find the grid point that is closest to that belief state and add it to the set of grid points that we explicitly consider. [sent-39, score-1.637]
27 In this way, we develop a set of grid points that is typically a very small subset of the entire possible grid, which is adapted to the parts of the belief space typically inhabited by the agent. [sent-40, score-0.752]
28 In particular, given a grid resolution r and a belief state b, we can compute the coordinates (grid points gi) of the belief sub-simplex that contains b using an efficient method called Freudenthal triangulation [2]. [sent-41, score-1.816]
29 In addition to the vertices of a sub-simplex, Freudenthal triangulation also produces barycentric coordinates λi, with respect to gi, which enable effective interpolation for the value of the belief state b from the values of the grid points gi [1]. [sent-42, score-1.574]
30 Using the barycentric coordinates we can also decide which is the closest grid-point to be added to the state space. [sent-43, score-0.371]
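This discretization step can be sketched concretely. The following is a minimal, illustrative implementation of the Freudenthal-style construction described above: it maps a belief to the vertices of its containing sub-simplex and their barycentric weights, and snaps to the closest grid point. It assumes beliefs are numpy arrays; all function and variable names are ours, not the authors' code.

```python
import numpy as np

def freudenthal_coords(b, r):
    """Vertices of the sub-simplex containing belief b at resolution r,
    with their barycentric weights (Freudenthal triangulation)."""
    b = np.asarray(b, dtype=float)
    n = len(b)
    # Transform to non-increasing lattice coordinates: x_i = r * sum_{j>=i} b_j.
    x = r * np.cumsum(b[::-1])[::-1]
    base = np.floor(x)
    d = x - base                       # fractional parts (d[0] == 0 since x[0] == r)
    order = np.argsort(-d)             # coordinates sorted by decreasing fractional part
    # Walk up the Freudenthal simplex: each vertex adds one unit coordinate.
    verts_t = [base.copy()]
    for idx in order:
        v = verts_t[-1].copy()
        v[idx] += 1.0
        verts_t.append(v)
    # Barycentric weights are differences of the sorted fractional parts.
    d_sorted = d[order]
    weights = np.concatenate(([1.0 - d_sorted[0]],
                              d_sorted[:-1] - d_sorted[1:],
                              [d_sorted[-1]]))
    # Back-transform each vertex to a grid belief: g_i = (v_i - v_{i+1}) / r, g_n = v_n / r.
    verts = [np.append(v[:-1] - v[1:], v[-1]) / r for v in verts_t]
    # Drop zero-weight vertices (at least one always has weight 0 since d[0] == 0).
    keep = weights > 1e-12
    return [v for v, ok in zip(verts, keep) if ok], weights[keep]

def closest_grid_point(b, r):
    """Grid point with the largest barycentric coordinate for belief b."""
    verts, weights = freudenthal_coords(b, r)
    return verts[int(np.argmax(weights))]
```

For example, with `b = [0.7, 0.3]` and `r = 2` the sub-simplex vertices are (1, 0) and (0.5, 0.5) with weights 0.4 and 0.6, which interpolate back to b exactly.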
31 An SMDP (semi-Markov decision process) is defined as a five-tuple (S, A, P, R, F), where S is a finite set of states, A is the set of actions, P is the state and action transition probability function, R is the reward function, and F is a function giving the probability of transition times for each state-action pair. [sent-45, score-0.314]
32 Discrete-time SMDPs represent transition distributions as F(s', N | s, a), which specifies the expected number of steps N that action a will take before terminating in state s', starting in state s. [sent-48, score-0.493]
33 The Q-learning rule for discrete-time discounted SMDPs is Q_{t+1}(s, a) ← (1 − β) Q_t(s, a) + β [R + γ^k max_{a'∈A(s')} Q_t(s', a')], where β ∈ (0, 1), and action a was initiated in state s, lasted for k steps, and terminated in state s', while generating a total discounted sum of rewards of R. [sent-50, score-0.491]
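As a point of reference, the discrete-time SMDP Q-learning backup above fits in a few lines. This is a generic, table-based sketch with names of our choosing, not the authors' implementation.

```python
from collections import defaultdict

def smdp_q_update(Q, s, a, R, k, s_next, actions, beta=0.1, gamma=0.95):
    """One SMDP Q-learning backup: action a was initiated in state s,
    lasted k steps, produced total discounted reward R, and terminated
    in state s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - beta) * Q[(s, a)] + beta * (R + gamma ** k * best_next)

# Example usage: Q = defaultdict(float); smdp_q_update(Q, s0, a0, R, k, s1, actions)
```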
34 For example, in a robot navigation task modeled as a POMDP, macro actions can consist of small state machines, such as a simple policy for driving down a corridor without hitting the walls until the end is reached. [sent-53, score-1.152]
35 Such actions may have the useful property of reducing the entropy of the belief space, by helping a robot to localize its position. [sent-54, score-0.73]
36 In addition, they relieve us of the burden of having to choose another primitive action based on the new belief state. [sent-55, score-0.782]
37 Using macro actions tends to reduce the number of belief states that are visited by the agent. [sent-56, score-0.978]
38 If a robot navigates largely by using macro-actions to move to important landmarks, it will never be necessary to model the belief states that are concerned with where the robot is within a corridor, for example. [sent-57, score-0.792]
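For concreteness, a corridor-following macro of the kind described above can be written as a tiny policy plus a termination test. The observation encoding (four wall/opening bits) and the action name are assumptions made for this illustration only.

```python
def go_down_corridor(observation):
    """Illustrative 'drive down the corridor' macro-action.

    observation is assumed to be a 4-tuple of wall indicators
    (front, back, left, right), with 1 = wall and 0 = opening.
    Returns a primitive action name, or None when the macro terminates."""
    front_wall = observation[0]      # 1 = wall ahead, 0 = opening
    if front_wall:
        return None                  # end of the corridor reached: terminate
    return "go-forward"              # otherwise keep moving down the corridor
```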
39 Algorithm Our algorithm works by building a grid-based approximation of the belief space while executing a policy made up of macro actions. [sent-58, score-0.923]
40 The policy is determined by “solving” the finite MDP over the grid points. [sent-59, score-0.362]
41 Computing a policy over grid points equally spaced in the belief simplex, otherwise called regular discretization, is computationally intractable since the number of grid-points grows exponentially with the resolution [2]. [sent-60, score-1.141]
42 Nonetheless, the value of a belief point in a regular discretization can be interpolated efficiently from the values of the neighboring grid-points [2]. [sent-61, score-0.565]
43 On the other hand, in variable resolution non-regular grids, interpolation can be computationally expensive [1]. [sent-62, score-0.296]
44 A better approach is variable resolution with regular discretization, which takes advantage of fast interpolation and increases resolution only in the necessary areas [12]. [sent-63, score-0.685]
45 Our approach falls in this last category with the addition of macro-actions, which exhibit various advantages over approaches using primitive actions only. [sent-64, score-0.456]
46 It works by generating trajectories through the belief space according to the current policy, with some added exploration. [sent-66, score-0.447]
47 Reinforcement learning using a model, otherwise called real-time dynamic programming (RTDP), is not only better suited for huge spaces but in our case is also convenient for estimating the necessary models of our macro-actions over the experienced grid points. [sent-67, score-0.319]
48 This is the true physical location of the agent, and it should have support in the current belief state b (that is, b(s) > 0). [sent-70, score-0.529]
49 Discretize the current belief state b → gi , where gi is the closest grid-point (with the maximum barycentric coordinate) in a regular discretization of the belief space. [sent-72, score-1.692]
50 If the resolution is 1, initialize its value to zero; otherwise, interpolate its initial value from coarser resolutions. [sent-74, score-0.37]
51 Figure 2: The agent finds itself at a belief state b. [sent-75, score-0.655]
52 It maps b to the grid point g, which has the largest barycentric coordinate among the vertices of the sub-simplex that contains b. [sent-76, score-0.443]
53 Now, it needs to do a value backup for that grid point. [sent-77, score-0.283]
54 It chooses a macro-action and executes it starting from the chosen grid-point, using the primitive actions it takes and the observations it receives along the way to update its belief state. [sent-78, score-1.429]
55 It needs to get a value estimate for the resulting belief state b'. [sent-79, score-0.529]
56 It does so by using the barycentric coordinates from the grid to interpolate a value from nearby grid points g1, g2, and g3. [sent-80, score-0.762]
57 In case the nearest grid-point gi is missing, it is interpolated from coarser resolutions and added to the representation. [sent-81, score-0.177]
58 If the resolution is 1, the value of gi is initialized to zero. [sent-82, score-0.483]
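A hedged sketch of the coarse-to-fine initialization just described, reusing the freudenthal_coords helper sketched earlier. The hash-table layout (values keyed by resolution and integer grid coordinates) and all names are our assumptions. Since resolutions are powers of 2, the next coarser grid is at resolution r // 2.

```python
import numpy as np

def grid_key(g, r):
    """Hashable key for a grid belief g at resolution r (integer counts)."""
    return tuple(int(round(x * r)) for x in g)

def initial_value(g, r, V):
    """Initial value of a newly added grid point g at resolution r: zero at
    resolution 1, otherwise interpolated from the next coarser resolution
    r // 2.  V maps (resolution, grid_key) -> value; coarse points missing
    from V default to 0 in this sketch."""
    if r == 1:
        return 0.0
    rc = r // 2
    verts, lams = freudenthal_coords(np.asarray(g, dtype=float), rc)
    return float(sum(lam * V.get((rc, grid_key(v, rc)), 0.0)
                     for v, lam in zip(verts, lams)))
```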
59 The agent executes the macro-action from the same grid point g multiple times so that it can approximate the probability distribution over the resulting belief states b'. [sent-83, score-0.407]
60 Finally, it can update the estimated value of the grid point g and execute the macro-action chosen from the true belief state b. [sent-84, score-0.836]
61 The process repeats from the next true belief state b'. [sent-85, score-0.529]
62 Estimate E[R(gi, µ) + γ^t V(b')] by sampling: (a) Sample a state s from the current grid-belief state gi (which, like all belief states, represents a probability distribution over world states). [sent-90, score-0.978]
63 Choose the appropriate primitive action a according to macro-action µ. [sent-93, score-0.392]
64 Update the belief state: b'(j) := α O(a, j, z) Σ_{i∈S} b(i) T(i, a, j), for all states j, where α is a normalizing factor. [sent-102, score-0.466]
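The belief update above is the standard Bayes filter for a POMDP. A vectorized sketch, assuming T[a] is an |S|×|S| matrix with T[a][i, j] = P(j | i, a) and O[a] an |S|×|Z| matrix with O[a][j, z] = P(z | j, a); this data layout is our assumption.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """b'(j) proportional to O(a, j, z) * sum_i b(i) T(i, a, j)."""
    b = np.asarray(b, dtype=float)
    bp = O[a][:, z] * (T[a].T @ b)   # unnormalized posterior over states j
    return bp / bp.sum()             # alpha is the normalizing factor
```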
65 (b) Compute the value of the resulting belief state b' by interpolating over the |S|+1 vertices in the resulting belief sub-simplex: V(b') = Σ_i λi V(gi). [sent-105, score-1.015]
66 If the closest grid-point (with the maximum barycentric coordinate) is missing, interpolate it from coarser resolutions, and add it to the hash-table. [sent-106, score-0.281]
67 Update the state-action value: Q(gi, µ) = (1 − β) Q(gi, µ) + β [R + γ^t V(b')]. [sent-109, score-0.226]
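Putting the sampling estimate and the update together, the backup at a grid point might look as follows. simulate_macro is assumed to sample a world state from the grid belief, roll the macro-action forward through the POMDP model, and return the discounted reward R, the duration t, and the resulting belief b'; freudenthal_coords and grid_key are the helpers sketched earlier. This is an illustrative reading of the steps above, not the authors' code.

```python
import numpy as np

def value_of(b, r, V):
    """Interpolated value of belief b from the grid values V at resolution r.
    (Missing grid points default to 0 here; the paper instead interpolates
    them from coarser resolutions and adds them to the hash-table.)"""
    verts, lams = freudenthal_coords(np.asarray(b, dtype=float), r)
    return float(sum(lam * V.get((r, grid_key(v, r)), 0.0)
                     for v, lam in zip(verts, lams)))

def backup_grid_point(g, mu, r, Q, V, simulate_macro,
                      n_samples=20, beta=0.1, gamma=0.95):
    """Monte Carlo estimate of E[R(g, mu) + gamma^t V(b')] followed by
    Q(g, mu) := (1 - beta) Q(g, mu) + beta [R + gamma^t V(b')]."""
    est = 0.0
    for _ in range(n_samples):
        R, t, b_next = simulate_macro(g, mu)     # one rollout of the macro
        est += R + gamma ** t * value_of(b_next, r, V)
    est /= n_samples
    key = (grid_key(g, r), mu)
    Q[key] = (1.0 - beta) * Q.get(key, 0.0) + beta * est
```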
68 Execute the macro-action µ starting from belief state b until termination. [sent-113, score-0.575]
69 During execution, generate observations by sampling the POMDP model, starting from the true state s. [sent-114, score-0.212]
70 3 Experimental Results We tested this algorithm by applying it to the problem of robot navigation, which is a classic sequential decision-making problem under uncertainty. [sent-118, score-0.193]
71 In such a representation, the robot can move through the different environment states by taking actions such as “go-forward”, “turn-left”, and “turn-right”. [sent-121, score-0.43]
72 In this navigation domain, our robot can only perceive sixteen possible observations, which indicate the presence of a wall or an opening on each of the four sides of the robot. [sent-124, score-0.264]
73 The POMDP model of the corridor environment has a reward function with value -1 in every state, except for -100 for going forward into a wall and +100 for taking any action from the four-way junction. [sent-126, score-0.362]
74 When the average number of training steps stabilized, we increased the resolution by multiplying it by 2. [sent-131, score-0.374]
75 Each training episode started from the uniform initial belief state and was terminated when the four-way junction was reached or when more than 200 steps were taken. [sent-133, score-0.761]
76 We compared the results with the QMDP heuristic, which first solves the underlying MDP and then, given any belief state, chooses the action that maximizes the dot product of the belief and the Q-values of state-action pairs: a_QMDP = argmax_a Σ_{s=1}^{|S|} b(s) Q(s, a). [sent-135, score-1.093]
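The QMDP baseline is easy to state in code. A sketch assuming Q_mdp is an |S|×|A| array of Q-values of the underlying MDP; the layout and names are our assumptions.

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """QMDP heuristic: argmax_a sum_s b(s) * Q_MDP(s, a)."""
    return int(np.argmax(np.asarray(b) @ Q_mdp))
```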
77 An exception is when the resolution is 1, where training with only primitive actions requires a small number of steps too. [sent-137, score-0.83]
78 Nonetheless, as we increase the resolution, training with primitive actions only does not scale well, because the number of states increases dramatically. [sent-138, score-0.575]
79 In general, the number of grid points used with or without macro-actions is significantly smaller than the total number of points allowed for regular discretization. [sent-139, score-0.436]
80 For example, for a regular discretization the number of grid points can be computed by the formula given in [2]: (r+|S|−1)!/(r!(|S|−1)!). [sent-140, score-0.441]
81 Training with macro-actions uses only about 3000 grid points, while training with primitive actions only uses about 6500 grid points. [sent-145, score-0.855]
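For reference, the regular-grid point count given by the formula above is the binomial coefficient C(r + |S| − 1, |S| − 1); a small sketch computing it (the function name is ours):

```python
from math import comb

def n_regular_grid_points(r, n_states):
    """(r + |S| - 1)! / (r! (|S| - 1)!) -- number of points in a regular grid
    of resolution r over the belief simplex of |S| = n_states states."""
    return comb(r + n_states - 1, n_states - 1)

# e.g. n_regular_grid_points(2, 10) == 55; the count grows rapidly with both
# r and |S|, which is why only the dynamically visited grid points are stored.
```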
82 The sharp changes in the graph are due to the resolution increase. [sent-148, score-0.278]
83 We tested the policies that resulted from each algorithm by starting from a uniform initial belief state and a uniformly randomly chosen world state and simulating the greedy policy derived by interpolating the grid value function. [sent-149, score-1.269]
84 A run was considered a success if the robot was able to reach the goal in fewer than 200 steps. [sent-151, score-0.225]
85 Figure 5: The figure on the left shows the success percentage for the different methods during testing. (The right panel plots testing steps to the goal against resolution for the primitive-action and macro-action methods.) [sent-152, score-1.514]
86 For the primitive-actions only algorithm we report the result for resolution 1 only, since it was as successful as the macro-action algorithm. [sent-155, score-0.304]
87 From Figure 5 we can conclude that the QMDP approach can never be 100% successful, while the primitive-actions algorithm can perform quite well with resolution 1 in this environment. [sent-156, score-0.279]
88 In addition, when we compared the average number of testing steps for resolution 1, the macro-action algorithm seems to have learned a better policy. [sent-158, score-0.361]
89 The macro-action policy seems to get worse for resolution 4 due to the increasing number of grid-points added to the representation. [sent-159, score-0.506]
90 Figure 6 shows that with primitive actions only, the algorithm fails completely. [sent-162, score-0.486]
91 Our algorithm is able to solve a difficult planning problem, namely the task of navigating to a goal in a POMDP with a huge state space, starting from a uniform initial belief, which is more difficult than many of the tasks that similar algorithms are tested on. [sent-168, score-0.208]
92 In general, macro-actions in POMDPs allow us to experience a smaller part of the state space, back up values faster, and do information gathering. [sent-170, score-0.174]
93 As a result, we can afford a higher grid resolution, which results in better performance. [sent-171, score-0.497]
94 We cannot do this with only primitive actions (unless we use reward shaping), and it is completely out of the question for an exact solution over the entire regular grid. [sent-172, score-0.674]
95 On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision processes. [sent-187, score-0.174]
96 For the QMDP approach, the robot starts from START with a uniform initial belief. [sent-190, score-0.2]
97 After reaching J2, the belief becomes bi-modal, concentrating on J1 and J2. [sent-191, score-0.417]
98 On the other hand, with our planning algorithm, the robot again starts from START with a uniform initial belief. [sent-193, score-0.267]
99 Upon reaching J2, the belief becomes bi-modal over J1 and J2. [sent-194, score-0.417]
100 The agent resolves its uncertainty by deciding that the best action to take is the go-down-the-corridor macro, at which point it encounters J3 and localizes. [sent-195, score-0.213]
wordName wordTfidf (topN-words)
[('belief', 0.39), ('macro', 0.361), ('primitive', 0.305), ('resolution', 0.249), ('grid', 0.248), ('gi', 0.234), ('pomdp', 0.197), ('robot', 0.163), ('actions', 0.151), ('state', 0.139), ('agent', 0.126), ('barycentric', 0.125), ('qmdp', 0.12), ('policy', 0.114), ('corridor', 0.114), ('pomdps', 0.106), ('mdp', 0.102), ('regular', 0.092), ('reward', 0.088), ('action', 0.087), ('steps', 0.082), ('states', 0.076), ('macros', 0.072), ('smdp', 0.072), ('smdps', 0.072), ('coarser', 0.071), ('navigation', 0.068), ('shaping', 0.067), ('planning', 0.067), ('gathering', 0.063), ('interpolating', 0.063), ('reinforcement', 0.062), ('success', 0.062), ('oor', 0.053), ('discretization', 0.053), ('interpolate', 0.05), ('episodes', 0.05), ('decision', 0.048), ('dicretization', 0.048), ('nomad', 0.048), ('theocharous', 0.048), ('points', 0.048), ('interpolation', 0.047), ('starting', 0.046), ('discretize', 0.046), ('experienced', 0.044), ('qt', 0.044), ('coordinates', 0.043), ('training', 0.043), ('simplex', 0.042), ('resolutions', 0.042), ('discretetime', 0.042), ('walls', 0.042), ('environment', 0.04), ('entire', 0.038), ('episode', 0.038), ('uniform', 0.037), ('intelligence', 0.036), ('closest', 0.035), ('backup', 0.035), ('interpolated', 0.035), ('sixteenth', 0.035), ('faster', 0.034), ('executes', 0.033), ('roy', 0.033), ('topological', 0.033), ('wall', 0.033), ('triangulation', 0.033), ('policies', 0.033), ('gure', 0.033), ('observable', 0.033), ('vertices', 0.033), ('terminated', 0.032), ('argmax', 0.032), ('arti', 0.032), ('markov', 0.031), ('simulating', 0.03), ('execute', 0.03), ('algorithm', 0.03), ('cial', 0.03), ('graph', 0.029), ('abstraction', 0.029), ('added', 0.029), ('update', 0.029), ('repeat', 0.028), ('nonetheless', 0.028), ('space', 0.028), ('observations', 0.027), ('gordon', 0.027), ('reaching', 0.027), ('temporally', 0.027), ('missing', 0.027), ('dynamic', 0.027), ('coordinate', 0.027), ('discounted', 0.026), ('entropy', 0.026), ('partially', 0.026), ('lab', 0.025), ('agents', 0.025), ('successful', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
2 0.40365198 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
3 0.21307616 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
4 0.1893198 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
5 0.1764545 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
Author: Curt Bererton, Geoffrey J. Gordon, Sebastian Thrun
Abstract: The design of cooperative multi-robot systems is a highly active research area in robotics. Two lines of research in particular have generated interest: the solution of large, weakly coupled MDPs, and the design and implementation of market architectures. We propose a new algorithm which joins together these two lines of research. For a class of coupled MDPs, our algorithm automatically designs a market architecture which causes a decentralized multi-robot system to converge to a consistent policy. We can show that this policy is the same as the one which would be produced by a particular centralized planning algorithm. We demonstrate the new algorithm on three simulation examples: multi-robot towing, multi-robot path planning with a limited fuel resource, and coordinating behaviors in a game of paint ball. 1
6 0.16095364 158 nips-2003-Policy Search by Dynamic Programming
7 0.14801075 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.14646687 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
9 0.14376576 62 nips-2003-Envelope-based Planning in Relational MDPs
10 0.14029106 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
11 0.12937763 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
12 0.12604265 68 nips-2003-Eye Movements for Reward Maximization
13 0.11890547 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
14 0.11383439 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
15 0.11287381 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
16 0.10958523 32 nips-2003-Approximate Expectation Maximization
17 0.10259564 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
18 0.089917906 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
19 0.087608352 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
20 0.085440412 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
topicId topicWeight
[(0, -0.241), (1, 0.361), (2, -0.207), (3, 0.136), (4, -0.106), (5, -0.141), (6, -0.095), (7, -0.025), (8, 0.115), (9, 0.025), (10, 0.074), (11, -0.065), (12, 0.054), (13, 0.079), (14, 0.032), (15, 0.051), (16, 0.1), (17, 0.027), (18, 0.075), (19, 0.05), (20, -0.03), (21, 0.048), (22, 0.038), (23, -0.044), (24, 0.219), (25, 0.063), (26, -0.13), (27, 0.061), (28, -0.051), (29, 0.102), (30, 0.003), (31, 0.072), (32, -0.048), (33, -0.073), (34, 0.012), (35, 0.012), (36, 0.129), (37, -0.043), (38, 0.142), (39, 0.051), (40, 0.086), (41, -0.005), (42, -0.028), (43, 0.046), (44, -0.095), (45, 0.054), (46, 0.038), (47, 0.13), (48, -0.013), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.97916287 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
2 0.83658767 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
3 0.72637922 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
4 0.57869965 14 nips-2003-A Nonlinear Predictive State Representation
Author: Matthew R. Rudary, Satinder P. Singh
Abstract: Predictive state representations (PSRs) use predictions of a set of tests to represent the state of controlled dynamical systems. One reason why this representation is exciting as an alternative to partially observable Markov decision processes (POMDPs) is that PSR models of dynamical systems may be much more compact than POMDP models. Empirical work on PSRs to date has focused on linear PSRs, which have not allowed for compression relative to POMDPs. We introduce a new notion of tests which allows us to define a new type of PSR that is nonlinear in general and allows for exponential compression in some deterministic dynamical systems. These new tests, called e-tests, are related to the tests used by Rivest and Schapire [1] in their work with the diversity representation, but our PSR avoids some of the pitfalls of their representation—in particular, its potential to be exponentially larger than the equivalent POMDP. 1
5 0.5641464 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
Author: Curt Bererton, Geoffrey J. Gordon, Sebastian Thrun
Abstract: The design of cooperative multi-robot systems is a highly active research area in robotics. Two lines of research in particular have generated interest: the solution of large, weakly coupled MDPs, and the design and implementation of market architectures. We propose a new algorithm which joins together these two lines of research. For a class of coupled MDPs, our algorithm automatically designs a market architecture which causes a decentralized multi-robot system to converge to a consistent policy. We can show that this policy is the same as the one which would be produced by a particular centralized planning algorithm. We demonstrate the new algorithm on three simulation examples: multi-robot towing, multi-robot path planning with a limited fuel resource, and coordinating behaviors in a game of paint ball. 1
6 0.50923294 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
7 0.50640517 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
8 0.48313645 158 nips-2003-Policy Search by Dynamic Programming
9 0.46222892 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
10 0.45832488 32 nips-2003-Approximate Expectation Maximization
11 0.45744792 68 nips-2003-Eye Movements for Reward Maximization
12 0.44499674 62 nips-2003-Envelope-based Planning in Relational MDPs
13 0.44288468 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
14 0.41445342 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
15 0.40306324 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
16 0.40256178 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
17 0.38043866 78 nips-2003-Gaussian Processes in Reinforcement Learning
18 0.38020885 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
19 0.35840726 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
20 0.35508651 21 nips-2003-An Autonomous Robotic System for Mapping Abandoned Mines
topicId topicWeight
[(0, 0.123), (11, 0.012), (29, 0.302), (30, 0.034), (35, 0.075), (53, 0.06), (71, 0.045), (76, 0.039), (85, 0.07), (91, 0.122), (99, 0.029)]
simIndex simValue paperId paperTitle
1 0.92797142 130 nips-2003-Model Uncertainty in Classical Conditioning
Author: Aaron C. Courville, Geoffrey J. Gordon, David S. Touretzky, Nathaniel D. Daw
Abstract: We develop a framework based on Bayesian model averaging to explain how animals cope with uncertainty about contingencies in classical conditioning experiments. Traditional accounts of conditioning fit parameters within a fixed generative model of reinforcer delivery; uncertainty over the model structure is not considered. We apply the theory to explain the puzzling relationship between second-order conditioning and conditioned inhibition, two similar conditioning regimes that nonetheless result in strongly divergent behavioral outcomes. According to the theory, second-order conditioning results when limited experience leads animals to prefer a simpler world model that produces spurious correlations; conditioned inhibition results when a more complex model is justified by additional experience. 1
same-paper 2 0.85798222 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
3 0.75171751 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
Author: David Hoyle, Magnus Rattray
Abstract: We derive the limiting form of the eigenvalue spectrum for sample covariance matrices produced from non-isotropic data. For the analysis of standard PCA we study the case where the data has increased variance along a small number of symmetry-breaking directions. The spectrum depends on the strength of the symmetry-breaking signals and on a parameter α which is the ratio of sample size to data dimension. Results are derived in the limit of large data dimension while keeping α fixed. As α increases there are transitions in which delta functions emerge from the upper end of the bulk spectrum, corresponding to the symmetry-breaking directions in the data, and we calculate the bias in the corresponding eigenvalues. For kernel PCA the covariance matrix in feature space may contain symmetry-breaking structure even when the data components are independently distributed with equal variance. We show examples of phase-transition behaviour analogous to the PCA results in this case. 1
4 0.61065352 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
Author: Milos Hauskrecht, Branislav Kveton
Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors. 1
5 0.60581452 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
6 0.60325301 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
7 0.59506387 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.59406865 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
9 0.59277177 12 nips-2003-A Model for Learning the Semantics of Pictures
10 0.58693767 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
11 0.58425647 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
12 0.5828141 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
13 0.58208424 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
14 0.58040857 66 nips-2003-Extreme Components Analysis
15 0.57326734 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
16 0.57277524 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
17 0.56925523 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
18 0.56707644 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
19 0.56389207 158 nips-2003-Policy Search by Dynamic Programming
20 0.56382263 172 nips-2003-Semi-Supervised Learning with Trees