nips nips2003 nips2003-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
Reference: text
sentIndex sentText sentNum sentScore
1 Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima). [sent-6, score-1.092]
2 1 Introduction Finite state controllers (FSCs) provide a simple, convenient way of representing policies for partially observable Markov decision processes (POMDPs). [sent-7, score-0.449]
3 Two general approaches are often used to construct good controllers: policy iteration (PI) [7] and gradient ascent (GA) [10, 11, 1]. [sent-8, score-0.291]
4 The former is guaranteed to converge to an optimal policy; however, the size of the controller often grows intractably. [sent-9, score-0.297]
5 In contrast, the latter restricts its search to controllers of a bounded size, but may get trapped in a local optimum. [sent-10, score-0.357]
6 Consider a system engaged in preference elicitation, charged with discovering an optimal query policy to determine relevant aspects of a user’s utility function. [sent-12, score-0.252]
7 If each question has a cost, a system that locally optimizes the policy by GA may determine that the best course of action is to ask no questions. [sent-14, score-0.311]
8 When an optimal policy consists of a sequence of actions such that any small perturbation of it yields a bad policy, there is little hope of finding this sequence with methods, such as GA, that greedily perform local perturbations. [sent-17, score-0.32]
9 In general, we would like the best of both worlds: bounded controller size and convergence to a global optimum. [sent-18, score-0.33]
10 While achieving both is NP-hard for the class of deterministic controllers [10], one can hope for a tractable algorithm that at least avoids obvious local optima. [sent-19, score-0.279]
11 We propose a new anytime algorithm, bounded policy iteration (BPI), that improves a policy much like Hansen’s PI [7] while keeping the size of the controller fixed. [sent-20, score-0.954]
12 Whenever the algorithm gets stuck in a local optimum, the controller is allowed to slightly grow by introducing one (or a few) node(s) to escape the local optimum. [sent-21, score-0.595]
13 We also characterize BPI’s local optima (Sec. 5), relate this analysis to GA, and use it to justify a new method to escape local optima. [sent-28, score-0.23]
14 Since states are not directly observable in POMDPs, we define a belief state to be a distribution over states. [sent-33, score-0.449]
15 Policies represented by FSCs are defined by a (possibly cyclic) directed graph, where each node is labeled by an action and each edge by an observation. [sent-39, score-0.305]
16 The FSC can be viewed as a policy in which the action strategy associates each node with an action, and the observation strategy associates each node and observation with a successor node (corresponding to the edge from that node labeled with that observation). [sent-41, score-1.139]
17 A policy is executed by taking the action associated with the “current node,” and updating the current node by following the edge labeled by the observation made. [sent-42, score-0.557]
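As a concrete illustration of the execution loop just described, the following is a minimal sketch of a deterministic FSC; the strategy dictionaries and the environment interface (env.step) are hypothetical placeholders and not part of the paper.

```python
class FiniteStateController:
    """Deterministic FSC: psi maps node -> action, eta maps (node, observation) -> successor node."""

    def __init__(self, psi, eta, start_node):
        self.psi = psi          # action strategy
        self.eta = eta          # observation strategy
        self.node = start_node  # current node

    def act(self):
        # Take the action associated with the current node.
        return self.psi[self.node]

    def observe(self, observation):
        # Follow the edge labeled by the observation to update the current node.
        self.node = self.eta[(self.node, observation)]

def run_episode(controller, env, horizon=100):
    """Execute the FSC policy: act, observe, update the current node."""
    total_reward = 0.0
    for _ in range(horizon):
        action = controller.act()
        observation, reward = env.step(action)  # hypothetical environment interface
        total_reward += reward
        controller.observe(observation)
    return total_reward
```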
18 The value of node n at state s is V_n(s) = R(s, a_n) + γ Σ_{s'} Pr(s'|s, a_n) Σ_z Pr(z|s', a_n) V_{η(n,z)}(s') (1), where a_n is the node’s action and η(n,z) its successor after observation z. Given an initial belief state b, an FSC’s value at node n is simply the expectation V_n(b) = Σ_s b(s) V_n(s); the best starting node for a given b is determined by argmax_n V_n(b). [sent-44, score-0.769]
19 As a result, the value of each node is linear with respect to the belief state; hence the value function of the controller is piecewise-linear and convex. [sent-45, score-0.833]
20 In Fig. 1(a), each linear segment corresponds to the value function of a node, and the upper surface of these segments forms the controller’s value function. [sent-47, score-0.613]
21 Improved values can be computed by a DP backup of the current value function: V'(b) = max_a [ Σ_s b(s) R(s,a) + γ Σ_z Pr(z|b,a) V(b_z^a) ] (2). Policy iteration (PI) [7] incrementally improves a controller by alternating between two steps, policy improvement and policy evaluation, until convergence to an optimal policy. [sent-51, score-0.929]
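The policy evaluation step can be carried out by solving the linear system implied by equation (1). The sketch below reuses the psi/eta representation from the previous sketch and assumes arrays T[s, a, s'] (transitions), O[s', a, z] (observations) and R[s, a] (rewards); these array names and shapes are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def evaluate_fsc(psi, eta, T, O, R, gamma):
    """Solve the linear system V = r + gamma * M V for the node-state values V[n, s]."""
    nodes = sorted(psi.keys())
    n_nodes, n_states = len(nodes), T.shape[0]
    n_obs = O.shape[2]
    dim = n_nodes * n_states

    M = np.zeros((dim, dim))
    r = np.zeros(dim)
    for i, n in enumerate(nodes):
        a = psi[n]
        for s in range(n_states):
            row = i * n_states + s
            r[row] = R[s, a]
            for s2 in range(n_states):
                for z in range(n_obs):
                    n2 = nodes.index(eta[(n, z)])
                    col = n2 * n_states + s2
                    M[row, col] += T[s, a, s2] * O[s2, a, z]

    V = np.linalg.solve(np.eye(dim) - gamma * M, r)
    return V.reshape(n_nodes, n_states)  # value of each node at each state

# The controller's value at a belief b is then max over nodes of b @ V[n].
```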
22 Policy improvement adds nodes to the controller by dynamic programming (DP) and removes other nodes. [sent-54, score-0.471]
23 The DP backup is performed on the value function of the current controller (shown in Fig. 2(a)) to obtain a new, improved value function (also shown in Fig. 2(a)). [sent-60, score-0.365]
24 Each linear segment of the improved value function corresponds to a new node added to the controller. [sent-62, score-0.232]
25 After the new nodes created by DP have been added, old nodes that are now pointwise dominated are removed. [sent-64, score-0.498]
26 A node is pointwise dominated when its value is less than that of some other node at all belief states. [sent-65, score-1.029]
27 The inward edges of a pointwise dominated node are re-directed to the dominating node since it offers better value. [sent-69, score-0.873]
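A pointwise-dominance check over the node value vectors is straightforward; the sketch below assumes the V[n, s] array from the evaluation sketch above and a small numerical tolerance, and merely records which nodes should be pruned and where their inward arcs should be redirected.

```python
import numpy as np

def pointwise_dominated(V, tol=1e-9):
    """Return a map pruned_node -> dominating_node for every pointwise-dominated node."""
    n_nodes = V.shape[0]
    redirect = {}
    for i in range(n_nodes):
        if i in redirect:
            continue
        for j in range(n_nodes):
            if j == i or j in redirect:
                continue  # skip self and nodes already marked for pruning (handles ties)
            if np.all(V[j] >= V[i] - tol):
                redirect[i] = j  # node i is dominated by node j at every state
                break
    return redirect
```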
28 The controller resulting from this policy improvement step is guaranteed to offer higher value at all belief states. [sent-73, score-0.91]
29 On the other hand, up to |A||N|^|Z| new nodes may be added with each DP backup, so the size of the controller quickly becomes intractable in many POMDPs. [sent-74, score-0.405]
30 PI can only prune pointwise-dominated nodes because the algorithm is designed to produce controllers with deterministic observation strategies. [sent-76, score-0.258]
31 A pointwise-dominated node can safely be pruned since its inward arcs are redirected to the dominating node (which has value at least as high as the dominated node at each state). [sent-77, score-1.196]
32 In contrast, a node jointly dominated by several nodes (as in Fig. [sent-78, score-0.515]
33 2(b)) cannot be pruned without its inward arcs being redirected to different nodes depending on the current belief state. [sent-81, score-0.876]
34 If the stochastic strategy is chosen carefully, the corresponding convex combination of dominating nodes may pointwise dominate the node we would like to prune. [sent-84, score-0.715]
35 The dotted line illustrates one convex combination of the dominating nodes that pointwise dominates the dominated node: consequently, that node can be safely removed and its inward arcs re-directed to reflect this convex combination by setting the observation probabilities accordingly. [sent-88, score-0.708]
36 In general, when a node is jointly dominated by a group of nodes, there exists a pointwise-dominating convex combination of this group. [sent-89, score-0.543]
37 Theorem 1 The value function of a node is jointly dominated by the value functions of a set of nodes if and only if there is a convex combination of those value functions that dominates it. [sent-91, score-0.776]
38 This LP finds the belief state that minimizes the difference between the value of the node in question and the max of the candidate dominating nodes’ values. [sent-99, score-0.329]
39 It turns out that the dual LP (Table 2) finds the most dominating convex combination parallel to the value function of the node in question. [sent-100, score-0.227]
40 Hence, by Thm. 1, the LP in Table 1 gives us an algorithm to find the most dominating convex combination parallel to a dominated node. [sent-103, score-0.36]
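The LP implied by Theorem 1 can be written down directly with an off-the-shelf solver. The sketch below uses SciPy's linprog and assumes V_candidates is an array whose rows are the value vectors of the candidate dominating nodes and v_target is the value vector of the node we would like to prune; it maximizes the minimum advantage δ of the convex combination over the target, so this is an illustrative formulation rather than a reproduction of the paper's Table 1.

```python
import numpy as np
from scipy.optimize import linprog

def best_convex_combination(V_candidates, v_target):
    """Maximize delta s.t. sum_i w_i * V_i(s) >= v_target(s) + delta at every state s,
    with w a probability distribution over the candidate nodes.
    Returns (w, delta); delta >= 0 means v_target is jointly dominated (cf. Thm. 1)."""
    k, n_states = V_candidates.shape
    # Decision vector x = [w_1, ..., w_k, delta]; linprog minimizes, so negate delta.
    c = np.zeros(k + 1)
    c[-1] = -1.0
    # Dominance constraints: -sum_i w_i V_i(s) + delta <= -v_target(s) for each s.
    A_ub = np.hstack([-V_candidates.T, np.ones((n_states, 1))])
    b_ub = -v_target
    # Probability-simplex constraint: sum_i w_i = 1.
    A_eq = np.hstack([np.ones((1, k)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * k + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:k], res.x[-1]
```

If delta comes back non-negative, the returned weights w give the observation probabilities used to redirect the pruned node's inward arcs.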
41 In summary, by considering stochastic controllers, we can extend PI to prune all dominated nodes (pointwise or jointly) in the policy improvement step. [sent-104, score-0.645]
42 This provides two advantages: controllers can be made smaller, and their decision quality can improve. [sent-105, score-0.234]
43 4 Bounded Policy Iteration Although pruning all dominated nodes helps to keep the controller small, it may still grow substantially with each DP backup. [sent-106, score-0.606]
44 Feng and Hansen [6] proposed pruning, after each DP backup, all nodes that dominate the value function by less than some ε. [sent-108, score-0.234]
45 Alternatively, instead of growing the controller with each backup and then pruning, we can do a partial DP backup that generates only a subset of the nodes using Cheng’s algorithm [5], the witness algorithm [9], or other heuristics [14]. [sent-109, score-0.573]
46 In order to keep the controller bounded, for each node created in a partial DP backup, one node must be pruned and its inward arcs redirected to some dominating convex combination. [sent-110, score-1.148]
47 In the event where no node is dominated, we can still prune a node and redirect its arcs to a good convex combination, but the resulting controller may have lower value at some belief states. [sent-111, score-1.208]
48 We now propose a new algorithm called bounded policy iteration (BPI) that guarantees monotonic value improvement at all belief states while keeping the number of nodes fixed. [sent-112, score-0.92]
49 BPI considers one node at a time and tries to improve it while keeping all other nodes fixed. [sent-113, score-0.373]
50 Improvement is achieved by replacing each node by a good convex combination of the nodes normally created by a DP backup, but without actually performing a backup. [sent-114, score-0.443]
51 Since the backed up value function must dominate the controller’s current value function, by Thm. [sent-115, score-0.393]
52 1 there must exist a convex combination of the backed up nodes that pointwise dominates each node of the controller. [sent-116, score-0.856]
53 Table 3: Naive LP to find a convex combination of backed up nodes that dominates a given node. [sent-130, score-0.553]
54 Table 4: Efficient LP to find a convex combination of backed up nodes that dominates a given node. [sent-136, score-0.553]
55 Note that we now use probabilistic action strategies and have extended probabilistic observation strategies to depend on the action executed. [sent-148, score-0.422]
56 To summarize, BPI alternates between policy evaluation and improvement as in regular PI, but the policy improvement step simply tries to improve each node by solving the LP in Table 4. [sent-150, score-0.867]
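The per-node improvement LP can be sketched as follows. The variable layout below (action probabilities c_a and action-conditioned successor probabilities c_{a,z,n'}) follows the structure described above, but since Table 4 itself is not reproduced in this index, the exact formulation is an assumption; it reuses the T, O, R arrays and the node values V[n, s] from the earlier evaluation sketch.

```python
import numpy as np
from scipy.optimize import linprog

def improve_node(v_old, V, T, O, R, gamma):
    """One BPI improvement step for a single node (a sketch of the efficient LP).
    v_old: current value vector of the node being improved, shape (n_states,).
    V:     value vectors of all controller nodes, shape (n_nodes, n_states).
    Returns (action_probs, obs_strategy, delta); delta > 0 means the node improved."""
    n_nodes, n_states = V.shape
    n_actions, n_obs = R.shape[1], O.shape[2]

    n_ca = n_actions                       # variables c_a
    n_cz = n_actions * n_obs * n_nodes     # variables c_{a,z,n'}
    dim = n_ca + n_cz + 1                  # ... plus delta

    def idx(a, z, n2):                     # index of c_{a,z,n'} in the decision vector
        return n_ca + (a * n_obs + z) * n_nodes + n2

    # Objective: maximize delta (linprog minimizes, so negate).
    c_obj = np.zeros(dim)
    c_obj[-1] = -1.0

    # Dominance constraints, one per state s:
    #   v_old(s) + delta <= sum_a c_a R(s,a)
    #                       + gamma * sum_{a,z,n'} c_{a,z,n'} * sum_{s'} T(s,a,s') O(s',a,z) V[n',s']
    A_ub = np.zeros((n_states, dim))
    b_ub = -v_old.copy()
    for s in range(n_states):
        A_ub[s, -1] = 1.0
        for a in range(n_actions):
            A_ub[s, a] = -R[s, a]
            for z in range(n_obs):
                for n2 in range(n_nodes):
                    coef = gamma * np.sum(T[s, a, :] * O[:, a, z] * V[n2, :])
                    A_ub[s, idx(a, z, n2)] = -coef

    # sum_a c_a = 1 and, for each (a, z), sum_{n'} c_{a,z,n'} = c_a.
    A_eq = np.zeros((1 + n_actions * n_obs, dim))
    b_eq = np.zeros(1 + n_actions * n_obs)
    A_eq[0, :n_ca] = 1.0
    b_eq[0] = 1.0
    row = 1
    for a in range(n_actions):
        for z in range(n_obs):
            A_eq[row, a] = -1.0
            for n2 in range(n_nodes):
                A_eq[row, idx(a, z, n2)] = 1.0
            row += 1

    bounds = [(0, None)] * (n_ca + n_cz) + [(None, None)]
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    action_probs = res.x[:n_ca]
    obs_strategy = res.x[n_ca:-1].reshape(n_actions, n_obs, n_nodes)
    return action_probs, obs_strategy, res.x[-1]
```

The key design point is that the backed up nodes are never enumerated: the equality constraints tying each c_{a,z,n'} block to c_a keep the program linear while implicitly ranging over all convex combinations of backed up nodes.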
57 We now characterize BPI’s local optima and propose a method to escape them. [sent-154, score-0.365]
58 Intuitively, a controller is a local optimum when each linear segment touches from below, or is tangent to, the controller’s backed up value function. [sent-158, score-0.969]
59 Theorem 2 BPI has converged to a local optimum if and only if each node’s value function is tangent to the backed up value function. [sent-160, score-0.681]
60 Proof: Since the objective function of the LP in Table 4 seeks to maximize the improvement, the resulting convex combination must be tangent to the upper surface of the backed up value function. [sent-161, score-0.749]
61 Conversely, the only time when the LP won’t be able to improve a node is when its vector is already tangent to the backed up value function. [sent-162, score-0.751]
62 Corollary 1 If GA has converged to a local optimum, then the value function of each node reachable from the initial belief state is tangent to the backed up value function. [sent-166, score-1.271]
63 Proof: GA seeks to monotonically improve a controller in the direction of steepest ascent. [sent-167, score-0.382]
64 Thus if BPI can improve a controller by finding a direction of improvement using the LP of Table 4, then GA will also find it or will find a steeper one. [sent-169, score-0.395]
65 Conversely, when a controller is a local optimum for GA, no monotonic improvement is possible in any direction. [sent-170, score-0.521]
66 Since BPI can only improve a controller by following a direction of monotonic improvement, GA’s local optima are a subset of BPI’s local optima. [sent-171, score-0.639]
67 In the proof of Corollary 1, we argued that GA’s local optima are a subset of BPI’s local optima. [sent-173, score-0.294]
68 This suggests that BPI is inferior to GA since it can be trapped by more local optima than GA. [sent-174, score-0.248]
69 However, in the next section we describe a simple technique that allows BPI to easily escape from such local optima. [sent-175, score-0.23]
70 2 Escape Technique The tangency condition characterizing local optima can be used to design an effective escape method for BPI. [sent-177, score-0.414]
71 It essentially tells us that the belief states at which this tangency occurs (the tangent belief states) are “bottlenecks” for further policy improvement. [sent-178, score-0.784]
72 If we could improve the value at the tangent belief state(s) of some node, then we could break out of the local optimum. [sent-179, score-0.62]
73 A simple method for doing so consists of a one-step lookahead search from the tangent belief states. [sent-180, score-0.506]
74 Figure 1(b) illustrates how a belief state can be reached in one step from a tangent belief state, and how the backed up value function improves that reachable belief state’s current value. [sent-181, score-1.201]
75 Thus, if we add a node to the controller that maximizes the value of the reachable belief state, its improved value can subsequently be backed up to the tangent belief state, breaking out of the local optimum. [sent-182, score-1.481]
76 Our algorithm is summarized as follows: perform a one-step lookahead search from each tangent belief state; when a reachable belief state can be improved, add a new node to the controller that maximizes that belief state’s value. [sent-183, score-1.697]
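The escape step can be sketched as a one-step lookahead from each tangent belief state. The belief update and backup below use the same T, O, R, gamma and V arrays as the earlier sketches, and the tangent beliefs are assumed to be supplied (e.g., from the LP duals, as discussed below); the function only identifies the improvable reachable beliefs, for which a maximizing node would then be added.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update after executing action a and observing z."""
    b_next = O[:, a, z] * (b @ T[:, a, :])
    total = b_next.sum()
    return b_next / total if total > 0 else None

def backed_up_value(b, V, T, O, R, gamma):
    """One-step DP backup of the controller's value function at belief b."""
    best = -np.inf
    n_actions, n_obs = R.shape[1], O.shape[2]
    for a in range(n_actions):
        val = b @ R[:, a]
        for z in range(n_obs):
            b_az = O[:, a, z] * (b @ T[:, a, :])
            p_z = b_az.sum()
            if p_z > 0:
                val += gamma * p_z * np.max(V @ (b_az / p_z))
        best = max(best, val)
    return best

def find_escape_beliefs(tangent_beliefs, V, T, O, R, gamma, tol=1e-6):
    """Return reachable beliefs whose value the backed up value function improves."""
    improvable = []
    for b in tangent_beliefs:
        for a in range(R.shape[1]):
            for z in range(O.shape[2]):
                b2 = belief_update(b, a, z, T, O)
                if b2 is None:
                    continue
                if backed_up_value(b2, V, T, O, R, gamma) > np.max(V @ b2) + tol:
                    improvable.append(b2)  # a new node maximizing the value at b2 would be added here
    return improvable
```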
77 Interestingly, when no reachable belief state can be improved, the policy must be optimal at the tangent belief states. [sent-184, score-1.172]
78 Theorem 3 If the backed up value function does not improve the value of any belief state reachable in one step from any tangent belief state, then the policy is optimal at the tangent belief states. [sent-185, score-2.019]
79 Proof: By definition, belief states for which the backed up value function provides no improvement are tangent belief states. [sent-186, score-1.146]
80 Hence, when all belief states reachable in one step are themselves tangent belief states, the set of tangent belief states is closed under every policy. [sent-187, score-1.43]
81 Since there is no possibility of improvement, the current policy must be optimal at the tangent belief states. [sent-188, score-0.73]
82 Although Thm. 3 guarantees an optimal solution only at the tangent belief states, in practice, they rarely form a proper subset of the belief space (when none of the reachable belief states can be improved). [sent-189, score-1.151]
83 Note also that the escape algorithm assumes knowledge of the tangent belief states. [sent-190, score-0.64]
84 Fortunately, the solution to the dual of the LP in Table 4 is a tangent belief state. [sent-191, score-0.508]
85 Since most commercial LP solvers return both the primal and the dual solutions, a tangent belief state is readily available for each node. [sent-192, score-0.614]
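As a hedged illustration, the dual variables of the per-state dominance constraints can be read back from the solver used in the improve_node() sketch above; the attribute names are those of recent SciPy HiGHS-based linprog results and may differ across versions.

```python
import numpy as np

def tangent_belief_from_lp(res):
    """Extract a tangent belief state from the dual of the improvement LP.
    `res` is the OptimizeResult returned by linprog in the improve_node() sketch;
    with the HiGHS methods, res.ineqlin.marginals holds one dual value per
    per-state dominance constraint (non-positive under SciPy's sign convention)."""
    duals = -np.asarray(res.ineqlin.marginals)
    total = duals.sum()
    return duals / total if total > 0 else None
```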
86 6 Experiments We report some preliminary experiments with BPI and the escape method to assess their robustness against local optima, as well as their scalability to relatively large POMDPs. [sent-194, score-0.23]
87 In a second experiment, we report the running time and decision quality of the controllers found for two large grid-world problems. [sent-197, score-0.234]
88 For the maze problem, the expected return is averaged over all 400 states since BPI tries to optimize the policy for all belief states simultaneously. [sent-204, score-0.705]
89 For comparison purposes, the expected return for the tag-avoid problem is measured at the same initial belief state used in [12] even though BPI doesn’t tailor its policy exclusively to that belief state. [sent-205, score-0.916]
90 In contrast, many point-based algorithms including PBVI [12] (which is perhaps the best such algorithm) optimize the policy for a single initial belief state, capitalizing on a hopefully small reachable belief region. [sent-206, score-0.871]
91 BPI found a compact controller with the same expected return as that achieved by PBVI with a policy composed of linear segments. [sent-207, score-0.603]
92 This suggests that most of the belief space is reachable in tag-avoid. [sent-208, score-0.366]
93 (A node’s value function may also be exactly tangent to the backed up value function, indicating that it is identical to some backed up node.) [sent-209, score-0.773]
94 We also ran BPI on the tiger-grid, hallway and hallway2 benchmark problems [12], obtained compact controllers, and measured their expected returns at the same initial belief states used in [12], but without using them to tailor the policy. [sent-210, score-0.572]
95 In contrast, PBVI achieved its expected returns with policies of linear segments tailored to those initial belief states. [sent-211, score-0.317]
96 This suggests that only a small portion of the belief space is reachable. [sent-212, score-0.253]
97 7 Conclusion We have introduced the BPI algorithm, which guarantees monotonic improvement of the value function while keeping controller size fixed. [sent-214, score-0.478]
98 An analysis of BPI’s local optima reveals that the value function of each node is tangent to the backed up value function. [sent-216, score-0.964]
99 State aggregation [2] and belief compression [13] techniques could be easily integrated with BPI to scale to problems with large state spaces. [sent-219, score-0.329]
100 Also, since stochastic GA [11, 1] can tackle model-free problems (which BPI cannot), it would be interesting to see whether tangent belief states could be computed for stochastic GA and used to design a heuristic, similar to the one proposed for BPI, to escape local optima. [sent-220, score-0.969]
wordName wordTfidf (topN-words)
[('bpi', 0.421), ('controller', 0.297), ('backed', 0.253), ('belief', 0.253), ('policy', 0.252), ('tangent', 0.225), ('controllers', 0.211), ('ga', 0.205), ('node', 0.199), ('dominated', 0.163), ('escape', 0.162), ('lp', 0.15), ('optima', 0.135), ('dp', 0.133), ('hy', 0.127), ('pointwise', 0.119), ('arcs', 0.113), ('reachable', 0.113), ('nodes', 0.108), ('inward', 0.09), ('convex', 0.085), ('backup', 0.084), ('state', 0.076), ('redirected', 0.07), ('local', 0.068), ('observable', 0.066), ('improvement', 0.066), ('fsc', 0.065), ('iy', 0.065), ('jy', 0.065), ('rhy', 0.065), ('dominating', 0.061), ('pomdps', 0.061), ('action', 0.059), ('dominate', 0.056), ('states', 0.054), ('optimum', 0.051), ('combination', 0.051), ('fscs', 0.049), ('tangency', 0.049), ('observation', 0.047), ('toronto', 0.046), ('trapped', 0.045), ('jointly', 0.045), ('pi', 0.043), ('pbvi', 0.042), ('value', 0.042), ('table', 0.041), ('dominates', 0.041), ('executing', 0.039), ('monotonic', 0.039), ('iteration', 0.039), ('maze', 0.038), ('pruning', 0.038), ('policies', 0.038), ('rewards', 0.038), ('stochastic', 0.036), ('partially', 0.035), ('keeping', 0.034), ('pruned', 0.034), ('segment', 0.033), ('bounded', 0.033), ('elicitation', 0.032), ('ihy', 0.032), ('finite', 0.032), ('primal', 0.032), ('improve', 0.032), ('dual', 0.03), ('strategies', 0.029), ('return', 0.028), ('prunes', 0.028), ('lookahead', 0.028), ('feng', 0.028), ('hansen', 0.028), ('meuleau', 0.028), ('stockholm', 0.028), ('tailor', 0.028), ('seeks', 0.027), ('improved', 0.026), ('monotonically', 0.026), ('successor', 0.026), ('associates', 0.026), ('pineau', 0.026), ('poupart', 0.026), ('iw', 0.026), ('ix', 0.026), ('lm', 0.026), ('safely', 0.026), ('expected', 0.026), ('anytime', 0.024), ('tag', 0.024), ('improves', 0.023), ('proof', 0.023), ('decision', 0.023), ('planning', 0.023), ('prune', 0.02), ('pomdp', 0.02), ('boutilier', 0.02), ('vancouver', 0.02), ('kim', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
2 0.25630203 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
3 0.19102019 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.1893198 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
5 0.17262597 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
6 0.14226778 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
7 0.13524225 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
8 0.092686452 55 nips-2003-Distributed Optimization in Adaptive Networks
9 0.090753078 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.088598125 62 nips-2003-Envelope-based Planning in Relational MDPs
11 0.084054537 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
12 0.081960119 32 nips-2003-Approximate Expectation Maximization
13 0.077485941 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
14 0.076653741 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
15 0.072994716 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
16 0.070693992 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
17 0.070383236 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
18 0.068540484 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
19 0.063981324 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
20 0.062570646 81 nips-2003-Geometric Analysis of Constrained Curves
topicId topicWeight
[(0, -0.186), (1, 0.221), (2, -0.184), (3, 0.132), (4, -0.036), (5, -0.119), (6, -0.11), (7, -0.108), (8, 0.154), (9, 0.107), (10, -0.014), (11, -0.099), (12, 0.092), (13, 0.122), (14, -0.052), (15, -0.01), (16, 0.059), (17, 0.043), (18, 0.085), (19, 0.002), (20, 0.005), (21, -0.052), (22, 0.067), (23, 0.037), (24, -0.015), (25, 0.082), (26, 0.004), (27, -0.034), (28, -0.033), (29, -0.046), (30, -0.07), (31, 0.099), (32, 0.078), (33, -0.075), (34, -0.125), (35, 0.075), (36, 0.201), (37, 0.046), (38, 0.111), (39, -0.045), (40, -0.005), (41, 0.059), (42, -0.104), (43, 0.055), (44, 0.007), (45, -0.13), (46, 0.058), (47, 0.021), (48, -0.07), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.97448242 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
2 0.7118476 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
3 0.68108904 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
Author: Georgios Theocharous, Leslie P. Kaelbling
Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
4 0.64680052 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
5 0.57369709 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
6 0.52108282 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
7 0.50543737 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
8 0.49147689 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
9 0.46568093 78 nips-2003-Gaussian Processes in Reinforcement Learning
10 0.44157374 14 nips-2003-A Nonlinear Predictive State Representation
11 0.42116234 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
12 0.41694331 55 nips-2003-Distributed Optimization in Adaptive Networks
13 0.40546578 32 nips-2003-Approximate Expectation Maximization
14 0.3791717 62 nips-2003-Envelope-based Planning in Relational MDPs
15 0.31533745 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
16 0.30956104 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
17 0.30580747 187 nips-2003-Training a Quantum Neural Network
18 0.30189836 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
19 0.29830733 68 nips-2003-Eye Movements for Reward Maximization
20 0.28231212 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
topicId topicWeight
[(0, 0.116), (11, 0.017), (30, 0.022), (35, 0.048), (49, 0.346), (53, 0.051), (69, 0.015), (71, 0.056), (76, 0.02), (85, 0.083), (91, 0.077), (99, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.8084619 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
2 0.71798247 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
Author: Anton Schwaighofer, Marian Grigoras, Volker Tresp, Clemens Hoffmann
Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile user’s position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the user’s position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network. 1
3 0.64263302 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
4 0.56641835 112 nips-2003-Learning to Find Pre-Images
Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces. 1
5 0.45840603 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
Author: Claudio Fanti, Marzia Polito, Pietro Perona
Abstract: Consider a number of moving points, where each point is attached to a joint of the human body and projected onto an image plane. Johannson showed that humans can effortlessly detect and recognize the presence of other humans from such displays. This is true even when some of the body points are missing (e.g. because of occlusion) and unrelated clutter points are added to the display. We are interested in replicating this ability in a machine. To this end, we present a labelling and detection scheme in a probabilistic framework. Our method is based on representing the joint probability density of positions and velocities of body points with a graphical model, and using Loopy Belief Propagation to calculate a likely interpretation of the scene. Furthermore, we introduce a global variable representing the body’s centroid. Experiments on one motion-captured sequence suggest that our scheme improves on the accuracy of a previous approach based on triangulated graphical models, especially when very few parts are visible. The improvement is due both to the more general graph structure we use and, more significantly, to the introduction of the centroid variable. 1
6 0.45752847 171 nips-2003-Semi-Definite Programming by Perceptron Learning
7 0.45128709 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
8 0.44709852 53 nips-2003-Discriminating Deformable Shape Classes
9 0.43512055 158 nips-2003-Policy Search by Dynamic Programming
10 0.43327329 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.43269593 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
12 0.43266797 78 nips-2003-Gaussian Processes in Reinforcement Learning
13 0.42838609 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
14 0.42167118 113 nips-2003-Learning with Local and Global Consistency
15 0.42154795 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
16 0.41473579 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.41325304 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
18 0.41186774 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
19 0.41093379 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
20 0.40962756 41 nips-2003-Boosting versus Covering