nips nips2003 nips2003-158 knowledge-graph by maker-knowledge-mining

158 nips-2003-Policy Search by Dynamic Programming


Source: pdf

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Ng (Stanford University, Stanford, CA 94305), Sham Kakade (University of Pennsylvania, Philadelphia, PA 19104), Jeff Schneider (Carnegie Mellon University, Pittsburgh, PA 15213). Abstract: We consider the policy search approach to reinforcement learning. [sent-3, score-0.627]

2 We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. [sent-4, score-1.224]

3 We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. [sent-5, score-0.185]

4 1 Introduction: Policy search approaches to reinforcement learning represent a promising method for solving POMDPs and large MDPs. [sent-6, score-0.104]

5 In the policy search setting, we assume that we are given some class Π of policies mapping from the states to the actions, and wish to find a good policy π ∈ Π. [sent-7, score-1.357]

6 A common problem with policy search is that the search through Π can be difficult and computationally expensive, and is thus typically based on local search heuristics that do not come with any performance guarantees. [sent-8, score-0.741]

7 [5, 4]), then we can derive an efficient policy search algorithm that terminates after a polynomial number of steps. [sent-10, score-0.673]

8 We also provide non-trivial guarantees on the quality of the policies found, and demonstrate the algorithm on several problems. [sent-12, score-0.217]

9 2 Preliminaries: We consider an MDP with state space S; initial state s_0 ∈ S; action space A; state transition probabilities {P_sa(·)} (here, P_sa is the next-state distribution on taking action a in state s); and reward function R : S → ℝ, which we assume to be bounded in the interval [0, 1]. [sent-13, score-0.543]
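For concreteness, these ingredients can be collected into a small container type. This is a minimal sketch only; the field names and the dictionary representation of P_sa are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    """Finite MDP as described above: (S, s0, A, {P_sa}, R) with R(s) in [0, 1]."""
    states: List[State]
    actions: List[Action]
    s0: State                                          # initial state s_0
    P: Callable[[State, Action], Dict[State, float]]   # P(s, a) -> next-state distribution P_sa
    R: Callable[[State], float]                        # reward function, bounded in [0, 1]
```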

10 In the setting in which the goal is to optimize the sum of discounted rewards over an infinite horizon, it is well known that an optimal policy which is both Markov and stationary (i. [sent-14, score-0.693]

11 For this reason, learning approaches to infinite-horizon discounted MDPs have typically focused on searching for stationary policies (e. [sent-17, score-0.27]

12 In this work, we consider policy search in the space of non-stationary policies, and show how, with a base distribution, this allows us to derive an efficient algorithm. [sent-20, score-0.629]

13 (see, e.g., [6]) Given a non-stationary policy (π_t, π_{t+1}, ... [sent-28, score-0.523]

14 ..., π_{T−1}), define V_{π_t,...,π_{T−1}}(s) as the expected (normalized) sum of rewards attained by starting at state s with the "clock" at time t, taking one action according to π_t, taking the next action according to π_{t+1}, and so on. [sent-40, score-0.288]

15 ... = E_{s ∼ P_{sπ_t(s)}}[V_{π_{t+1},...,π_{T−1}}(s)], where the "s ∼ P_{sπ_t(s)}" subscript indicates that the expectation is with respect to s drawn from the state transition distribution P_{sπ_t(s)}. [sent-47, score-0.125]

16 In our policy search setting, we consider a restricted class of deterministic, stationary policies Π, where each π ∈ Π is a map π : S → A, and a corresponding class of non-stationary policies Π^T = {(π_0, π_1, ... [sent-48, score-1.043]

17 In the partially observed, POMDP setting, we may restrict Π to contain policies that depend only on the observable aspects of the state, in which case we obtain a class of memoryless/reactive policies. [sent-52, score-0.193]

18 Our goal is to find a non-stationary policy (π0 , π1 . [sent-53, score-0.523]

19 Informally, we think of µt as indicating to the algorithm approximately how often we think a good policy visits each state at time t. [sent-64, score-0.668]

20 Each policy πt is chosen from the stationary policy class Π. [sent-71, score-1.134]

21 π_t := arg max_{π′∈Π} E_{s∼µ_t}[V_{π′,π_{t+1},...,π_{T−1}}(s)]. In other words, we choose π_t from Π so as to maximize the expected sum of future rewards for executing actions according to the policy sequence (π_t, π_{t+1}, ... [sent-78, score-0.633]

22 ..., π_{T−1}) when starting from a random initial state s drawn from the baseline distribution µ_t. [sent-81, score-0.192]
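The backward sweep described in sentences 20-22 can be sketched directly in code. The following is a minimal, hypothetical Python sketch, assuming a finite candidate policy class and access to helpers sample_baseline(t) (draws a state from µ_t) and rollout_value(s, t, policies) (a Monte Carlo estimate of V_{π_t,...,π_{T−1}}(s)); these names are illustrative, not from the paper.

```python
def psdp(policy_class, sample_baseline, rollout_value, T, n_samples=100):
    """Policy Search by Dynamic Programming (sketch): choose pi_{T-1}, ..., pi_0 in turn."""
    policies = [None] * T                     # filled in from t = T-1 down to 0
    for t in reversed(range(T)):
        best_pi, best_value = None, float("-inf")
        for pi in policy_class:
            candidate = list(policies)
            candidate[t] = pi
            # Monte Carlo estimate of E_{s ~ mu_t}[ V_{pi, pi_{t+1}, ..., pi_{T-1}}(s) ]
            value = sum(rollout_value(sample_baseline(t), t, candidate)
                        for _ in range(n_samples)) / n_samples
            if value > best_value:
                best_pi, best_value = pi, value
        policies[t] = best_pi                 # greedy (epsilon-approximate) maximization step
    return policies
```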

23 Since µ_0, ..., µ_{T−1} provides the distribution over the state space that the algorithm is optimizing with respect to, we might hope that if a good policy tends to visit the state space in a manner comparable to this base distribution, then PSDP will return a good policy. [sent-85, score-0.882]

24 The theorem also allows for the situation where the maximization step in the algorithm (the arg max_{π′∈Π}) can be done only approximately. [sent-87, score-0.134]

25 Given a policy π = (π_0, ..., π_{T−1}), define the future state distribution µ_{π,t}(s) = Pr(s_t = s | s_0, π). [sent-93, score-0.155]

26 µπ,t (s) is the probability that we will be in state s at time t if picking actions according to π and starting from state s0 . [sent-96, score-0.252]

27 Given two sequences of distributions µ = (µ_0, ..., µ_{T−1}) and µ′ = (µ′_0, ..., µ′_{T−1}), define the average variational distance between them to be d_var(µ, µ′) ≡ (1/T) Σ_{t=0}^{T−1} Σ_{s∈S} |µ_t(s) − µ′_t(s)|. Hence, if π_ref is some policy, then d_var(µ, µ_{π_ref}) represents how much the base distribution µ differs from the future state distribution of the policy π_ref. [sent-103, score-0.957]
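As a small concrete aid, the average variational distance just defined can be computed as follows; this sketch assumes each per-step distribution is represented as a dictionary mapping states to probabilities (a representation chosen here for illustration).

```python
def average_variational_distance(mu, mu_prime):
    """d_var(mu, mu') = (1/T) * sum_{t=0}^{T-1} sum_{s in S} |mu_t(s) - mu'_t(s)|."""
    T = len(mu)
    total = 0.0
    for mu_t, mu_prime_t in zip(mu, mu_prime):
        support = set(mu_t) | set(mu_prime_t)          # union of the two supports
        total += sum(abs(mu_t.get(s, 0.0) - mu_prime_t.get(s, 0.0)) for s in support)
    return total / T
```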

28 Let π = (π_0, ..., π_{T−1}) be a non-stationary policy returned by an ε-approximate version of PSDP in which, on each step, the policy π_t found comes within ε of maximizing the value. [sent-107, score-1.046]

29 (1) Then for all π_ref ∈ Π^T we have that V_π(s_0) ≥ V_{π_ref}(s_0) − Tε − T·d_var(µ, µ_{π_ref}). [sent-117, score-0.107]
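Collecting the fragments above, the performance guarantee can be written in display form as follows (notation as reconstructed here, with V_π the value of the returned policy from the start state s_0):

```latex
\[
  \forall\, \pi_{\mathrm{ref}} \in \Pi^{T}:\qquad
  V_{\pi}(s_0) \;\ge\; V_{\pi_{\mathrm{ref}}}(s_0) \;-\; T\varepsilon
  \;-\; T\, d_{\mathrm{var}}\!\left(\mu, \mu_{\pi_{\mathrm{ref}}}\right),
  \qquad\text{where}\quad
  d_{\mathrm{var}}(\mu,\mu') \;=\; \frac{1}{T}\sum_{t=0}^{T-1}\sum_{s\in S}
  \left|\mu_t(s)-\mu'_t(s)\right|.
\]
```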

30 ... V_{...,π_{T−1}}(s)] − T·d_var(µ_{π_ref}, µ) ≤ Tε + T·d_var(µ_{π_ref}, µ), where we have used equation (1) and the fact that π_ref ∈ Π^T. [sent-170, score-0.214]

31 This theorem shows that PSDP returns a policy with performance that competes favorably against those policies πref in ΠT whose future state distributions are close to µ. [sent-172, score-0.87]

32 Hence, we expect our algorithm to provide a good policy if our prior knowledge allows us to choose a µ that is close to a future state distribution for a good policy in ΠT . [sent-173, score-1.27]

33 It is also shown in [4] that the dependence on dvar is tight in the worst case. [sent-174, score-0.107]

34 It can be shown (see [6, 8]) that the ε-approximate PSDP can be implemented using a number of samples that is linear in the VC dimension of Π, polynomial in T and 1/ε, but otherwise independent of the size of the state space. [sent-176, score-0.141]

35 1 Discrete observation POMDPs: Finding memoryless policies for POMDPs represents a difficult and important problem. [sent-181, score-0.243]

36 Further, it is known that the best memoryless, stochastic, stationary policy can outperform the best memoryless, deterministic policy by an arbitrarily large amount. [sent-182, score-0.648]

37 Four natural classes of memoryless policies to consider are as follows: stationary deterministic (SD), stationary stochastic (SS), non-stationary deterministic (ND) and non-stationary stochastic (NS). [sent-185, score-0.555]

38 Let the operator opt return the value of the optimal policy in a class. [sent-186, score-0.686]

39 To see that opt(ND) = opt(NS), let µ_NS be the future distribution of an optimal policy π_NS ∈ NS. [sent-189, score-0.602]

40 After each update, the resulting policy (π_{NS,0}, π_{NS,1}, ... [sent-191, score-0.523]

41 Essentially, we can consider PSDP as sweeping through each timestep and modifying the stochastic policy to be deterministic, while never decreasing performance. [sent-198, score-0.607]

42 The potentially superior performance of non-stationary policies contrasted with stationary stochastic ones provides further justification for their use. [sent-200, score-0.292]

43 Furthermore, the last inequality suggests that only considering deterministic policies is sufficient in the non-stationary regime. [sent-201, score-0.23]

44 Unfortunately, one can show that it is NP-hard to exactly or approximately find the best policy in any of these classes (this was shown for SD in [7]). [sent-202, score-0.523]

45 While many search heuristics have been proposed, we now show that PSDP offers a viable, computationally tractable alternative for finding a good policy for POMDPs, one which offers performance guarantees in the form of Theorem 1. [sent-203, score-0.671]

46 Proposition 2 (PSDP complexity) For any POMDP, exact PSDP (ε = 0) runs in time polynomial in the size of the state and observation spaces and in the horizon time T . [sent-204, score-0.171]

47 Under PSDP, the policy update is as follows: π_t(o) = arg max_a E_{s∼µ_t}[p(o|s) · V_{a,π_{t+1},...,π_{T−1}}(s)],  (2) [sent-205, score-0.587]

48 where p(o|s) is the observation probability of the POMDP and the policy sequence (a, π_{t+1}, ... [sent-208, score-0.523]
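The update in equation (2) only needs expectations over µ_t, so it can be approximated by sampling, as the Monte Carlo discussion below suggests. A minimal illustrative sketch follows; sample_mu_t, obs_prob, and rollout_value are hypothetical helpers (a state sampler for µ_t, the observation model p(o|s), and a rollout estimate of V_{a,π_{t+1},...,π_{T−1}}(s)), not names from the paper.

```python
def psdp_pomdp_update(actions, observations, sample_mu_t, obs_prob,
                      rollout_value, n_samples=1000):
    """Monte Carlo version of pi_t(o) = argmax_a E_{s ~ mu_t}[ p(o|s) * V_{a, pi_{t+1}, ...}(s) ]."""
    scores = {o: {a: 0.0 for a in actions} for o in observations}
    for _ in range(n_samples):
        s = sample_mu_t()                       # draw s ~ mu_t
        for o in observations:
            w = obs_prob(o, s)                  # weight p(o|s)
            if w == 0.0:
                continue
            for a in actions:
                scores[o][a] += w * rollout_value(a, s)
    # Memoryless reactive policy: map each observation to its highest-scoring action.
    return {o: max(actions, key=lambda a: scores[o][a]) for o in observations}
```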

49 It is clear that given the policies from time t + 1 onwards, Va,πt+1 . [sent-212, score-0.173]

50 Ideally, it is the distribution provided by an optimal policy for ND that optimally specifies this tradeoff. [sent-217, score-0.572]

51 This result does not contradict the NP-hardness results, because it requires that a good baseline distribution µ be provided to the algorithm. [sent-218, score-0.095]

52 However, if µ is the future state distribution of the optimal policy in ND, then PSDP returns an optimal policy for this class in polynomial time. [sent-219, score-1.31]

53 Furthermore, if the state space is too large to perform the exact update in equation (2), then Monte Carlo integration may be used to evaluate the expectation over the state space. [sent-220, score-0.2]

54 This leads to an ε-approximate version of PSDP, where one can obtain an algorithm with no dependence on the size of the state space and a polynomial dependence on the number of observations, T, and 1/ε (see discussion in [4]). [sent-221, score-0.162]

55 ) ˜ If the policy πt is greedy with respect to the action value Va,πt+1 . [sent-239, score-0.582]

56 ,πT −1 (s) then it follows immediately from Theorem 1 that our policy value differs from the optimal one by 2T plus the µ dependent variational penalty term. [sent-242, score-0.547]

57 It is important to note that this error is phrased in terms of an average error over the state space, as opposed to the worst-case errors over the state space that are more standard in RL. [sent-243, score-0.1]

58 PSDP, however, as it does not use value function backups, cannot make this same error; the use of the computed policies in the future keeps it honest. [sent-245, score-0.203]

59 3 Linear policy MDPs: We now examine in detail a particular policy search example in which we have a two-action MDP, and a linear policy class is used. [sent-248, score-0.523]

60 If the objective E_{s∼µ_t}[V_{π,π_{t+1},...,π_{T−1}}(s)] (from the maximization step in the algorithm) can be nearly maximized by some linear policy π, then a good approximation to π can be found. [sent-252, score-0.598]

61 Here, φ(s) ∈ R^n is a vector of features of the state s. [sent-254, score-0.1]

62 Using m_2 Monte Carlo samples, it determines whether action a_1 or action a_2 is preferable from that state, and creates a "label" y^(i) for that state accordingly. [sent-272, score-0.218]

63 The final maximization step can be approximated via a call to any standard supervised learning algorithm that tries to find linear decision boundaries, such as a support vector machine or logistic regression. [sent-275, score-0.101]
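Putting sentences 61-63 together, one PSDP step for the two-action linear case reduces to fitting a linear classifier on Monte Carlo labels. The sketch below uses scikit-learn's LogisticRegression as the supervised learner; the helpers sample_mu_t, featurize, and rollout_value, and the sample sizes m1 and m2, are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_policy_step(sample_mu_t, featurize, rollout_value, m1=200, m2=50):
    """One PSDP maximization step for a two-action MDP, reduced to linear classification."""
    X, y = [], []
    for _ in range(m1):                                # states sampled from mu_t
        s = sample_mu_t()
        # m2 Monte Carlo rollouts per action decide which action looks preferable from s.
        v0 = np.mean([rollout_value(0, s) for _ in range(m2)])
        v1 = np.mean([rollout_value(1, s) for _ in range(m2)])
        X.append(featurize(s))                         # phi(s) in R^n
        y.append(int(v1 > v0))                         # "label" y^(i) for the preferred action
    # Assumes both labels occur; any linear classifier (e.g., an SVM) could be used instead.
    clf = LogisticRegression().fit(np.array(X), np.array(y))
    return lambda s: int(clf.predict(np.asarray(featurize(s)).reshape(1, -1))[0])
```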

64 Specifically, if θ* = arg min_θ T(θ), then [1] presents a polynomial-time algorithm that finds θ so that T(θ) ≤ (n+1)·T(θ*). [sent-283, score-0.1]

65 Therefore, if there is a linear policy that does well, we also find a policy that does well. [sent-285, score-1.046]

66 (Conversely, if there is no linear policy that does well, i.e., [sent-286, score-0.523]

67 if T(θ*) above were large, then the bound would be very loose; however, in this setting there is no good linear policy, and hence we arguably should not be using a linear policy anyway or should consider adding more features. [sent-288, score-0.547]

68 1 POMDP gridworld example: Here we apply PSDP to some simple maze POMDPs (Figure (5. [sent-291, score-0.113]

69 In each, the robot can move in any of the 4 cardinal directions. [sent-293, score-0.105]

70 1c), the observation at each grid-cell is simply the directions in which the robot can freely move. [sent-295, score-0.105]

71 The robot here is confounded by all the middle states appearing the same, and the optimal stochastic policy must take time at least quadratic in the length of the hallway to ensure it gets to the goal from both sides. [sent-299, score-0.763]

72 PSDP deduces a non-stationary deterministic policy with much better performance: first clear the left half of the maze by always traveling right, and then the right half by always traveling left. [sent-300, score-0.862]

73 When one allows non-stationary policies, however, solutions do exist: PSDP provides a policy with 55 total steps to goal. [sent-303, score-0.543]

74 The next column lists the performance of iterating PSDP, starting initially with a uniform baseline µ and then computing with a new baseline µ based on the previously constructed policy. [sent-308, score-0.153]

75 Column 3 corresponds to the optimal stationary deterministic policy. (Footnote 2: It can be shown that this procedure of refining µ based on previously learned policies will never decrease performance.) [sent-309, score-0.322]

76 The final column gives the best theoretically achievable performance given arbitrary memory. [sent-310, score-0.563]

77 [Table: steps to goal on the three mazes] µ uniform: 21, 55, 412; µ iterated: 21, 48, 412; Optimal SD: ∞, ∞, 416; Optimal: 18, 39, ≥408. Robot walking: Our work is related in spirit to Atkeson and Morimoto [2], which describes a differential dynamic programming (DDP) algorithm that learns quadratic value functions along trajectories. [sent-313, score-0.137]

78 A central difference is their use of the value function backups as opposed to policy backups. [sent-315, score-0.549]

79 [2] considers a planar biped robot that walks along a bar. [sent-317, score-0.192]

80 The robot has two legs and a motor that applies torque where they meet. [sent-318, score-0.161]

81 As the robot lacks knees, it walks by essentially brachiating (upside-down); a simple mechanism grabs the bar as a foot swings into position. [sent-319, score-0.218]

82 The robot (excluding its horizontal position along the bar) can be described in a 5-dimensional state space using the angles and angular velocities measured from the foot grasping the bar. [sent-320, score-0.365]

83 In [2], significant manual “cost-function engineering” or “shaping” of the rewards was used to achieve walking at fixed speed. [sent-322, score-0.094]

84 As this limitation does not apply to our algorithm, we used a cost function that rewards the robot for each timestep it remains upright. [sent-325, score-0.207]

85 Samples of µ are generated in the same way [2] generates initial trajectories, using a parametric policy search. [sent-328, score-0.523]

86 For our policy we approximated the action-value function with a locally-weighted linear regression. [sent-329, score-0.523]

87 PSDP's policy significantly improves performance over the parametric policy search; while both keep the robot walking, we note that PSDP incurs 31% less cost per step. [sent-330, score-1.216]

88 DDP makes strong, perhaps unrealistic assumptions about the observability of state variables. [sent-331, score-0.1]

89 PSDP, in contrast, can learn policies with limited observability. [sent-332, score-0.173]

90 By hiding state variables from the algorithm, this control problem demonstrates PSDP’s leveraging of nonstationarity and ability to cope with partial observability. [sent-333, score-0.136]

91 PSDP can make the robot walk without any observations; open loop control is sufficient to propel the robot, albeit at a significant reduction in performance and robustness. [sent-334, score-0.161]

92 This complex torque signal would be identical for arbitrary initial conditions— modulo sign-reversals, as the applied torque at the hip is inverted from the control signal whenever the stance foot is switched. [sent-337, score-0.23]

93 3 Double-pole balancing Our third problem, double pole balancing, is similar to the standard inverted pendulum problem, except that two unactuated poles, rather than a single one, are attached to the cart, and it is our task to simultaneously keep both of them balanced. [sent-339, score-0.096]

94 The state variables were the cart position x; the cart velocity ẋ; the two poles' angles φ_1 and φ_2; and the poles' angular velocities φ̇_1 and φ̇_2. [sent-342, score-0.271]

95 The dashed line in each indicates which foot is grasping the bar at each time. [sent-359, score-0.124]

96 We used a linear policy class Π as described previously, and φ(s) = [x, ẋ, φ_1, φ̇_1, φ_2, φ̇_2]^T. [sent-361, score-0.543]

97 By symmetry of the problem, a constant intercept term was unnecessary; leaving out an intercept enforces that if a_1 is the better action for some state s, then a_2 should be taken in the state −s. [sent-362, score-0.311]

98 (Footnote 3) The baseline distribution µ that we chose was a zero-mean multivariate Gaussian distribution over all the state variables. [sent-364, score-0.196]
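For the double-pole task, the resulting policy is linear in φ(s) with no intercept, so flipping the sign of the state flips the chosen action. A minimal sketch, with an arbitrary hypothetical weight vector purely for illustration:

```python
import numpy as np

def make_linear_policy(theta):
    """Two-action linear policy: action depends only on the sign of theta . phi(s).

    theta -- weights in R^6 matching phi(s) = [x, x_dot, phi1, phi1_dot, phi2, phi2_dot].
    No intercept term, so (up to the tie at the boundary) the action taken in -s is
    the opposite of the action taken in s.
    """
    theta = np.asarray(theta, dtype=float)
    def policy(state):
        phi = np.asarray(state, dtype=float)   # here phi(s) is just the state vector
        return 0 if theta @ phi >= 0.0 else 1  # 0/1 encode the two force directions
    return policy

# Illustrative use (weights are made up, not tuned values from the paper):
pi = make_linear_policy([0.1, 0.2, 5.0, 1.0, 5.0, 1.0])
action = pi([0.0, 0.1, 0.02, -0.01, 0.03, 0.0])
```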

99 Non-parametric representation of policies and value functions: A trajectory-based approach. [sent-381, score-0.173]

100 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-409, score-0.589]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psdp', 0.587), ('policy', 0.523), ('ref', 0.282), ('policies', 0.173), ('st', 0.146), ('opt', 0.139), ('maze', 0.113), ('dvar', 0.107), ('robot', 0.105), ('state', 0.1), ('va', 0.085), ('pomdps', 0.079), ('est', 0.07), ('memoryless', 0.07), ('pt', 0.069), ('stationary', 0.068), ('pomdp', 0.067), ('ns', 0.067), ('search', 0.066), ('action', 0.059), ('deterministic', 0.057), ('torque', 0.056), ('foot', 0.056), ('poles', 0.053), ('sham', 0.053), ('timestep', 0.053), ('hallway', 0.052), ('maximization', 0.051), ('rewards', 0.049), ('baseline', 0.046), ('walking', 0.045), ('es', 0.044), ('sd', 0.043), ('atkeson', 0.042), ('cart', 0.042), ('polynomial', 0.041), ('mdps', 0.04), ('base', 0.04), ('reinforcement', 0.038), ('arg', 0.038), ('ss', 0.037), ('andrew', 0.037), ('control', 0.036), ('biped', 0.036), ('ddp', 0.036), ('grasping', 0.036), ('pole', 0.036), ('psa', 0.036), ('ps', 0.034), ('mccallum', 0.034), ('balancing', 0.034), ('bar', 0.032), ('programming', 0.032), ('stochastic', 0.031), ('actions', 0.031), ('backed', 0.031), ('kakade', 0.031), ('horizon', 0.03), ('future', 0.03), ('logistic', 0.029), ('discounted', 0.029), ('begins', 0.028), ('traveling', 0.028), ('bagnell', 0.028), ('instantiations', 0.028), ('states', 0.028), ('monte', 0.028), ('sutton', 0.027), ('carlo', 0.027), ('inverted', 0.026), ('backups', 0.026), ('maxa', 0.026), ('intercept', 0.026), ('planar', 0.026), ('distribution', 0.025), ('visit', 0.025), ('velocities', 0.025), ('walks', 0.025), ('good', 0.024), ('theorem', 0.024), ('optimal', 0.024), ('demonstrate', 0.023), ('terminates', 0.022), ('angular', 0.022), ('accelerate', 0.022), ('starting', 0.021), ('algorithm', 0.021), ('angles', 0.021), ('penalize', 0.021), ('nd', 0.02), ('column', 0.02), ('steps', 0.02), ('dynamic', 0.02), ('class', 0.02), ('ef', 0.02), ('performance', 0.02), ('pa', 0.02), ('velocity', 0.019), ('spirit', 0.019), ('offers', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

2 0.3052043 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

3 0.26923925 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

4 0.19102019 42 nips-2003-Bounded Finite State Controllers

Author: Pascal Poupart, Craig Boutilier

Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).

5 0.16991568 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

Author: Arnab Nilim, Laurent El Ghaoui

Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.

6 0.16095364 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

7 0.15252694 55 nips-2003-Distributed Optimization in Adaptive Networks

8 0.14301601 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

9 0.13624336 62 nips-2003-Envelope-based Planning in Relational MDPs

10 0.13114631 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

11 0.1097085 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

12 0.10181134 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

13 0.094396912 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

14 0.087515332 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

15 0.086980045 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs

16 0.071173742 68 nips-2003-Eye Movements for Reward Maximization

17 0.067839228 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

18 0.056770429 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

19 0.054399446 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

20 0.052015465 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.182), (1, 0.343), (2, -0.177), (3, 0.028), (4, -0.06), (5, 0.047), (6, -0.123), (7, -0.162), (8, 0.17), (9, 0.107), (10, -0.086), (11, -0.112), (12, 0.048), (13, 0.1), (14, -0.043), (15, 0.007), (16, 0.008), (17, -0.048), (18, -0.061), (19, -0.064), (20, -0.024), (21, -0.045), (22, 0.138), (23, 0.017), (24, -0.121), (25, -0.025), (26, 0.079), (27, -0.039), (28, 0.029), (29, -0.118), (30, -0.023), (31, -0.045), (32, 0.019), (33, 0.048), (34, 0.006), (35, 0.045), (36, 0.034), (37, 0.073), (38, -0.034), (39, -0.044), (40, 0.034), (41, -0.001), (42, -0.015), (43, -0.009), (44, -0.037), (45, -0.081), (46, -0.061), (47, 0.079), (48, 0.057), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96784717 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

2 0.88172537 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

3 0.8008073 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng

Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

4 0.73352748 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

5 0.6849016 62 nips-2003-Envelope-based Planning in Relational MDPs

Author: Natalia H. Gardiol, Leslie P. Kaelbling

Abstract: A mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelopebased Planning (REBP) tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and to judiciously expand its envelope as resources permit. 1

6 0.6813693 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

7 0.64350688 42 nips-2003-Bounded Finite State Controllers

8 0.61679304 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

9 0.55149758 55 nips-2003-Distributed Optimization in Adaptive Networks

10 0.45944405 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

11 0.42212477 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

12 0.39065313 68 nips-2003-Eye Movements for Reward Maximization

13 0.38561136 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality

14 0.3568317 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

15 0.33006674 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

16 0.31303748 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

17 0.29693088 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

18 0.2797375 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs

19 0.26597148 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

20 0.23600945 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.079), (11, 0.03), (22, 0.228), (29, 0.015), (30, 0.023), (35, 0.065), (53, 0.069), (71, 0.062), (76, 0.041), (85, 0.084), (91, 0.101), (99, 0.096)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81644815 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

2 0.66878253 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

Author: Michael I. Jordan, Martin J. Wainwright

Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1

3 0.64034259 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus

Author: Zach Solan, David Horn, Eytan Ruppin, Shimon Edelman

Abstract: We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children. 1

4 0.63470197 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

5 0.63246477 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

6 0.63220072 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

7 0.63042879 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

8 0.62411541 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

9 0.61356449 189 nips-2003-Tree-structured Approximations by Expectation Propagation

10 0.6108849 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

11 0.60831285 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

12 0.60706383 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

13 0.60121793 121 nips-2003-Log-Linear Models for Label Ranking

14 0.59934163 42 nips-2003-Bounded Finite State Controllers

15 0.59882373 113 nips-2003-Learning with Local and Global Consistency

16 0.59644842 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

17 0.59576082 55 nips-2003-Distributed Optimization in Adaptive Networks

18 0.59320378 30 nips-2003-Approximability of Probability Distributions

19 0.59257084 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

20 0.59238452 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition