nips nips2001 nips2001-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. [sent-6, score-0.679]
2 We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. [sent-8, score-0.214]
3 Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. [sent-10, score-0.068]
4 The result is an off-policy method which can use (or reuse) data collected from any source. [sent-12, score-0.047]
5 We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials. [sent-13, score-0.762]
6 1 Introduction Linear least squares function approximators offer many advantages in the context of reinforcement learning. [sent-14, score-0.19]
7 Our enthusiasm for this approach is inspired by the least squares temporal difference learning algorithm (LSTD) [4]. [sent-17, score-0.137]
8 Although it is initially appealing to attempt to use LSTD in the evaluation step of a policy iteration algorithm, this combination can be problematic. [sent-19, score-0.664]
9 Koller and Parr [5] present an example where the combination of LSTD style function approximation and policy iteration oscillates between two bad policies in an MDP with just 4 states. [sent-20, score-0.709]
10 This behavior is explained by the fact that linear approximation methods such as LSTD compute an approximation that is weighted by the state visitation frequencies of the policy under evaluation. [sent-21, score-0.66]
11 Further, even if this problem is overcome, a more serious difficulty is that the state value function that LSTD learns is of no use for policy improvement when a model of the process is not available. [sent-22, score-0.632]
12 First, we introduce LSQ, an algorithm that learns least squares approximations of the state-action (Q) value function, thus permitting action selection and policy improvement without a model. [sent-24, score-0.822]
13 Next we introduce LSPI which uses the results of LSQ to form an approximate policy iteration algorithm. [sent-25, score-0.685]
14 This algorithm combines the policy search efficiency of policy iteration with the data efficiency of LSTD. [sent-26, score-1.149]
15 It is completely off policy and can, in principle, use data collected from any reasonable sampling distribution. [sent-27, score-0.597]
16 We have evaluated this method on several problems, including a simulated bicycle control problem in which LSPI learns to guide the bicycle to the goal by observing a relatively small number of completely random trials. [sent-28, score-0.73]
17 We will be assuming that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor γ. [sent-32, score-0.068]
18 (If we assume that all policies are proper, our results generalize to the undiscounted case.) [sent-33, score-0.045]
19 A stationary policy π for an MDP is a mapping π : S → A, where π(s) is the action the agent takes at state s. [sent-34, score-0.645]
20 The state-action value function Q^π(s,a) is defined over all possible combinations of states and actions and indicates the expected, discounted total reward when taking action a in state s and following policy π thereafter. [sent-35, score-0.863]
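In conventional notation (a restatement for reference, writing R for the reward function and P for the transition probabilities, rather than a verbatim quote of the paper's equations), the state-action value function and the system it satisfies are:

\[
Q^{\pi}(s,a) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a,\ \pi\right],
\qquad
Q^{\pi}(s,a) \;=\; R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, Q^{\pi}\big(s', \pi(s')\big).
\]

The second equation is a system of |S||A| linear equations in the |S||A| unknown values Q^π(s,a); this is the system referred to below.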
21 For every MDP, there exists an optimal policy, π*, which maximizes the expected, discounted return of every state. [sent-41, score-0.05]
22 Policy iteration is a method of discovering this policy by iterating through a sequence of monotonically improving policies. [sent-42, score-0.637]
23 Value determination computes the state-action values for a policy by solving the above system. [sent-44, score-0.527]
24 3 Least Squares Approximation of Q Functions Policy iteration relies upon the solution of a system of linear equations to find the Q values for the current policy. [sent-47, score-0.146]
25 This is impractical for large state and action spaces. [sent-48, score-0.154]
26 In such cases we may wish to approximate Q^π with a parametric function approximator and do some form of approximate policy iteration. [sent-49, score-0.587]
27 We are interested in a set of weights that yields a fixed point in value function space, that is a value function that is invariant under one step of value determination followed by orthogonal projection to the space spanned by the basis functions. [sent-53, score-0.222]
28 " ¡ We note that this is the standard fixed point approximation method for linear value functions with the exception that the problem is formulated in terms of Q values instead of state , the solution is guaranteed to exist for all but finitely many [5]. [sent-56, score-0.136]
29 In many practical applications, such a model is not available and the value function or, more precisely, its parameters have to be learned from sampled data. [sent-59, score-0.056]
30 These sampled data are tuples of the form (s, a, r, s'), meaning that in state s, action a was taken, a reward r was received, and the resulting state was s'. [sent-60, score-0.309]
31 These data can be collected from actual (sequential) episodes or from random queries to a generative model of the MDP. [sent-61, score-0.128]
32 The sampling distribution in the summations is governed by the underlying dynamics (P) of the process, as the samples are taken directly from the MDP. [sent-67, score-0.14]
33 The matrices A and b of this linear system can be approximated from the samples, and the approximations will converge to the true solution as the number of samples grows. [sent-68, score-0.037]
34 In the more general case, where the sampling distribution is not uniform, we will compute a weighted projection, which minimizes the weighted distance in the projection step. [sent-71, score-0.097]
35 Thus, each state s is implicitly assigned a weight, and the projection minimizes the weighted sum of squared errors with respect to that weighting. [sent-72, score-0.142]
36 In LSTD, for example, the weighting is the stationary distribution of the policy under evaluation, giving high weight to frequently visited states and low weight to infrequently visited states. [sent-73, score-0.042]
37 Assume that A and b are initially zero and consider how each sample contributes to them; this observation leads to an incremental update rule for A and b. [sent-75, score-0.054]
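A minimal sketch of this incremental rule is given below; it is an illustration rather than the paper's implementation, and the helpers phi(s, a) (returning the k-dimensional basis vector) and policy(s) (returning the evaluated policy's action) are assumed names.

import numpy as np

def lsq(samples, phi, policy, k, gamma):
    # Accumulate A and b from samples collected under any sampling distribution,
    # then solve A w = b for the weights of the approximate Q-function.
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)                         # basis vector for the observed state-action pair
        f_next = phi(s_next, policy(s_next))  # basis vector for the next state under the evaluated policy
        A += np.outer(f, f - gamma * f_next)  # incremental contribution of this sample to A
        b += r * f                            # incremental contribution of this sample to b
    # A pseudo-inverse could be used instead if A happens to be singular.
    return np.linalg.solve(A, b)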
38 It is a feature of this algorithm that it can use the same set of samples to compute Q values for any policy representation that offers an action choice for each resulting state s' in the set. [sent-79, score-0.676]
39 Thus, the policy under evaluation merely determines which next-state basis values enter the update for each sample; LSQ can use every single sample available to it, no matter what policy is under evaluation. [sent-81, score-1.004]
40 We note that if a particular set of projection weights is desired, it is straightforward to reweight the samples as they are added to A and b. [sent-82, score-0.148]
41 Notice that, apart from storing the samples, LSQ requires only O(k²) space (k being the number of basis functions), independently of the size of the state and the action space. [sent-84, score-0.154]
42 LSQ includes LSTD as a special case where there is only one action available. [sent-87, score-0.103]
43 LSQ is also applicable in the case of infinite and continuous state and/or action spaces with no modification. [sent-89, score-0.154]
44 States and actions are reflected only through the basis functions of the linear approximation and the resulting value function can cover the entire state-action space with the appropriate set of continuous basis functions. [sent-90, score-0.212]
45 5 LSPI: Least Squares Policy Iteration The LSQ algorithm provides a means of learning an approximate state-action value function, Q^π, for any fixed policy π. [sent-91, score-0.568]
46 We now integrate LSQ into an approximate policy iteration algorithm. [sent-92, score-0.685]
47 Clearly, LSQ is a candidate for the value determination step. [sent-93, score-0.065]
48 The key insight is that we can achieve the policy improvement step without ever explicitly representing our policy and without any sort of model. [sent-94, score-1.045]
49 Recall that in policy improvement, the improved policy will pick, at each state, the action that maximizes the approximate Q-value. [sent-95, score-0.615]
50 For very large or continuous action spaces, explicit maximization over actions may be impractical. [sent-97, score-0.103]
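For a finite action set, this implicit improvement step can be sketched as follows (hypothetical code in the same spirit as the LSQ sketch above; the weight vector alone represents the improved policy):

import numpy as np

def greedy_policy(w, phi, actions):
    # No explicit "actor" is stored: at each state, the policy returns the
    # action whose approximate Q-value w . phi(s, a) is largest.
    def policy(s):
        return max(actions, key=lambda a: float(np.dot(w, phi(s, a))))
    return policy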
51 Since LSPI uses LSQ to compute approximate Q functions, it can use any data source for samples. [sent-99, score-0.048]
52 A single set of samples may be used for the entire optimization, or additional samples may be acquired, either through trajectories or some other scheme, for each iteration of policy iteration. [sent-100, score-0.853]
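Combining the two pieces gives a sketch of the LSPI loop over a single, reused sample set (again an illustrative assumption rather than the paper's code; lsq and greedy_policy are the hypothetical helpers sketched above):

import numpy as np

def lspi(samples, phi, actions, k, gamma, epsilon, max_iterations=20):
    w = np.zeros(k)                                   # initial policy, given as a weight vector
    for _ in range(max_iterations):
        policy = greedy_policy(w, phi, actions)       # implicit policy improvement
        w_new = lsq(samples, phi, policy, k, gamma)   # value determination via LSQ on the same samples
        if np.linalg.norm(w_new - w) < epsilon:       # stopping criterion on the change in weights
            return w_new
        w = w_new
    return w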
53 As with any approximate policy iteration algorithm, the convergence of LSPI is not guaranteed. [sent-102, score-0.685]
54 Approximate policy iteration variants are typically analyzed in terms of a value function approximation error and an action selection error [2]. [sent-103, score-0.796]
55 LSPI does not require an approximate policy representation, e.g., a policy function or “actor” architecture, removing one source of error. [sent-104, score-0.539] [sent-106, score-0.491]
57 Moreover, the direct computation of linear Q functions from any data source, including stored data, allows the use of all available data to evaluate every policy, making the problem of minimizing value function approximation error more manageable. [sent-107, score-0.105]
58 LSPI easily found the optimal policy within a few iterations using actual trajectories. [sent-109, score-0.491]
59 We also tested LSPI on the inverted pendulum problem, where the task is to balance a pendulum in the upright position by moving the cart to which it is attached. [sent-110, score-0.14]
60 Using a simple set of basis functions and samples collected from random episodes (starting in the upright position and following a purely random policy), LSPI was able to find excellent policies using a few hundred such episodes [7]. [sent-111, score-0.464]
61 Finally, we tried a bicycle balancing problem [12] in which the goal is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. [sent-112, score-0.703]
62 Initially, the bicycle’s orientation is at an angle of 90 degrees to the goal. [sent-113, score-0.037]
63 The state description is a six-dimensional vector of angles and angular velocities, where θ is the angle of the handlebar and ω is the vertical angle of the bicycle. (Footnote 1: This is the same principle that allows action selection without a model in Q-learning.) [sent-114, score-0.191]
64 To our knowledge, this is the first application of this principle in an approximate policy iteration algorithm. [sent-115, score-0.685]
65 (LSPI algorithm inputs: the number of basis functions; the basis functions themselves; the discount factor; a stopping criterion; an initial policy, given as a set of weights; and an initial set of samples, possibly empty.) [sent-116, score-0.091]
66 The remaining state component, ψ, is the angle of the bicycle to the goal. [sent-118, score-0.385]
67 The actions are the torque applied to the handlebar and the displacement of the rider, each discretized to a small set of values. [sent-119, score-0.127]
68 In our experiments, actions are restricted to be either a handlebar torque or a rider displacement (or nothing), giving a total of 5 actions (footnote 2). [sent-120, score-0.143]
69 The dynamics of the bicycle are based on the model described by Randløv and Alstrøm [12], and the simulation is advanced with a small fixed time step. [sent-122, score-0.311]
70 This block of basis functions is repeated for each of the 5 actions, giving a total of 100 basis functions and weights. [sent-125, score-0.145]
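The block-per-action construction can be sketched generically as follows (hypothetical code; state_features stands for whatever 20-dimensional vector of state features is used, which is not reproduced here): the feature vector is zero everywhere except in the block of the chosen action.

import numpy as np

def make_phi(state_features, n_actions, block_size):
    # state_features(s): a length-block_size vector computed from the state alone
    def phi(s, a):
        out = np.zeros(block_size * n_actions)
        start = a * block_size                        # a is an action index in {0, ..., n_actions - 1}
        out[start:start + block_size] = state_features(s)
        return out
    return phi

# For example, a 20-dimensional state block repeated for 5 actions
# gives the 100 basis functions and weights mentioned above.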
71 Training data were collected by initializing the bicycle to a random state around the equilibrium position and running small episodes of 20 steps each using a purely random policy. [sent-126, score-0.543]
72 We used the same data set for each run of policy iteration and usually obtained convergence in 6 or 7 iterations. [sent-128, score-0.637]
73 Successful policies usually reached the goal after riding approximately 1 km in total, which is near-optimal performance. [sent-129, score-0.097]
74 We also show an annotated set of trajectories to demonstrate the performance improvement over multiple steps of policy iteration in Figure 2(b). [sent-130, score-0.744]
75 The following design decisions influenced the performance of LSPI on this problem: as is typical with this problem, we used a shaping reward [10] for the distance to the goal. [sent-131, score-0.128]
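One way to write such a shaping term, following the potential-based form of [10] with the potential tied to the distance to the goal (the constant c and the exact form are assumptions for illustration; the precise constants used are not reproduced here):

\[
r_{\text{shaped}} \;=\; r \;+\; \gamma\,\Psi(s') \;-\; \Psi(s),
\qquad
\Psi(s) \;=\; -\,c \cdot d_{\text{goal}}(s), \quad c > 0,
\]

where d_goal(s) denotes the current distance of the bicycle from the goal.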
76 We found that when using full random trajectories, most of our sample points were not very useful; they occurred after the bicycle had already entered into a “death spiral” from which recovery was impossible. [sent-133, score-0.333]
77 This complicated our learning efforts by biasing the samples towards hopeless parts of the space, so we decided to cut off trajectories after 20 steps. [sent-134, score-0.134]
78 This created an additional problem because there was no terminating reward signal to indicate failure. [sent-135, score-0.077]
79 We approximated this with an additional shaping reward. (Footnote 2: Results are similar for the full 9-action case, but required more training data.) [sent-136, score-0.088]
80 The curves at the end of the trajectories indicate where the bicycle has looped back for a second pass through the goal. [sent-142, score-0.363]
81 Finally, we used a discount factor that seemed to yield more robust performance. [sent-146, score-0.039]
82 We admit to some slight unease about the amount of shaping and adjusting of parameters that was required to obtain good results on this problem. [sent-147, score-0.051]
83 To verify that we had not eliminated the learning problem entirely through shaping, we reran some experiments using a discount of 0. [sent-148, score-0.066]
84 In this case LSQ simply projects the immediate reward function into the column space of the basis functions. [sent-149, score-0.134]
85 If the problem were tweaked too much, acting to maximize the projected immediate reward would be sufficient to obtain good performance. [sent-150, score-0.101]
86 7 Discussion and Conclusions We have presented a new, model-free approximate policy iteration algorithm called LSPI, which is inspired by the LSTD algorithm. [sent-152, score-0.685]
87 This algorithm is able to use either a stored repository of samples or samples generated dynamically from trajectories. [sent-153, score-0.164]
88 It performs action selection and approximate policy iteration entirely in value function space, without any need for a model. [sent-154, score-0.844]
89 In contrast to other approaches to approximate policy iteration, it does not require any sort of approximate policy function. [sent-155, score-1.11]
90 Rather than using our samples to implicitly construct an approximate model using kernels, we operate entirely in value function space and use our samples directly in the value function projection step. [sent-157, score-0.34]
91 (Figure 2 reports the percentage of trials reaching the goal and annotated trajectories from the 3rd iteration.) In comparison to direct policy search methods [9, 8, 1, 13, 6], we offer the strength of policy iteration. [sent-161, score-1.192]
92 Policy search methods typically make a large number of relatively small steps of gradient-based policy updates to a parameterized policy function. [sent-162, score-1.049]
93 Our use of policy iteration generally results in a small number of very large steps directly in policy space. [sent-163, score-1.152]
94 We achieved good performance on the bicycle task using a very small number of randomly generated samples that were reused across multiple steps of policy iteration. [sent-165, score-0.908]
95 We believe that the direct approach to function approximation and data reuse taken by LSPI will make the algorithm an intuitive and easy to use first choice for many reinforcement learning tasks. [sent-167, score-0.126]
96 In future work, we plan to investigate the application of our method to multi-agent systems and the use of density estimation to control the projection weights in our function approximator. [sent-168, score-0.066]
97 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-224, score-0.512]
98 Policy invariance under reward transformations: theory and application to reward shaping. [sent-236, score-0.154]
99 Learning to drive a bicycle using reinforcement learning and shaping. [sent-249, score-0.365]
100 Policy gradient methods for reinforcement learning with function approximation. [sent-257, score-0.054]
wordName wordTfidf (topN-words)
[('policy', 0.491), ('lspi', 0.439), ('lsq', 0.366), ('bicycle', 0.311), ('lstd', 0.311), ('iteration', 0.146), ('action', 0.103), ('samples', 0.082), ('episodes', 0.081), ('reward', 0.077), ('squares', 0.076), ('mdp', 0.073), ('parr', 0.072), ('vh', 0.067), ('gw', 0.064), ('actions', 0.061), ('ru', 0.058), ('alstr', 0.055), ('randl', 0.055), ('reinforcement', 0.054), ('trajectories', 0.052), ('state', 0.051), ('shaping', 0.051), ('approximate', 0.048), ('duke', 0.048), ('collected', 0.047), ('koller', 0.046), ('policies', 0.045), ('sixteenth', 0.043), ('fd', 0.043), ('projection', 0.043), ('discount', 0.039), ('approximated', 0.037), ('angle', 0.037), ('crash', 0.037), ('durham', 0.037), ('handlebar', 0.037), ('hht', 0.037), ('hyt', 0.037), ('lagoudakis', 0.037), ('pendulum', 0.037), ('qhx', 0.037), ('sht', 0.037), ('upright', 0.037), ('visitation', 0.037), ('yht', 0.037), ('determination', 0.036), ('pg', 0.036), ('basis', 0.033), ('completely', 0.033), ('sort', 0.032), ('simulator', 0.032), ('summations', 0.032), ('morgan', 0.032), ('kaufmann', 0.031), ('improvement', 0.031), ('da', 0.031), ('ng', 0.031), ('approximations', 0.031), ('least', 0.031), ('learns', 0.03), ('hi', 0.03), ('temporal', 0.03), ('functions', 0.029), ('discounted', 0.029), ('yxw', 0.029), ('km', 0.029), ('approximators', 0.029), ('displacement', 0.029), ('nc', 0.029), ('value', 0.029), ('position', 0.029), ('entirely', 0.027), ('initially', 0.027), ('ormoneit', 0.027), ('weighted', 0.027), ('approximation', 0.027), ('sampled', 0.027), ('sampling', 0.026), ('reuse', 0.025), ('experiences', 0.025), ('ef', 0.025), ('immediate', 0.024), ('discretized', 0.024), ('steps', 0.024), ('francisco', 0.024), ('pr', 0.023), ('uenced', 0.023), ('ht', 0.023), ('wh', 0.023), ('goal', 0.023), ('weights', 0.023), ('sample', 0.022), ('relatively', 0.022), ('states', 0.022), ('visited', 0.021), ('search', 0.021), ('giving', 0.021), ('maximizes', 0.021), ('direct', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
2 0.31928951 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
3 0.25567478 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms. ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢ ¢
5 0.21477942 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1
6 0.21166278 40 nips-2001-Batch Value Function Approximation via Support Vectors
7 0.18711218 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.18134032 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
9 0.1722649 128 nips-2001-Multiagent Planning with Factored MDPs
10 0.16062079 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
11 0.12699188 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.097037748 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
13 0.078464806 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
14 0.077289678 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay
15 0.073093511 126 nips-2001-Motivated Reinforcement Learning
16 0.070798665 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
17 0.067153335 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
18 0.066504858 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
19 0.059617527 148 nips-2001-Predictive Representations of State
20 0.053196032 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
topicId topicWeight
[(0, -0.198), (1, -0.135), (2, 0.441), (3, 0.003), (4, 0.03), (5, 0.14), (6, -0.032), (7, -0.04), (8, 0.082), (9, 0.036), (10, 0.004), (11, -0.074), (12, 0.011), (13, -0.035), (14, 0.002), (15, -0.043), (16, -0.015), (17, -0.01), (18, -0.074), (19, 0.008), (20, -0.008), (21, 0.002), (22, 0.028), (23, -0.098), (24, 0.027), (25, -0.041), (26, -0.017), (27, -0.022), (28, 0.063), (29, 0.073), (30, -0.03), (31, -0.034), (32, -0.005), (33, 0.017), (34, 0.022), (35, 0.061), (36, -0.067), (37, -0.024), (38, -0.056), (39, -0.035), (40, 0.016), (41, 0.023), (42, 0.014), (43, -0.027), (44, 0.04), (45, 0.035), (46, 0.064), (47, 0.076), (48, -0.094), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.95892882 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
2 0.85845029 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
3 0.81024474 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
4 0.77264035 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the detection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms. ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢ ¢
6 0.7551769 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
7 0.75332296 40 nips-2001-Batch Value Function Approximation via Support Vectors
8 0.73529112 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
9 0.69250029 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.67444062 128 nips-2001-Multiagent Planning with Factored MDPs
11 0.65627152 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.54121566 177 nips-2001-Switch Packet Arbitration via Queue-Learning
13 0.44669601 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
14 0.36969909 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
15 0.36619845 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
16 0.34310779 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
17 0.30221692 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
18 0.30111519 126 nips-2001-Motivated Reinforcement Learning
19 0.2997472 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay
20 0.26973689 148 nips-2001-Predictive Representations of State
topicId topicWeight
[(14, 0.027), (17, 0.09), (19, 0.03), (27, 0.104), (30, 0.078), (36, 0.012), (38, 0.017), (59, 0.032), (72, 0.072), (79, 0.045), (83, 0.042), (86, 0.25), (91, 0.113)]
simIndex simValue paperId paperTitle
same-paper 1 0.79279995 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
2 0.63361156 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms. ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢ ¢
4 0.6209501 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E (A − E [A])2 , where a2 denotes a a, and a denotes the transpose of the column vector a. GPOMDP Algorithm The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation def µu(y) η, but the βη = E β η approaches the true gradient µu(y) Jβ (j) . As β → 1, def variance increases. We consider a slight modification [2]: with Jt = def ∆T = 1 T T −1 t=0 2T s=t µUt (Yt ) Jt+1 . µUt (Yt ) β s−t r(Xs ), (1) Throughout this paper the process {Xt , Yt , Ut , Xt+1 } is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X0 ∼π (ie the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution. Dependent Samples Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P ) is defined as def τ = min t > 1 : max dT V i,j Pt i , Pt j ≤ e−1 , where [P t ]i denotes the ith row of P t and dT V is the total variation distance, dT V (P, Q) = i |P (i) − Q(i)|. Theorem 1 Let M = (S, P ) be a finite ergodic Markov chain, with mixing time τ , and 2|S|e and 0 ≤ α < let π be its stationary distribution. There are constants L < exp(−1/(2τ )), which depend only on M , such that, for all f : S → R and all t, Covπ (t) ≤ Lαt Varπ (f), where Varπ (f) is the variance of f under π, and Covπ (t) is f f the auto-covariance of the process {f(Xt )}, where the process {Xt } is generated by M with initial distribution π. Hence, for some constant Ω∗ ≤ 4Lτ , Var 1 T T −1 f(Xt ) t=0 ≤ Ω∗ Varπ (f). T We call (L, τ ) the mixing constants of the Markov chain M (or of the controlled POMDP D; ie the Markov chain (S, P )). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M , Covπ (t) ≤ Rt (M )Varπ (f), f 2 t for some Rt (M ). We then show that Rt (M ) < 2|S| exp(− τ ). In fact, for a reversible chain, we can choose L = 1 and α = |λ2 |, the second largest magnitude eigenvalue of P . 2 Baseline We consider an alteration of (1), def ∆T = 1 T T −1 µUt (Yt ) (Jt+1 − Ar (Yt )) . µUt (Yt ) t=0 (2) For any baseline Ar : Y → R, it is easy to show that E [∆T ] = E [∆T ]. Thus, we select Ar to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with Jt replaced by the exact value function. Theorem 2 Suppose that D = (S, U, Y, P , ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ ), {Xt , Yt , Ut , Xt+1 } is a process generated by D with X0 ∼π ,Ar : Y → R is a baseline that is uniformly bounded by M, and J (j) has the distribution of ∞ β s r(Xt ), where the states Xt are generated by D starting in s=0 X0 = j. Then there are constants C ≤ 5B2 R(R + M) and Ω ≤ 4Lτ ln(eT ) such that Var 1 T T −1 t=0 µUt (Yt ) Ω (Jt+1 −Ar (Yt )) ≤ Varπ µUt (Yt ) T + Ω E T µu (y) (J (j) − Jβ (j)) µu (y) µu (y) (Jβ (j)−Ar (y)) µu (y) 2 + Ω +1 T C βT , (1 − β)2 where, as always, (i, y, u, j) are generated iid with i∼π, y∼ν(i), u∼µ(y) and j∼P i (u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse. 
The second term on the right hand side shows the error associated with replacing an unbiased, uncorrelated estimate of the value function with the true value function. This quantity is not dependent on the baseline. The final term on the right hand side arises from the truncation of the discounted reward— and is exponentially decreasing. We now concentrate on minimizing the variance σ 2 = Varπ r µu (y) (Jβ (j) − Ar (y)) , µu (y) (3) which the following lemma relates to the variance σ 2 without a baseline, µu (y) Jβ (j) . µu (y) σ 2 = Varπ Lemma 3 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline Ar : Y → R, and for i∼π, y∼ν(i), u∼µ(y) and j∼Pi (u), σ 2 = σ 2 + E A2 (y) E r r µu (y) µu (y) 2 y − 2Ar (y)E µu (y) µu (y) 2 Jβ (j) y . From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem. Theorem 4 For the controlled POMDP as in Lemma 3, 2 µu (y) min σ 2 = σ 2 − E E Jβ (j) y r Ar µu (y) 2 /E µu (y) µu (y) 2 y and this minimum is attained with the baseline 2 µu (y) µu (y) A∗ (y) = E r Jβ (j) , 2 µu (y) µu (y) /E y y . Furthermore, the optimal constant baseline is µu (y) µu (y) A∗ = E r 2 Jβ (j) /E µu (y) µu (y) 2 . The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function. Theorem 5 If Ar : Y → R is a baseline function, A∗ is the optimal baseline defined in r Theorem 4, and σ 2 is the variance of the corresponding estimate, then r∗ µu (y) µu (y) σ 2 = σ2 + E r r∗ 2 (Ar (y) − A∗ (y)) r 2 , where i∼π, y ∼ν(i), and u∼µ(y). Furthermore, the same result is true for the case of constant baselines, with Ar (y) replaced by an arbitrary constant baseline Ar , and A∗ (y) r replaced by A∗ , the optimum constant baseline defined in Theorem 4. r For the constant baseline Ar = E i∼π [Jβ (i)], Theorem 5 implies that σ 2 is equal to r min Ar ∈R σ2 r + E µu (y) µu (y) 2 E [Jβ (j)] − E µu (y) µu (y) 2 2 /E Jβ (j) µu (y) µu (y) 2 . Thus, its performance depends on the random variables ( µu (y)/µu (y))2 and Jβ (j); if they are nearly independent, E [Jβ ] is a good choice. 3 Fixed Value Functions: Actor-Critic Methods We consider an alteration of (1), ˜ def 1 ∆T = T T −1 t=0 µUt (Yt ) ˜ Jβ (Xt+1 ), µUt (Yt ) (4) ˜ for some fixed value function Jβ : S → R. Define ∞ def β k d(Xk , Xk+1 ) Aβ (j) = E X0 = j , k=0 def ˜ ˜ where d(i, j) = r(i) + β Jβ (j) − Jβ (i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of µu (y) ˜ Aβ (j) . β η − E ∆T = E µu (y) The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1. Theorem 6 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X0∼π, Yt∼ν(Xt ), Ut∼µ(Yt ), Xt+1∼PXt (Ut )}, E ˜ ∆T − βη 2 ≤ Ω∗ Varπ T µu (y) ˜ Jβ (j) + E µu (y) µu (y) Aβ (j) µu (y) 2 , where the second expectation is over i∼π, y∼ν(i), u∼µ(y), and j∼P i (u). ˜ If we write Jβ (j) = Jβ (j) + v(j), then by selecting v = (v(1), . . . , v(|S|)) from the right def null space of the K × |S| matrix G, where G = i,y,u πi νy (i) µu (y)Pi (u), (4) will produce an unbiased estimate of β η. An obvious example of such a v is a constant vector, (c, c, . . . , c) : c ∈ R. 
We can use this to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c, for some 0 < c ≤ R. Then Jβ (j) = c/(1 − β) and β η = 0. If we choose v = (−c/(1 − β), . . . , −c/(1 − β)) and ˜ ˜ Jβ (j) = Jβ (j) + v(j), then µµu(y) Jβ (j) = 0 for all y, u, j, and so (4) gives an unbiased u(y) estimate of β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µu (y) > 0 and µu (y) = 0, choosing ˜β (j) = Jβ (j) gives a variance J greater than zero—there is a non-zero probability event, (Xt = i, Yt = y, Ut = u, Xt+1 = j), such that µµu(y) Jβ (j) = 0. u(y) 4 Algorithms Given a parameterized class of baseline functions Ar (·, θ) : Y → R θ ∈ RL , we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB Algorithm produces an estimate ∆ S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ 2 . We omit the details (see [8]). r GDORB Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized baseline Ar . set z0 = 0 (z0 ∈ RL ), ∆0 = 0 (∆0 ∈ RL ) for all {is , ys , us , is+1 , ys+1 } generated by the POMDP do zs+1 = βzs + ∆s+1 = ∆s + end for Ar (ys ) 1 s+1 µus(ys ) µus(ys ) 2 ((Ar (ys ) − βAr (ys+1 ) − r(xs+1 )) zs+1 − ∆s ) ˜ For a parameterized class of fixed value functions {Jβ (·, θ) : S → R θ ∈ RL }, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the baseline function. The GBTE Algorithm produces an estimate ∆S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient. GBTE Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized fixed value function ˜β . J set z0 = 0 (z0 ∈ RK ), ∆A0 = 0 (∆A0 ∈ R1×L ), ∆B 0 = 0 (∆B 0 ∈ RK ), ∆C 0 = 0 (∆C 0 ∈ RK ) and ∆D0 = 0 (∆D0 ∈ RK×L ) for all {is , ys , us , is+1 , is+2 } generated by the POMDP do µ s(y ) zs+1 = βzs + µuu(yss ) s µus(ys ) ˜ µus(ys ) Jβ (is+1 ) µus(ys ) µus(ys ) ˜ Jβ (is+1 ) ∆As+1 = ∆As + 1 s+1 ∆B s+1 = ∆B s + 1 s+1 µus(ys ) ˜ µus(ys ) Jβ (is+1 ) ∆C s+1 = ∆C s + 1 s+1 ˜ ˜ r(is+1 ) + β Jβ (is+2 ) − Jβ (is+1 ) zs+1 − ∆C s ∆Ds+1 = ∆Ds + 1 s+1 µus(ys ) µus(ys ) ∆s+1 = end for Ω∗ T ∆As+1 − − ∆B s ˜ Jβ (is+1 ) Ω∗ T ∆B s+1 ∆D s+1 − ∆As − ∆D s − ∆C s+1 ∆Ds+1 5 Experiments Experimental results comparing these GPOMDP variants for a simple three state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function Jβ . Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case. 
The trained fixed value function was not optimal though, as Jβ (j) − A∗ is a valid choice of fixed value function. r The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes—given the simplicity of the system—very large. References [1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999. [2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and Systems Sciences, 2002. To appear. [3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983. [4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001. [5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001. [6] M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press, 2000. Exact Value Function—Mean Error Exact Value Function—One Standard Deviation 0.4 0.4 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain Relative Norm Difference Relative Norm Difference 0.25 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain 0.35 0.3 0.2 0.15 0.1 0.05 ¡ 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 1000 T Online—Mean Error 100000 1e+06 1e+07 Online—One Standard Deviation 1 1 GPOMDP BL-online FVF-online 0.8 Relative Norm Difference Relative Norm Difference 10000 T 0.6 0.4 0.2 0 GPOMDP BL-online FVF-online 0.8 0.6 0.4 0.2 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 T 1000 10000 100000 1e+06 1e+07 T Figure 1: Three state experiments—relative norm error ∆ est − η / η . Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of Jβ ) computed by: GPOMDP [GPOMDP-Jβ ]; with baseline Ar = [Jβ ] [BL- [Jβ ]]; with optimum baseline [BL-A∗ (y)]; with optimum constant baseline [BL-A∗ ]; with pre-trained fixed value r r function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVFonline]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω∗ /T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online. ¢ ¢ [7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990. [8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002. [9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), pages 152–160, 1997. [10] V. R. Konda and J. 
N. Tsitsiklis. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000. [11] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT, 1998. [12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Ann. Oper. Res., 27:175–211, 1991. [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge MA, 1998. ISBN 0-262-19398-1. [14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000. [15] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
5 0.61972189 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
6 0.61399925 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
7 0.61106044 36 nips-2001-Approximate Dynamic Programming via Linear Programming
8 0.60757029 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
9 0.60752207 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
10 0.60627103 89 nips-2001-Grouping with Bias
11 0.60621572 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
12 0.60579574 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
13 0.6049462 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
14 0.6047982 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
15 0.60077345 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
16 0.6003291 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
17 0.59834808 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
18 0.59834409 171 nips-2001-Spectral Relaxation for K-means Clustering
19 0.59806049 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
20 0.597983 56 nips-2001-Convolution Kernels for Natural Language