nips nips2001 nips2001-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. [sent-8, score-0.883]
2 Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. [sent-9, score-1.516]
3 These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [sent-10, score-0.975]
4 We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. [sent-12, score-0.163]
5 Such methods seek to find a good policy $\pi$ among some restricted class of policies, by following the gradient of the future reward. [sent-14, score-0.949]
6 The simple gradient update $\Delta\theta_i \propto \partial\eta/\partial\theta_i$ is dimensionally inconsistent since the left-hand side has units of $\theta_i$ and the right-hand side has units of $1/\theta_i$ (and all the $\theta_i$ do not necessarily have the same dimensions). [sent-18, score-0.091]
7 In this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy. [sent-19, score-0.618]
8 We make the connection to policy iteration by showing that the natural gradient is moving toward choosing a greedy optimal action. [sent-20, score-1.488]
9 We then analyze the performance of the natural gradient in both simple and complicated MDPs. [sent-21, score-0.636]
10 2 A Natural Gradient A finite MDP is a tuple $(S, s_0, A, R, P)$ where: $S$ is a finite set of states, $s_0$ is a start state, $A$ is a finite set of actions, $R$ is a reward function $R : S \times A \to [0, R_{\max}]$, and $P$ is the transition model. [sent-23, score-0.238]
11 The agent's decision making procedure is characterized by a stochastic policy $\pi(a; s)$, which is the probability of taking action $a$ in state $s$ (a semi-colon is used to distinguish the random variables from the parameters of the distribution). [sent-24, score-0.647]
12 We make the assumption that every policy $\pi$ is ergodic, i.e., has a well-defined stationary distribution $\rho^\pi$. [sent-25, score-0.701]
13 We consider the more difficult case where the goal of the agent is to find a policy that maximizes the average reward over some restricted class of smoothly parameterized policies, $\Pi = \{\pi_\theta : \theta \in \mathbb{R}^m\}$, where $\pi_\theta$ represents the policy $\pi(a; s, \theta)$. [sent-27, score-1.205]
14 The exact gradient of the average reward (see [8, 9]) is: $\nabla\eta(\theta) = \sum_{s,a} \rho^\pi(s)\,\nabla\pi(a; s, \theta)\,Q^\pi(s, a)$ (1), where we abuse notation by using $\eta(\theta)$ instead of $\eta(\pi_\theta)$. [sent-28, score-0.658]
15 The steepest descent direction of $\eta(\theta)$ is defined as the vector $d\theta$ that minimizes $\eta(\theta + d\theta)$ under the constraint that the squared length $|d\theta|^2$ is held to a small constant. [sent-29, score-0.379]
16 This squared length is defined with respect to some positive-definite matrix $G(\theta)$, i.e., $|d\theta|^2 \equiv \sum_{ij} G_{ij}(\theta)\,d\theta_i\,d\theta_j = d\theta^T G(\theta)\,d\theta$ (using vector notation). [sent-30, score-0.199]
17 The steepest descent direction is then given by $G^{-1}\nabla\eta(\theta)$ [1]. [sent-31, score-0.298]
18 Standard gradient descent follows the direction $\nabla\eta(\theta)$, which is the steepest descent under the assumption that $G(\theta)$ is the identity matrix, $I$. [sent-32, score-0.847]
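A minimal numpy sketch of this idea (the gradient and metric values below are placeholders, not quantities from the paper): under a metric $G$, the steepest direction is obtained by solving $G\,d\theta = \nabla\eta$ rather than using $\nabla\eta$ itself.

    import numpy as np

    def steepest_direction(grad, G=None):
        """Direction of steepest change of eta under the squared length d.T @ G @ d.

        With G = I this is the ordinary gradient; a general positive-definite G
        rescales the gradient by G^{-1}.
        """
        if G is None:
            return grad                      # standard gradient (G = I)
        return np.linalg.solve(G, grad)      # G^{-1} grad, without forming the inverse

    grad = np.array([1.0, 0.1])              # placeholder gradient of the average reward
    G = np.diag([100.0, 1.0])                # placeholder metric (e.g. badly scaled coordinates)
    print(steepest_direction(grad), steepest_direction(grad, G))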
19 However, this ad hoc choice of a metric is not necessarily appropriate. [sent-33, score-0.267]
20 As suggested by Amari [1], it is better to define a metric based not on the choice of coordinates but rather on the manifold (i.e., the surface) that these coordinates parameterize. [sent-34, score-0.349]
21 Though we slightly abuse notation by writing $\eta(\theta)$, the average reward is technically a function on the set of distributions $\{\pi_\theta : \theta \in \mathbb{R}^m\}$. [sent-36, score-0.235]
22 To each state $s$, there corresponds a probability manifold, where the distribution $\pi(a; s, \theta)$ is a point on this manifold with coordinates $\theta$. [sent-37, score-0.181]
23 As shown by Amari (see [1]), the Fisher information matrix, up to a scale, is an invariant metric on the space of the parameters of probability distributions. [sent-39, score-0.169]
24 It is invariant in the sense that it defines the same distance between two points regardless of the choice of coordinates (i.e., the parameterization) used, unlike $G = I$. [sent-40, score-0.117]
25 Since the average reward is defined on a set of these distributions, the straightforward choice we make for the metric is $F(\theta) \equiv E_{\rho^\pi(s)}[F_s(\theta)]$ (3), where the expectation is with respect to the stationary distribution of $\pi_\theta$. [sent-41, score-0.464]
26 Notice that although each $F_s$ is independent of the parameters of the MDP's transition model, the weighting by the stationary distribution introduces dependence on these parameters. [sent-42, score-0.098]
27 Intuitively, $F_s(\theta)$ measures distance on a probability manifold corresponding to state $s$, and $F(\theta)$ is the average such distance. [sent-43, score-0.182]
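To make this concrete, here is a minimal sketch that builds $F_s(\theta)$ for a softmax policy $\pi(a; s, \theta) \propto \exp(\theta^T\phi_{sa})$ and averages the per-state matrices under a stationary distribution; the features, stationary distribution, and parameter values are assumed toy inputs, not quantities from the paper's experiments.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def fisher_metric(theta, features, rho):
        """F(theta) = sum_s rho(s) E_{pi(a;s,theta)}[ grad log pi  grad log pi^T ].

        features[s] is an (A, m) array of feature vectors phi_sa for state s;
        the policy is pi(a; s, theta) proportional to exp(theta . phi_sa).
        """
        m = theta.shape[0]
        F = np.zeros((m, m))
        for s, Phi in enumerate(features):
            pi = softmax(Phi @ theta)
            mean_phi = pi @ Phi                        # E_pi[phi_sa]
            for a in range(Phi.shape[0]):
                g = Phi[a] - mean_phi                  # grad_theta log pi(a; s, theta)
                F += rho[s] * pi[a] * np.outer(g, g)
        return F

    # assumed toy problem: 2 states, 3 actions, 4 features
    rng = np.random.default_rng(0)
    features = [rng.normal(size=(3, 4)) for _ in range(2)]
    rho = np.array([0.7, 0.3])                         # assumed stationary distribution
    theta = np.zeros(4)
    print(fisher_metric(theta, features, rho))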
28 The steepest descent direction this gives is $\tilde\nabla\eta(\theta) \equiv F(\theta)^{-1}\nabla\eta(\theta)$ (4). 3 The Natural Gradient and Policy Iteration We now compare policy improvement under the natural gradient to policy iteration. [sent-44, score-1.979]
29 For an appropriate comparison, consider the case in which $Q^\pi(s, a)$ is approximated by some compatible function approximator $f^\pi(s, a; w)$ parameterized by $w$ [9, 6]. [sent-45, score-0.236]
30 This function approximator is compatible with the policy in the sense that if we use the approximations $f^\pi(s, a; w)$ in lieu of their true values to compute the gradient (equation 1), then the result would still be exact [9, 6] (and is thus a sensible choice to use in actor-critic schemes). [sent-50, score-1.194]
31 Solving for $w$ gives $w = F(\theta)^{-1}\nabla\eta(\theta)$, and the result follows from the definition of the natural gradient. [sent-61, score-0.218]
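A small numerical check of this identity, with assumed toy features, stationary distribution, and Q-values: fitting $w$ by weighted least squares against the compatible features $\psi_{sa} = \nabla\log\pi(a; s, \theta)$ recovers $F(\theta)^{-1}\nabla\eta(\theta)$.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # assumed toy problem: softmax policy pi(a;s,theta) ~ exp(theta . phi_sa)
    rng = np.random.default_rng(1)
    S, A, m = 3, 4, 5
    features = rng.normal(size=(S, A, m))      # phi_sa
    rho = np.full(S, 1.0 / S)                  # assumed stationary distribution
    Q = rng.normal(size=(S, A))                # assumed state-action values
    theta = rng.normal(size=m)

    F = np.zeros((m, m))
    grad = np.zeros(m)
    rows, targets = [], []
    for s in range(S):
        pi = softmax(features[s] @ theta)
        psi = features[s] - pi @ features[s]   # compatible features: grad log pi(a;s,theta)
        for a in range(A):
            F += rho[s] * pi[a] * np.outer(psi[a], psi[a])
            grad += rho[s] * pi[a] * psi[a] * Q[s, a]        # exact gradient, as in eq. (1)
            rows.append(np.sqrt(rho[s] * pi[a]) * psi[a])    # weighted regression row
            targets.append(np.sqrt(rho[s] * pi[a]) * Q[s, a])

    w_lstsq, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    w_natural = np.linalg.solve(F, grad)       # F(theta)^{-1} grad eta(theta)
    print(np.allclose(w_lstsq, w_natural))     # should print True: the fitted weights are the natural gradient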
32 D Thus, sensible actor-critic frameworks (those using f7r(s , a; w)) are forced to use the natural gradient as the weights of a linear function approximator. [sent-62, score-0.636]
33 If the function approximation is accurate, then good actions (i.e., those with large state-action values) have feature vectors that have a large inner product with the natural gradient. [sent-63, score-0.21]
34 2 Greedy Policy Improvement A greedy policy improvement step using our function approximator would choose action $a$ in state $s$ if $a \in \operatorname{argmax}_{a'} f^\pi(s, a'; w)$. [sent-65, score-1.057]
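In code, the greedy improvement step at a state is just an argmax of the compatible approximator's values; the weight vector and feature array below are assumed placeholders.

    import numpy as np

    def greedy_action(w, Psi_s):
        """Greedy policy improvement at one state: argmax_a f(s, a; w) = w . psi_sa.

        Psi_s is an (A, m) array whose rows are the compatible features psi_sa for
        state s; both w and Psi_s are assumed inputs in this sketch.
        """
        return int(np.argmax(Psi_s @ w))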
35 In this section, we show that the natural gradient tends to move toward this best action, rather than just a good action. [sent-66, score-0.739]
36 Let us first consider policies in the exponential family ($\pi(a; s, \theta) \propto \exp(\theta^T\phi_{sa})$ where $\phi_{sa}$ is some feature vector in $\mathbb{R}^m$). [sent-67, score-0.128]
37 The motivation for the exponential family is that it has affine geometry (i.e., the flat geometry of a plane), so a translation of a point by a tangent vector will keep the point on the manifold. [sent-68, score-0.106]
38 In general, crudely speaking, the probability manifold of $\pi(a; s, \theta)$ could be curved, so a translation of a point by a tangent vector would not necessarily keep the point on the manifold (such as on a sphere). [sent-69, score-0.339]
39 We now show, for the exponential family, that a sufficiently large step in the natural gradient direction will lead to a policy that is equivalent to a policy found after a greedy policy improvement step. [sent-71, score-2.42]
40 Since $E_{\pi(a'; s, \theta)}[\phi_{sa'}]$ is not a function of $a$, it follows that $\operatorname{argmax}_{a'} f^\pi(s, a'; w) = \operatorname{argmax}_{a'} \tilde\nabla\eta(\theta)^T\phi_{sa'}$. [sent-79, score-0.102]
41 After a gradient step, $\pi(a; s, \theta + \alpha\tilde\nabla\eta(\theta)) \propto \exp(\theta^T\phi_{sa} + \alpha\tilde\nabla\eta(\theta)^T\phi_{sa})$. [sent-80, score-0.423]
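The limiting behaviour is easy to see numerically (the feature vectors and natural gradient below are assumed values for illustration): as the step size $\alpha$ grows, the exponential-family policy concentrates on the action maximizing $\tilde\nabla\eta(\theta)^T\phi_{sa}$.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # assumed phi_sa for 3 actions
    theta = np.array([0.2, -0.1])
    nat_grad = np.array([0.3, 1.0])                        # assumed natural gradient

    for alpha in (0.0, 1.0, 10.0, 100.0):
        pi = softmax(phi @ (theta + alpha * nat_grad))
        print(alpha, np.round(pi, 3))
    # as alpha grows the policy concentrates on argmax_a nat_grad . phi_sa (action index 1 here)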
42 It is in this sense that the natural gradient tends to move toward choosing the best action. [sent-83, score-0.776]
43 It is straightforward to show that if the standard non-covariant gradient rule is used instead then $\pi_\infty(a; s)$ will select only a better action (not necessarily the best), i.e., it will choose an action $a$ such that $f^\pi(s, a; w) > E_{\pi(a'; s)}[f^\pi(s, a'; w)]$. [sent-84, score-0.995]
44 The following theorem shows that the natural gradient is locally moving toward the best action, determined by the local linear approximator for $Q^\pi(s, a)$. [sent-87, score-0.91]
45 It is interesting to note that choosing the greedy action will not in general improve the policy, and many detailed studies have gone into understanding this failure [3]. [sent-99, score-0.287]
46 However, with the overhead of a line search, we can guarantee improvement and move toward this greedy one-step improvement. [sent-100, score-0.533]
47 Initial improvement is guaranteed since F is positive definite. [sent-101, score-0.157]
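One simple way to realize the line-search variant is a backtracking search along the natural gradient; the sketch below assumes access to some evaluator eta(theta) of the average reward (e.g. estimated by simulation), which is an assumption and not specified here.

    import numpy as np

    def line_search_step(theta, nat_grad, eta, alpha0=1.0, shrink=0.5, max_tries=20):
        """Backtracking line search along the natural gradient direction.

        eta(theta) is an assumed callable returning (an estimate of) the average
        reward of pi_theta. Because F is positive definite, a small enough step
        improves eta, so the loop terminates with an improving step whenever the
        gradient is nonzero (up to estimation noise).
        """
        base = eta(theta)
        alpha = alpha0
        for _ in range(max_tries):
            candidate = theta + alpha * nat_grad
            if eta(candidate) > base:
                return candidate
            alpha *= shrink
        return theta   # no improving step found within max_tries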
48 4 Metrics and Curvatures Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F. [sent-102, score-0.176]
49 In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient [1], i.e., it attains the Cramér-Rao bound. [sent-103, score-0.257]
50 Our situation is more similar to the blind source separation case where a metric is chosen based on the underlying parameter space [1] (of non-singular matrices) and is not necessarily asymptotically efficient (i.e., does not attain second order convergence). [sent-104, score-0.496]
51 As argued by Mackay [7], one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, Mackay [7] arrives at the same result as Amari for the blind source separation case. [sent-105, score-0.296]
52 Although the previous sections argued that our choice is appropriate, we would like to understand how $F$ relates to the Hessian $\nabla^2\eta(\theta)$, which, as shown in [5], has the form given in equation (6); unfortunately, all terms in this Hessian are data-dependent (i.e., they are coupled to state-action values). [sent-106, score-0.313]
53 However, the $Q$-values weight this curvature of our policy, and our metric neglects such weighting. [sent-109, score-0.66]
54 Similar to the blind source separation case, our metric clearly does not necessarily converge to the Hessian and so it is not necessarily asymptotically efficient (i.e., does not attain a second order convergence rate). [sent-110, score-0.621]
55 Conjugate methods would be expected to be more efficient near a local maximum. [sent-112, score-0.094]
56 5 Experiments We first look at the performance of the natural gradient in a few simple MDPs before examining its performance in the more challenging MDP of Tetris. [sent-113, score-0.73]
57 We simulated the natural policy gradient in a simple 1-dimensional linear quadratic regulator with dynamics x(t + 1) = . [sent-118, score-1.071]
58 The parameterized policy used was $\pi(u; x, \theta) \propto \exp(\theta_1 x^2 + \theta_2 x)$. [sent-121, score-0.604]
59 Figure 1A shows the performance improvement when the units of the parameters are scaled by a factor of 10 (see figure text). [sent-122, score-0.175]
60 The policy used was $\pi(u; x, \theta) \propto \exp(\theta_1 s_1 x^2 + \theta_2 s_2 x)$ where the rescaling constants, $s_1$ and $s_2$, are shown in the legend. [sent-149, score-0.529]
61 The right-most three curves are generated using the standard gradient method and the rest use the natural gradient. [sent-151, score-0.624]
62 time (on a $10^7$ scale) of a policy under standard gradient descent using the sigmoidal policy parameterization ($\pi(1; s, \theta_i) \propto \exp(\theta_i)/(1 + \exp(\theta_i))$), with the initial conditions $\pi(i, 1)$ = . [sent-154, score-1.643]
63 time (unscaled) under standard gradient descent (solid line) and natural gradient descent (dashed line) for an early window of the above plot. [sent-158, score-1.299]
64 D) Phase space plot for the standard gradient case (the solid line) and the natural gradient case (dashed line). [sent-159, score-1.047]
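The rescaling experiment reflects a covariance property that can be checked directly (with placeholder gradient and metric values; nothing below comes from the paper's LQR setup): under a linear reparameterization $\theta' = S\theta$ the gradient transforms as $S^{-T}\nabla\eta$ and a Fisher-type metric as $S^{-T} F S^{-1}$, so the natural-gradient parameter update transforms consistently while the plain gradient update does not.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 2
    grad = rng.normal(size=m)                  # placeholder gradient in the original coordinates
    rand_mat = rng.normal(size=(m, m))
    F = rand_mat @ rand_mat.T + np.eye(m)      # placeholder positive-definite Fisher matrix
    S = np.diag([10.0, 1.0])                   # rescale the units of theta_1 by a factor of 10

    # the same quantities expressed in the rescaled coordinates theta' = S theta
    Sinv = np.linalg.inv(S)
    grad_p = Sinv.T @ grad
    F_p = Sinv.T @ F @ Sinv

    nat = np.linalg.solve(F, grad)             # natural update in the original coordinates
    nat_p = np.linalg.solve(F_p, grad_p)       # natural update in the rescaled coordinates
    print(np.allclose(nat_p, S @ nat))         # should print True: the natural update is covariant
    print(np.allclose(grad_p, S @ grad))       # should print False: the plain gradient update is not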
65 This is because $F$ is not an invariant metric due to the weighting by $\rho(s)$. [sent-162, score-0.206]
66 Using a sigmoidal policy parameterization (see figure text) and initial conditions corresponding to $\rho(i) = .2$, [sent-165, score-0.569]
67 both self-loop action probabilities will initially be increased under a gradient rule (since one-step policy improvement chooses the self-loop for each state). [sent-167, score-1.212]
68 Since the standard gradient weights the learning to each parameter by $\rho(s)$ (see equation 1), the self-loop action at state $i$ is increased faster than the self-loop probability at $j$, which has the effect of decreasing the effective learning-rate to state $j$ even further. [sent-168, score-0.719]
69 This leads to an extremely flat plateau with average reward 1 (shown in Figure 1C top), where the learning for state $j$ is thwarted by its low stationary probability. [sent-169, score-0.348]
70 This problem is so severe that before the optimal policy is reached $\rho(j)$ drops as low as $10^{-7}$ from its initial value of . [sent-170, score-0.56]
71 Figure 1C bottom shows the performance of the natural gradient (in a very early time window of Figure 1C top). [sent-172, score-0.636]
72 Not only is the time to the optimal policy decreased by a factor of $10^7$, the stationary distribution of state $i$ never drops below . [sent-173, score-0.638]
73 Note, though, that the standard gradient does increase the average reward faster at the start, but only to be seduced into sticking at state $i$. [sent-175, score-0.75]
74 In general, if a table lookup Boltzmann policy is used (i.e., $\pi(a; s, \theta) \propto \exp(\theta_{sa})$), it is straightforward to show that the natural gradient weights the components of $\nabla\eta$ uniformly (instead of using $\rho(s)$), thus evening out the learning to all parameters. [sent-177, score-1.208]
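For the table-lookup Boltzmann case this cancellation is easy to verify numerically. In the sketch below, the per-state Fisher block is $\rho(s)(\mathrm{diag}(\pi_s) - \pi_s\pi_s^T)$ and the per-state gradient block is $\rho(s)\,\pi_s\odot(Q_s - \bar V_s)$; the policy, Q-values, and $\rho(s)$ values are assumed, and a pseudo-inverse handles the fact that each block is singular along the all-ones direction.

    import numpy as np

    rng = np.random.default_rng(3)
    A = 3
    pi_s = rng.dirichlet(np.ones(A))           # assumed pi(.|s) for one state
    Q_s = rng.normal(size=A)                   # assumed state-action values at s

    M = np.diag(pi_s) - np.outer(pi_s, pi_s)   # per-state Fisher block, before the rho(s) weight
    v = pi_s * (Q_s - pi_s @ Q_s)              # per-state gradient block, before the rho(s) weight

    for rho_s in (0.9, 0.01):
        F_s = rho_s * M                        # the stationary weighting enters the metric ...
        g_s = rho_s * v                        # ... and the gradient block in the same way,
        nat_s = np.linalg.pinv(F_s) @ g_s      # so it cancels in the natural gradient
        std_s = g_s                            # but not in the standard gradient
        print(rho_s, np.round(nat_s, 3), np.round(std_s, 3))
    # the natural-gradient block should be identical for both values of rho(s)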
75 The game of Tetris provides a challenging high dimensional problem. [sent-178, score-0.131]
76 As shown in [3], greedy policy iteration methods using a linear function approximator exhibit drastic performance degradation after providing impressive improvement (see [3] for a description of the game, methods , and results). [sent-179, score-1.162]
77 Tetris provides an interesting case to test gradient methods. [Figure 2, panels A-C.] [sent-181, score-0.423]
78 The top curve duplicates the same results in [3] using the same features (which were simple functions of the heights of each column and the number of holes in the game). [sent-186, score-0.12]
79 We have no explanation for this performance degradation (nor does [3]). [sent-187, score-0.094]
80 The lower curve shows the poor performance of the standard gradient rule. [sent-188, score-0.622]
81 B) The curve on the right shows the natural policy gradient method (and uses the biased gradient method of [2] though this method alone gave poor performance). [sent-189, score-1.603]
82 We found we could obtain faster improvement and higher asymptotes if the robustifying factor of $10^{-3} I$ that we added to $F$ was more carefully controlled (we did not carefully control the parameters). [sent-190, score-0.252]
83 C) Due to the intensive computational power required by these simulations, we ran the gradient in a smaller Tetris game (height of 10 rather than 20) to demonstrate that the standard gradient updates (right curve) would eventually reach the same performance as the natural gradient (left curve). [sent-191, score-1.609]
84 We consider a policy compatible with the linear function approximator used in [3] (i.e., $\pi(a; s, \theta) \propto \exp(\theta^T\phi_{sa})$ where $\phi_{sa}$ are the same feature vectors). [sent-193, score-0.681]
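A single natural-gradient update for such a feature-based softmax policy might look like the sketch below; the batch of sampled states, the Q estimates, the learning rate, and the exact handling of the $10^{-3} I$ robustifying term are all assumptions made for illustration, not a description of the experiments.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def natural_gradient_update(theta, batch, lr=0.5, ridge=1e-3):
        """One natural policy gradient step for pi(a;s,theta) ~ exp(theta . phi_sa).

        batch is a list of (Phi_s, a, q) tuples: the (A, m) feature array at a
        sampled state, the action taken there, and an estimate of Q(s, a). How
        these samples and Q estimates are obtained is outside this sketch.
        """
        m = theta.shape[0]
        F = np.zeros((m, m))
        grad = np.zeros(m)
        for Phi_s, a, q in batch:
            pi = softmax(Phi_s @ theta)
            mean_phi = pi @ Phi_s
            psi_a = Phi_s[a] - mean_phi                  # compatible feature for the sampled (s, a)
            grad += psi_a * q                            # score-function estimate of grad eta
            for b in range(Phi_s.shape[0]):
                g = Phi_s[b] - mean_phi
                F += pi[b] * np.outer(g, g)              # per-state Fisher contribution
        F /= len(batch)
        grad /= len(batch)
        nat_grad = np.linalg.solve(F + ridge * np.eye(m), grad)   # robustified metric, as in the text
        return theta + lr * nat_grad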
85 The lower curve in Figure 2A shows the particularly poor performance of the standard gradient method. [sent-195, score-0.622]
86 In an attempt to speed learning, we tried a variety of more sophisticated methods to no avail, such as conjugate methods, weight decay, annealing, the variance reduction method of [2], the Hessian in equation 6, etc. [sent-196, score-0.094]
87 Figure 2B shows a drastic improvement using the natural gradient (note that the timescale is linear). [sent-197, score-0.778]
88 This performance is consistent with our theoretical results in section 3, which showed that the natural gradient is moving toward the solution of a greedy policy improvement step. [sent-198, score-1.57]
89 The performance is somewhat slower than that of greedy policy iteration (left curve in Figure 2B), which is to be expected when using smaller steps. [sent-199, score-0.813]
90 However, the policy does not degrade with a gradient method. [sent-200, score-0.955]
91 Figure 2 shows that the performance of the standard gradient rule (right curve) eventually reaches the same performance as the natural gradient, in a scaled-down version of the game (see figure text). [sent-201, score-0.853]
92 6 Discussion Although gradient methods cannot make large policy changes compared to greedy policy iteration, section 3 implies that these two methods might not be that disparate, since a natural gradient method is moving toward the solution of a policy improvement step. [sent-202, score-2.994]
93 With the overhead of a line search, the methods are even more similar. [sent-203, score-0.119]
94 The benefit is that performance improvement is now guaranteed, unlike in a greedy policy iteration step. [sent-204, score-0.86]
95 It is interesting, and unfortunate, to note that $F$ does not asymptotically converge to the Hessian, so conjugate gradient methods might be more sensible asymptotically. [sent-205, score-0.651]
96 However, far from the convergence point, the Hessian is not necessarily informative, and the natural gradient could be more efficient (as demonstrated in Tetris). [sent-206, score-0.764]
97 The intuition as to why the natural gradient could be efficient far from the maximum is that it is pushing the policy toward choosing greedy optimal actions. [sent-207, score-1.427]
98 Sufficiently close to the maximum, little performance change occurs (due to the small gradient), so although conjugate methods might converge faster near the maximum, the corresponding performance change might be negligible. [sent-209, score-0.272]
99 More experimental work is necessary to further understand the effectiveness of the natural gradient. [sent-210, score-0.162]
100 Policy gradient methods for reinforcement learning with function approximation. [sent-259, score-0.5]
wordName wordTfidf (topN-words)
[('policy', 0.486), ('gradient', 0.423), ('sa', 0.239), ('natural', 0.162), ('ie', 0.154), ('reward', 0.145), ('hessian', 0.143), ('greedy', 0.143), ('metric', 0.137), ('steepest', 0.127), ('descent', 0.126), ('improvement', 0.124), ('toward', 0.122), ('tetris', 0.117), ('approximator', 0.11), ('action', 0.107), ('necessarily', 0.091), ('game', 0.088), ('compatible', 0.085), ('manifold', 0.081), ('ex', 0.077), ('curve', 0.077), ('mdp', 0.073), ('drastic', 0.069), ('exp', 0.067), ('stationary', 0.061), ('moving', 0.059), ('covariant', 0.058), ('vtj', 0.058), ('blind', 0.058), ('policies', 0.057), ('iteration', 0.056), ('definition', 0.056), ('amari', 0.055), ('efficient', 0.054), ('state', 0.054), ('conjugate', 0.054), ('performance', 0.051), ('argmax', 0.051), ('sensible', 0.051), ('crudely', 0.051), ('evening', 0.051), ('asymptotically', 0.049), ('actions', 0.048), ('mdps', 0.048), ('parameterization', 0.048), ('average', 0.047), ('sham', 0.046), ('degrade', 0.046), ('coordinates', 0.046), ('height', 0.045), ('direction', 0.045), ('squared', 0.045), ('heights', 0.043), ('abuse', 0.043), ('argmaxa', 0.043), ('carefully', 0.043), ('degradation', 0.043), ('rescaling', 0.043), ('challenging', 0.043), ('faster', 0.042), ('parameterized', 0.041), ('plateau', 0.041), ('attain', 0.041), ('fs', 0.041), ('ot', 0.041), ('overhead', 0.041), ('methods', 0.04), ('choice', 0.039), ('rule', 0.039), ('family', 0.039), ('fisher', 0.039), ('standard', 0.039), ('line', 0.038), ('reinforcement', 0.037), ('weighting', 0.037), ('curvature', 0.037), ('drops', 0.037), ('severe', 0.037), ('choosing', 0.037), ('minimizes', 0.036), ('argued', 0.035), ('sigmoidal', 0.035), ('lc', 0.035), ('tangent', 0.035), ('separation', 0.035), ('straightforward', 0.035), ('theorem', 0.034), ('converge', 0.034), ('gatsby', 0.034), ('guaranteed', 0.033), ('dayan', 0.033), ('step', 0.033), ('exponential', 0.032), ('mackay', 0.032), ('poor', 0.032), ('invariant', 0.032), ('move', 0.032), ('source', 0.031), ('finite', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
2 0.31928951 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.27806839 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1
5 0.26602462 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
6 0.23119552 40 nips-2001-Batch Value Function Approximation via Support Vectors
7 0.20782547 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.19506894 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
9 0.18813409 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
10 0.16524795 128 nips-2001-Multiagent Planning with Factored MDPs
11 0.16067512 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
12 0.13794161 36 nips-2001-Approximate Dynamic Programming via Linear Programming
13 0.12424953 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
14 0.10810811 8 nips-2001-A General Greedy Approximation Algorithm with Applications
15 0.10419775 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
16 0.093764707 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
17 0.091767386 126 nips-2001-Motivated Reinforcement Learning
18 0.091220766 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
19 0.08615867 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
20 0.086075939 139 nips-2001-Online Learning with Kernels
topicId topicWeight
[(0, -0.279), (1, -0.133), (2, 0.483), (3, 0.018), (4, 0.075), (5, 0.138), (6, -0.026), (7, -0.071), (8, 0.096), (9, -0.036), (10, 0.033), (11, -0.033), (12, -0.004), (13, -0.06), (14, 0.011), (15, -0.019), (16, -0.022), (17, -0.01), (18, -0.084), (19, 0.06), (20, -0.004), (21, 0.05), (22, 0.043), (23, -0.052), (24, 0.084), (25, -0.059), (26, -0.023), (27, 0.037), (28, -0.046), (29, 0.141), (30, 0.014), (31, 0.013), (32, -0.03), (33, -0.031), (34, -0.002), (35, 0.007), (36, -0.019), (37, 0.054), (38, -0.038), (39, -0.071), (40, -0.059), (41, -0.009), (42, 0.033), (43, 0.028), (44, 0.078), (45, 0.094), (46, 0.152), (47, 0.039), (48, -0.052), (49, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.98305869 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
2 0.89913321 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.79949731 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1
5 0.77606553 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
6 0.74525142 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
7 0.70277739 59 nips-2001-Direct value-approximation for factored MDPs
8 0.68226403 40 nips-2001-Batch Value Function Approximation via Support Vectors
9 0.66865247 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.58564425 128 nips-2001-Multiagent Planning with Factored MDPs
11 0.55696416 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.54230011 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
13 0.51079875 177 nips-2001-Switch Packet Arbitration via Queue-Learning
14 0.46908706 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
15 0.39804369 148 nips-2001-Predictive Representations of State
16 0.39100498 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
17 0.36965162 126 nips-2001-Motivated Reinforcement Learning
18 0.35032642 76 nips-2001-Fast Parameter Estimation Using Green's Functions
19 0.34393498 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
20 0.34378934 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
topicId topicWeight
[(14, 0.066), (17, 0.091), (19, 0.065), (20, 0.013), (27, 0.142), (30, 0.077), (38, 0.019), (46, 0.121), (59, 0.021), (72, 0.07), (79, 0.065), (83, 0.039), (91, 0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.91344529 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
2 0.85168946 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.84046125 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
5 0.83987772 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
6 0.83892274 89 nips-2001-Grouping with Bias
7 0.83746898 59 nips-2001-Direct value-approximation for factored MDPs
8 0.83745044 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
9 0.83452511 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
10 0.83352053 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
11 0.8321771 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
12 0.83149028 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
13 0.82896835 137 nips-2001-On the Convergence of Leveraging
14 0.8279435 171 nips-2001-Spectral Relaxation for K-means Clustering
15 0.82758093 8 nips-2001-A General Greedy Approximation Algorithm with Applications
16 0.82472128 56 nips-2001-Convolution Kernels for Natural Language
17 0.82466686 58 nips-2001-Covariance Kernels from Bayesian Generative Models
18 0.82439017 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
19 0.8238225 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
20 0.82355291 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes