nips nips2001 nips2001-13 knowledge-graph by maker-knowledge-mining

13 nips-2001-A Natural Policy Gradient


Source: pdf

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. [sent-8, score-0.883]

2 Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. [sent-9, score-1.516]

3 These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [sent-10, score-0.975]

4 We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. [sent-12, score-0.163]

5 Such methods seek to find a good policy π among some restricted class of policies, by following the gradient of the future reward. [sent-14, score-0.949]

6 The update Δθ_i = ∂η(θ)/∂θ_i is dimensionally inconsistent, since the left-hand side has units of θ_i and the right-hand side has units of 1/θ_i (and all θ_i do not necessarily have the same dimensions). [sent-18, score-0.091]

7 In this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy. [sent-19, score-0.618]

8 We make the connection to policy iteration by showing that the natural gradient is moving toward choosing a greedy optimal action. [sent-20, score-1.488]

9 We then analyze the performance of the natural gradient in both simple and complicated MDPs. [sent-21, score-0.636]

10 2 A Natural Gradient A finite MDP is a tuple (S, s_0, A, R, P) where: S is a finite set of states, s_0 is a start state, A is a finite set of actions, R is a reward function R : S × A → [0, R_max], and P is the transition model. [sent-23, score-0.238]

11 The agent's decision making procedure is characterized by a stochastic policy π(a; s), which is the probability of taking action a in state s (a semi-colon is used to distinguish the random variables from the parameters of the distribution). [sent-24, score-0.647]

12 We make the assumption that every policy π is ergodic, i.e. it has a well-defined stationary distribution ρ^π. [sent-25, score-0.701]

13 We consider the more difficult case where the goal of the agent is to find a policy that maximizes the average reward over some restricted class of smoothly parameterized policies, Π = {π_θ : θ ∈ ℝ^m}, where π_θ represents the policy π(a; s, θ). [sent-27, score-1.205]

14 The exact gradient of the average reward (see [8, 9]) is: ∇η(θ) = Σ_{s,a} ρ^π(s) ∇π(a; s, θ) Q^π(s, a)   (1) where we abuse notation by using η(θ) instead of η(π_θ). [sent-28, score-0.658]
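
To make equation (1) concrete, the following Python sketch computes the exact average-reward policy gradient for a small tabular MDP with a softmax (Boltzmann) policy. The two-state, two-action MDP, the softmax parameterization, and all function names are illustrative assumptions, not the paper's experimental setup.

import numpy as np

# Illustrative two-state, two-action MDP (not the paper's): P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
S, A = R.shape

def softmax_policy(theta):
    """Tabular softmax policy pi(a; s, theta) with one parameter theta[s, a] per state-action pair."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def stationary_distribution(pi):
    """Stationary distribution rho^pi of the Markov chain induced by pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    evals, evecs = np.linalg.eig(P_pi.T)
    rho = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return rho / rho.sum()

def average_reward_and_Q(pi, rho):
    """Average reward eta and differential (bias) state-action values Q^pi(s, a)."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    eta = rho @ r_pi
    # Poisson equation solved via the fundamental matrix of the chain.
    V = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), rho), r_pi - eta)
    Q = R - eta + P @ V          # Q(s, a) = R(s, a) - eta + sum_s' P(s'|s, a) V(s')
    return eta, Q

def exact_policy_gradient(theta):
    """Equation (1): grad eta(theta) = sum_{s,a} rho^pi(s) grad pi(a; s, theta) Q^pi(s, a)."""
    pi = softmax_policy(theta)
    rho = stationary_distribution(pi)
    eta, Q = average_reward_and_Q(pi, rho)
    # For the tabular softmax this reduces to rho(s) * pi(a|s) * (Q(s, a) - sum_a' pi(a'|s) Q(s, a')).
    advantage = Q - (pi * Q).sum(axis=1, keepdims=True)
    return rho[:, None] * pi * advantage, eta

theta = np.zeros((S, A))
grad, eta = exact_policy_gradient(theta)
print("average reward:", eta)
print("policy gradient:\n", grad)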

15 The steepest descent direction of η(θ) is defined as the vector dθ that minimizes η(θ + dθ) under the constraint that the squared length |dθ|² is held to a small constant. [sent-29, score-0.379]

16 This squared length is defined with respect to some positive-definite matrix G(θ), i.e. |dθ|² ≡ Σ_{ij} G_ij(θ) dθ_i dθ_j = dθ^T G(θ) dθ (using vector notation). [sent-30, score-0.199]

17 The steepest descent direction is then given by G(θ)^{-1} ∇η(θ) [1]. [sent-31, score-0.298]

18 Standard gradient descent follows the direction ∇η(θ), which is the steepest descent under the assumption that G(θ) is the identity matrix, I. [sent-32, score-0.847]
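
As a small illustration of the role of G(θ), the snippet below computes the metric-corrected ascent direction G(θ)^{-1} ∇η(θ); the particular gradient vector and positive-definite metric are made-up placeholders.

import numpy as np

grad_eta = np.array([0.5, -1.0])     # placeholder for grad eta(theta)
G = np.array([[2.0, 0.3],            # placeholder positive-definite metric G(theta)
              [0.3, 0.5]])

# Steepest ascent direction under |d theta|^2 = d theta^T G(theta) d theta
# is proportional to G(theta)^{-1} grad eta(theta).
direction = np.linalg.solve(G, grad_eta)
print(direction)   # differs from grad_eta unless G is (a multiple of) the identity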

19 However, this ad hoc choice of a metric is not necessarily appropriate. [sent-33, score-0.267]

20 As suggested by Amari [1], it is better to define a metric based not on the choice of coordinates but rather on the manifold (i.e. the surface) that these coordinates parameterize. [sent-34, score-0.349]

21 Though we slightly abuse notation by writing η(θ), the average reward is technically a function on the set of distributions {π_θ : θ ∈ ℝ^m}. [sent-36, score-0.235]

22 To each state s, there corresponds a probability manifold, where the distribution π(a; s, θ) is a point on this manifold with coordinates θ. [sent-37, score-0.181]

23 As shown by Amari (see [1]), the Fisher information matrix, up to a scale, is an invariant metric on the space of the parameters of probability distributions. [sent-39, score-0.169]

24 It is invariant in the sense that it defines the same distance between two points regardless of the choice of coordinates (i.e. the parameterization) used, unlike G = I. [sent-40, score-0.117]

25 Since the average reward is defined on a set of these distributions, the straightforward choice we make for the metric is: F(θ) = E_{ρ^π(s)}[F_s(θ)]   (3) where the expectation is with respect to the stationary distribution of π_θ. [sent-41, score-0.464]

26 Notice that although each F_s is independent of the parameters of the MDP's transition model, the weighting by the stationary distribution introduces dependence on these parameters. [sent-42, score-0.098]

27 Intuitively, F_s(θ) measures distance on a probability manifold corresponding to state s, and F(θ) is the average such distance. [sent-43, score-0.182]

28 The steepest descent direction this gives is: ∇̃η(θ) ≡ F(θ)^{-1} ∇η(θ)   (4) 3 The Natural Gradient and Policy Iteration We now compare policy improvement under the natural gradient to policy iteration. [sent-44, score-1.979]
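
Before turning to policy iteration, here is a minimal sketch of equations (3) and (4) for a tabular softmax policy: each per-state Fisher matrix F_s(θ) is the second moment of ∇_θ log π(·; s, θ) under the policy, F(θ) is their ρ^π-weighted average, and the natural gradient is F(θ)^{-1} ∇η(θ). The policy probabilities, stationary distribution, and vanilla gradient below are placeholders (e.g. as produced by the sketch after equation (1)); the pseudo-inverse is used because the tabular-softmax Fisher matrix is rank-deficient.

import numpy as np

# Placeholder tabular softmax policy pi[s, a] and stationary distribution rho[s].
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
rho = np.array([0.55, 0.45])
S, A = pi.shape

def grad_log_pi(s, a):
    """d log pi(a; s, theta) / d theta[s', a'] for a tabular softmax, flattened to length S*A."""
    g = np.zeros((S, A))
    g[s, :] = -pi[s, :]
    g[s, a] += 1.0
    return g.ravel()

# Equation (3): F(theta) = E_{rho^pi(s)} E_{pi(a; s, theta)} [ grad log pi  grad log pi^T ].
F = np.zeros((S * A, S * A))
for s in range(S):
    for a in range(A):
        psi = grad_log_pi(s, a)
        F += rho[s] * pi[s, a] * np.outer(psi, psi)

# Equation (4): natural gradient = F(theta)^{-1} grad eta(theta).
grad_eta = np.array([0.03, -0.03, 0.08, -0.08])   # placeholder vanilla gradient, flattened like theta
natural_grad = np.linalg.pinv(F) @ grad_eta       # pseudo-inverse handles the softmax rank deficiency
print(natural_grad)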

29 For an appropriate comparison, consider the case in which Q^π(s, a) is approximated by some compatible function approximator f^π(s, a; w) parameterized by w [9, 6]. [sent-45, score-0.236]

30 This function approximator is compatible with the policy in the sense that if we use the approximations f^π(s, a; w) in lieu of their true values to compute the gradient (equation 1), then the result would still be exact [9, 6] (and is thus a sensible choice to use in actor-critic schemes). [sent-50, score-1.194]

31 Solving for w gives w = F(θ)^{-1} ∇η(θ), and the result follows from the definition of the natural gradient. [sent-61, score-0.218]
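
This step can be checked numerically: with compatible features ψ^π(s, a) = ∇_θ log π(a; s, θ), the weighted least-squares normal equations for w read F(θ) w = ∇η(θ), so the fitted weights coincide with the natural gradient. The tiny tabular example below is an illustrative assumption, not the paper's code.

import numpy as np

# Placeholder tabular quantities: pi(a|s), stationary rho(s), and Q^pi(s, a).
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
rho = np.array([0.55, 0.45])
Q = np.array([[1.2, 0.4], [0.1, 1.9]])
S, A = pi.shape
m = S * A

def psi(s, a):
    """Compatible features: grad_theta log pi(a; s, theta) for a tabular softmax, flattened."""
    g = np.zeros((S, A))
    g[s, :] = -pi[s, :]
    g[s, a] += 1.0
    return g.ravel()

# Normal equations of the weighted least-squares fit
#   min_w  E_{rho, pi} [ (w^T psi(s, a) - Q(s, a))^2 ]   are   F(theta) w = E[psi Q].
F = np.zeros((m, m))
b = np.zeros(m)                       # b = E[psi Q], which by equation (1) equals grad eta(theta)
for s in range(S):
    for a in range(A):
        weight = rho[s] * pi[s, a]
        F += weight * np.outer(psi(s, a), psi(s, a))
        b += weight * psi(s, a) * Q[s, a]

w_fit = np.linalg.lstsq(F, b, rcond=None)[0]   # minimum-norm solution of F w = b
natural_grad = np.linalg.pinv(F) @ b           # F^{-1} grad eta in pseudo-inverse form
print(np.allclose(w_fit, natural_grad))        # True: the fitted weights equal the natural gradient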

32 □ Thus, sensible actor-critic frameworks (those using f^π(s, a; w)) are forced to use the natural gradient as the weights of a linear function approximator. [sent-62, score-0.636]

33 If the function approximation is accurate, then good actions (i.e. those with large state-action values) have feature vectors that have a large inner product with the natural gradient. [sent-63, score-0.21]

34 2 Greedy Policy Improvement A greedy policy improvement step using our function approximator would choose action a in state s if a ∈ argmax_{a'} f^π(s, a'; w). [sent-65, score-1.057]

35 In this section, we show that the natural gradient tends to move toward this best action, rather than just a good action. [sent-66, score-0.739]

36 Let us first consider policies in the exponential family (π(a; s, θ) ∝ exp(θ^T φ_sa), where φ_sa is some feature vector in ℝ^m). [sent-67, score-0.128]

37 The motivation for the exponential family is that it has affine geometry (i.e. the flat geometry of a plane), so a translation of a point by a tangent vector will keep the point on the manifold. [sent-68, score-0.106]

38 In general, crudely speaking, the probability manifold of π(a; s, θ) could be curved, so a translation of a point by a tangent vector would not necessarily keep the point on the manifold (such as on a sphere). [sent-69, score-0.339]

39 We now show, for the exponential family, that a sufficiently large step in the natural gradient direction will lead to a policy that is equivalent to a policy found after a greedy policy improvement step. [sent-71, score-2.42]

40 Since E_{π(a'; s, θ)}[φ_sa'] is not a function of a, it follows that argmax_{a'} f^π(s, a'; w) = argmax_{a'} ∇̃η(θ)^T φ_sa'. [sent-79, score-0.102]

41 After a gradient step, π(a; s, θ + α ∇̃η(θ)) ∝ exp(θ^T φ_sa + α ∇̃η(θ)^T φ_sa). [sent-80, score-0.423]
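
The limiting behaviour is easy to visualise: for an exponential-family policy π(a; s, θ) ∝ exp(θ^T φ_sa), increasing the step size α along a fixed direction d (standing in for ∇̃η(θ)) concentrates the policy on argmax_a d^T φ_sa, which by the argument above is the greedy action under f^π. The feature vectors, θ, and d below are made-up placeholders.

import numpy as np

phi = np.array([[1.0, 0.0],      # feature vectors phi_sa for three actions in one state (placeholders)
                [0.2, 0.8],
                [0.5, 0.5]])
theta = np.array([0.1, -0.2])
d = np.array([0.3, 1.0])         # stand-in for the natural gradient direction

def policy(theta_vec):
    """Exponential-family policy pi(a; s, theta) proportional to exp(theta^T phi_sa)."""
    logits = phi @ theta_vec
    z = np.exp(logits - logits.max())
    return z / z.sum()

for alpha in [0.0, 1.0, 10.0, 100.0]:
    print(alpha, policy(theta + alpha * d))
# As alpha grows, the probability mass concentrates on argmax_a d^T phi_sa (the second action here).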

42 It is in this sense that the natural gradient tends to move toward choosing the best action. [sent-83, score-0.776]

43 It is straightforward to show that if the standard non-covariant gradient rule is used instead, then π_∞(a; s) will select only a better action (not necessarily the best), i.e. it will choose an action a such that f^π(s, a; w) > E_{π(a'; s)}[f^π(s, a'; w)]. [sent-84, score-0.995]

44 The following theorem shows that the natural gradient is locally moving toward the best action, determined by the local linear approximator for Q^π(s, a). [sent-87, score-0.91]

45 □ It is interesting to note that choosing the greedy action will not in general improve the policy, and many detailed studies have gone into understanding this failure [3]. [sent-99, score-0.287]

46 However, with the overhead of a line search, we can guarantee improvement and move toward this greedy one step improvement. [sent-100, score-0.533]
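
One way to add this line-search overhead is a simple backtracking search along the natural gradient direction, accepting the largest tested step that improves the average reward. In the sketch below, evaluate_eta is a placeholder for an exact or estimated evaluation of η(θ) (here a toy concave surrogate so the snippet runs); the step sizes and function names are assumptions.

import numpy as np

def evaluate_eta(theta):
    """Placeholder for evaluating (or estimating) the average reward eta(theta)."""
    return -0.5 * float(theta @ theta) + float(theta.sum())   # toy concave surrogate

def line_search_step(theta, natural_grad, alphas=(8.0, 4.0, 2.0, 1.0, 0.5, 0.25)):
    """Return the largest tested step along the natural gradient that improves eta."""
    eta0 = evaluate_eta(theta)
    for alpha in alphas:
        candidate = theta + alpha * natural_grad
        if evaluate_eta(candidate) > eta0:
            return candidate
    return theta           # no tested step improved the average reward; keep the current policy

theta = np.zeros(2)
natural_grad = np.array([1.0, 1.0])   # placeholder natural gradient direction
print(line_search_step(theta, natural_grad))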

47 Initial improvement is guaranteed since F is positive definite. [sent-101, score-0.157]

48 4 Metrics and Curvatures Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F. [sent-102, score-0.176]

49 In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient [1], i.e. it attains the Cramér-Rao bound. [sent-103, score-0.257]

50 Our situation is more similar to the blind source separation case, where a metric is chosen based on the underlying parameter space [1] (of non-singular matrices) and is not necessarily asymptotically efficient (i.e. does not attain second-order convergence). [sent-104, score-0.496]

51 As argued by MacKay [7], one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, MacKay [7] arrives at the same result as Amari for the blind source separation case. [sent-105, score-0.296]

52 Although the previous sections argued that our choice is appropriate, we would like to understand how F relates to the Hessian ∇²η(θ), which, as shown in [5], has the form given in equation (6). Unfortunately, all terms in this Hessian are data-dependent (i.e. they are coupled to state-action values). [sent-106, score-0.313]

53 However, the Q-values weight this curvature of our policy, and our metric neglects such weighting. [sent-109, score-0.66]

54 Similar to the blind source separation case, our metric clearly does not necessarily converge to the Hessian and so it is not necessarily asymptotically efficient (i.e. does not attain a second-order convergence rate). [sent-110, score-0.621]

55 Conjugate methods would be expected to be more efficient near a local maximum. [sent-112, score-0.094]

56 5 Experiments We first look at the performance of the natural gradient in a few simple MDPs before examining its performance in the more challenging MDP of Tetris. [sent-113, score-0.73]

57 We simulated the natural policy gradient in a simple 1-dimensional linear quadratic regulator with dynamics x(t + 1) = . [sent-118, score-1.071]

58 The parameterized policy used was π(u; x, θ) ∝ exp(θ_1 x² + θ_2 x). [sent-121, score-0.604]

59 Figure 1A shows the performance improvement when the units of the parameters are scaled by a factor of 10 (see figure text). [sent-122, score-0.175]

60 The policy used was π(u; x, θ) ∝ exp(θ_1 s_1 x² + θ_2 s_2 x), where the rescaling constants, s_1 and s_2, are shown in the legend. [sent-149, score-0.529]

61 8) , the right-most three curves are generated using the standard gradient method and the rest use the natural gradient. [sent-151, score-0.624]

62 time (on a 10^7 scale) of a policy under standard gradient descent using the sigmoidal policy parameterization (π(1; s, θ_i) ∝ exp(θ_i)/(1 + exp(θ_i))), with the initial conditions π(i, 1) = . [sent-154, score-1.643]

63 time (unscaled) under standard gradient descent (solid line) and natural gradient descent (dashed line) for an early window of the above plot. [sent-158, score-1.299]

64 D) Phase space plot for the standard gradient case (the solid line) and the natural gradient case (dashed line) . [sent-159, score-1.047]

65 This is because F is not an invariant metric due to the weighting by ρ(s). [sent-162, score-0.206]

66 Using a sigmoidal policy parameterization (see figure text) and initial conditions corresponding to ρ(i) = . [sent-165, score-0.569]

67 2, both self-loop action probabilities will initially be increased under a gradient rule (since one step policy improvement chooses the self-loop for each state). [sent-167, score-1.212]

68 Since the standard gradient weights the learning to each parameter by ρ(s) (see equation 1), the self-loop action at state i is increased faster than the self-loop probability at j, which has the effect of decreasing the effective learning rate for state j even further. [sent-168, score-0.719]

69 This leads to an extremely flat plateau with average reward 1 (shown in Figure 1C top), where the learning for state j is thwarted by its low stationary probability. [sent-169, score-0.348]

70 This problem is so severe that before the optimal policy is reached, ρ(j) drops as low as 10^-7 from its initial value of . [sent-170, score-0.56]

71 Figure 1C bottom shows the performance of the natural gradient (in a very early time window of Figure 1C top). [sent-172, score-0.636]

72 Not only is the time to the optimal policy decreased by a factor of 10^7, the stationary distribution of state i never drops below . [sent-173, score-0.638]

73 Note, though, that the standard gradient does increase the average reward faster at the start, but only to be seduced into sticking at state i. [sent-175, score-0.75]

74 In general, if a table-lookup Boltzmann policy is used (i.e. π(a; s, θ) ∝ exp(θ_sa)), it is straightforward to show that the natural gradient weights the components of ∇η uniformly (instead of using ρ(s)), thus evening out the learning to all parameters. [sent-177, score-1.208]
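
This cancellation is easy to verify: for a table-lookup Boltzmann policy the Fisher matrix is block-diagonal with per-state blocks ρ(s) F_s(θ), and the vanilla gradient carries the same ρ(s) factor, so ρ(s) cancels in F^{-1} ∇η. The sketch below computes the natural gradient under two different stationary distributions and checks that it is unchanged; the policy and advantage values are placeholders.

import numpy as np

pi = np.array([[0.8, 0.2], [0.3, 0.7]])            # placeholder table-lookup Boltzmann policy
adv = np.array([[0.5, -2.0], [1.5, -9.0 / 14.0]])  # placeholder advantages (zero mean under pi)
S, A = pi.shape
m = S * A

def psi(s, a):
    """grad_theta log pi(a; s, theta) for a table-lookup Boltzmann policy, flattened."""
    g = np.zeros((S, A))
    g[s, :] = -pi[s, :]
    g[s, a] += 1.0
    return g.ravel()

def natural_gradient(rho):
    """F(theta)^{-1} grad eta(theta) for a given stationary distribution rho."""
    F = np.zeros((m, m))
    grad = np.zeros(m)
    for s in range(S):
        for a in range(A):
            weight = rho[s] * pi[s, a]
            F += weight * np.outer(psi(s, a), psi(s, a))
            grad += weight * psi(s, a) * adv[s, a]
    return np.linalg.pinv(F) @ grad

g1 = natural_gradient(np.array([0.9, 0.1]))
g2 = natural_gradient(np.array([0.2, 0.8]))
print(np.allclose(g1, g2))   # True: the rho(s) weighting cancels in the natural gradient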

75 The game of Tetris provides a challenging high-dimensional problem. [sent-178, score-0.131]

76 As shown in [3], greedy policy iteration methods using a linear function approximator exhibit drastic performance degradation after providing impressive improvement (see [3] for a description of the game, methods , and results). [sent-179, score-1.162]

77 Tetris provides an interesting case to test gradient methods. [sent-181, score-0.423]

78 The top curve duplicates the same results in [3] using the same features (which were simple functions of the heights of each column and the number of holes in the game). [sent-186, score-0.12]

79 We have no explanation for this performance degradation (nor does [3]). [sent-187, score-0.094]

80 The lower curve shows the poor performance of the standard gradient rule. [sent-188, score-0.622]

81 B) The curve on the right shows the natural policy gradient method (and uses the biased gradient method of [2] though this method alone gave poor performance). [sent-189, score-1.603]

82 We found we could obtain faster improvement and higher asymptotes if the robustifying factor of 10^-3 · I that we added to F was more carefully controlled (we did not carefully control the parameters). [sent-190, score-0.252]
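
The robustifying factor referred to here amounts to solving a regularised linear system; a minimal sketch, where the Fisher and gradient estimates are placeholders and 10^-3 follows the value quoted in the text:

import numpy as np

F_hat = np.array([[0.20, 0.10],     # placeholder (nearly singular) estimate of F(theta)
                  [0.10, 0.05001]])
grad_eta = np.array([0.30, 0.15])   # placeholder estimate of grad eta(theta)
epsilon = 1e-3                      # robustifying factor added to F, as quoted in the text
natural_grad = np.linalg.solve(F_hat + epsilon * np.eye(len(grad_eta)), grad_eta)
print(natural_grad)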

83 C) Due to the intensive computational power required by these simulations, we ran the gradient in a smaller Tetris game (height of 10 rather than 20) to demonstrate that the standard gradient updates (right curve) would eventually reach the same performance as the natural gradient (left curve). [sent-191, score-1.609]

84 We consider a policy compatible with the linear function approximator used in [3] (i.e. π(a; s, θ) ∝ exp(θ^T φ_sa), where φ_sa are the same feature vectors). [sent-193, score-0.681]

85 The lower curve in Figure 2A shows the particularly poor performance of the standard gradient method. [sent-195, score-0.622]

86 In an attempt to speed learning, we tried a variety of more sophisticated methods to no avail, such as conjugate methods, weight decay, annealing, the variance reduction method of [2], the Hessian in equation 6, etc. [sent-196, score-0.094]

87 Figure 2B shows a drastic improvement using the natural gradient (note that the timescale is linear). [sent-197, score-0.778]

88 This performance is consistent with our theoretical results in section 3, which showed that the natural gradient is moving toward the solution of a greedy policy improvement step. [sent-198, score-1.57]

89 The performance is somewhat slower than that of greedy policy iteration (left curve in Figure 2B), which is to be expected when using smaller steps. [sent-199, score-0.813]

90 However, the policy does not degrade with a gradient method. [sent-200, score-0.955]

91 Figure 2 shows that the performance of the standard gradient rule (right curve) eventually reaches the same performance as the natural gradient, in a scaled-down version of the game (see figure text). [sent-201, score-0.853]

92 6 Discussion Although gradient methods cannot make large policy changes compared to greedy policy iteration, section 3 implies that these two methods might not be that disparate, since a natural gradient method is moving toward the solution of a policy improvement step. [sent-202, score-2.994]

93 With the overhead of a line search, the methods are even more similar. [sent-203, score-0.119]

94 The benefit is that performance improvement is now guaranteed, unlike in a greedy policy iteration step. [sent-204, score-0.86]

95 It is interesting, and unfortunate, to note that F does not asymptotically converge to the Hessian, so conjugate gradient methods might be more sensible asymptotically. [sent-205, score-0.651]

96 However, far from the convergence point, the Hessian is not necessarily informative, and the natural gradient could be more efficient (as demonstrated in Tetris). [sent-206, score-0.764]

97 The intuition as to why the natural gradient could be efficient far from the maximum, is that it is pushing the policy toward choosing greedy optimal actions. [sent-207, score-1.427]

98 Sufficiently close to the maximum, little performance change occurs (due to the small gradient), so although conjugate methods might converge faster near the maximum, the corresponding performance change might be negligible. [sent-209, score-0.272]

99 More experimental work is necessary to further understand the effectiveness of the natural gradient. [sent-210, score-0.162]

100 Policy gradient methods for reinforcement learning with function approximation. [sent-259, score-0.5]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.486), ('gradient', 0.423), ('sa', 0.239), ('natural', 0.162), ('ie', 0.154), ('reward', 0.145), ('hessian', 0.143), ('greedy', 0.143), ('metric', 0.137), ('steepest', 0.127), ('descent', 0.126), ('improvement', 0.124), ('toward', 0.122), ('tetris', 0.117), ('approximator', 0.11), ('action', 0.107), ('necessarily', 0.091), ('game', 0.088), ('compatible', 0.085), ('manifold', 0.081), ('ex', 0.077), ('curve', 0.077), ('mdp', 0.073), ('drastic', 0.069), ('exp', 0.067), ('stationary', 0.061), ('moving', 0.059), ('covariant', 0.058), ('vtj', 0.058), ('blind', 0.058), ('policies', 0.057), ('iteration', 0.056), ('definition', 0.056), ('amari', 0.055), ('efficient', 0.054), ('state', 0.054), ('conjugate', 0.054), ('performance', 0.051), ('argmax', 0.051), ('sensible', 0.051), ('crudely', 0.051), ('evening', 0.051), ('asymptotically', 0.049), ('actions', 0.048), ('mdps', 0.048), ('parameterization', 0.048), ('average', 0.047), ('sham', 0.046), ('degrade', 0.046), ('coordinates', 0.046), ('height', 0.045), ('direction', 0.045), ('squared', 0.045), ('heights', 0.043), ('abuse', 0.043), ('argmaxa', 0.043), ('carefully', 0.043), ('degradation', 0.043), ('rescaling', 0.043), ('challenging', 0.043), ('faster', 0.042), ('parameterized', 0.041), ('plateau', 0.041), ('attain', 0.041), ('fs', 0.041), ('ot', 0.041), ('overhead', 0.041), ('methods', 0.04), ('choice', 0.039), ('rule', 0.039), ('family', 0.039), ('fisher', 0.039), ('standard', 0.039), ('line', 0.038), ('reinforcement', 0.037), ('weighting', 0.037), ('curvature', 0.037), ('drops', 0.037), ('severe', 0.037), ('choosing', 0.037), ('minimizes', 0.036), ('argued', 0.035), ('sigmoidal', 0.035), ('lc', 0.035), ('tangent', 0.035), ('separation', 0.035), ('straightforward', 0.035), ('theorem', 0.034), ('converge', 0.034), ('gatsby', 0.034), ('guaranteed', 0.033), ('dayan', 0.033), ('step', 0.033), ('exponential', 0.032), ('mackay', 0.032), ('poor', 0.032), ('invariant', 0.032), ('move', 0.032), ('source', 0.031), ('finite', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

2 0.31928951 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

3 0.31493998 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

4 0.27806839 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

5 0.26602462 59 nips-2001-Direct value-approximation for factored MDPs

Author: Dale Schuurmans, Relu Patrascu

Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1

6 0.23119552 40 nips-2001-Batch Value Function Approximation via Support Vectors

7 0.20782547 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

8 0.19506894 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

9 0.18813409 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

10 0.16524795 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.16067512 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

12 0.13794161 36 nips-2001-Approximate Dynamic Programming via Linear Programming

13 0.12424953 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

14 0.10810811 8 nips-2001-A General Greedy Approximation Algorithm with Applications

15 0.10419775 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures

16 0.093764707 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

17 0.091767386 126 nips-2001-Motivated Reinforcement Learning

18 0.091220766 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

19 0.08615867 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

20 0.086075939 139 nips-2001-Online Learning with Kernels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.279), (1, -0.133), (2, 0.483), (3, 0.018), (4, 0.075), (5, 0.138), (6, -0.026), (7, -0.071), (8, 0.096), (9, -0.036), (10, 0.033), (11, -0.033), (12, -0.004), (13, -0.06), (14, 0.011), (15, -0.019), (16, -0.022), (17, -0.01), (18, -0.084), (19, 0.06), (20, -0.004), (21, 0.05), (22, 0.043), (23, -0.052), (24, 0.084), (25, -0.059), (26, -0.023), (27, 0.037), (28, -0.046), (29, 0.141), (30, 0.014), (31, 0.013), (32, -0.03), (33, -0.031), (34, -0.002), (35, 0.007), (36, -0.019), (37, 0.054), (38, -0.038), (39, -0.071), (40, -0.059), (41, -0.009), (42, 0.033), (43, 0.028), (44, 0.078), (45, 0.094), (46, 0.152), (47, 0.039), (48, -0.052), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98305869 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

2 0.89913321 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

3 0.82619536 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

4 0.79949731 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

5 0.77606553 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

Author: Xin Wang, Thomas G. Dietterich

Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1

6 0.74525142 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

7 0.70277739 59 nips-2001-Direct value-approximation for factored MDPs

8 0.68226403 40 nips-2001-Batch Value Function Approximation via Support Vectors

9 0.66865247 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

10 0.58564425 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.55696416 36 nips-2001-Approximate Dynamic Programming via Linear Programming

12 0.54230011 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

13 0.51079875 177 nips-2001-Switch Packet Arbitration via Queue-Learning

14 0.46908706 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

15 0.39804369 148 nips-2001-Predictive Representations of State

16 0.39100498 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

17 0.36965162 126 nips-2001-Motivated Reinforcement Learning

18 0.35032642 76 nips-2001-Fast Parameter Estimation Using Green's Functions

19 0.34393498 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

20 0.34378934 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.066), (17, 0.091), (19, 0.065), (20, 0.013), (27, 0.142), (30, 0.077), (38, 0.019), (46, 0.121), (59, 0.021), (72, 0.07), (79, 0.065), (83, 0.039), (91, 0.13)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91344529 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

2 0.85168946 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

3 0.84939194 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

4 0.84046125 36 nips-2001-Approximate Dynamic Programming via Linear Programming

Author: Daniela Farias, Benjamin V. Roy

Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large- scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach

5 0.83987772 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

Author: Marzia Polito, Pietro Perona

Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1

6 0.83892274 89 nips-2001-Grouping with Bias

7 0.83746898 59 nips-2001-Direct value-approximation for factored MDPs

8 0.83745044 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

9 0.83452511 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

10 0.83352053 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

11 0.8321771 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

12 0.83149028 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

13 0.82896835 137 nips-2001-On the Convergence of Leveraging

14 0.8279435 171 nips-2001-Spectral Relaxation for K-means Clustering

15 0.82758093 8 nips-2001-A General Greedy Approximation Algorithm with Applications

16 0.82472128 56 nips-2001-Convolution Kernels for Natural Language

17 0.82466686 58 nips-2001-Covariance Kernels from Bayesian Generative Models

18 0.82439017 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

19 0.8238225 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

20 0.82355291 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes