
158 nips-2003-Policy Search by Dynamic Programming


Source: pdf

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
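The abstract does not spell out the algorithm itself, but the title and the role of the baseline distribution suggest a backward-in-time, baseline-weighted greedy policy search. The Python sketch below is only an illustration of that general structure on a toy chain MDP; the dynamics, the enumerated policy class, the uniform baseline, and every name in it (step, value, policy_class, baseline, HORIZON) are assumptions for demonstration, not the paper's exact method.

import numpy as np

# Illustrative sketch only: backward-in-time, baseline-weighted greedy policy
# search in the spirit of the title. All quantities below are toy assumptions.

# Tiny chain MDP: states 0..4, actions {0: left, 1: right}, reward 1 at state 4.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 4

def step(s, a):
    """Deterministic chain dynamics; reward only for being at the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

def value(s, t, policies):
    """Exact value of following policies[t:], starting from state s at time t."""
    total = 0.0
    for tau in range(t, HORIZON):
        s, r = step(s, policies[tau][s])
        total += r
    return total

# Policy class: all deterministic state->action maps for this tiny state space.
policy_class = [np.array(p) for p in np.ndindex(*([N_ACTIONS] * N_STATES))]

# "Baseline distribution" (hypothetical choice): uniform over states at each time.
baseline = np.full((HORIZON, N_STATES), 1.0 / N_STATES)

# Backward pass: greedily pick the time-t policy that maximizes baseline-weighted
# value, holding the later (already chosen) policies fixed.
policies = [policy_class[0]] * HORIZON
for t in reversed(range(HORIZON)):
    def score(pi):
        trial = policies.copy()
        trial[t] = pi
        return sum(baseline[t, s] * value(s, t, trial) for s in range(N_STATES))
    policies[t] = max(policy_class, key=score)

print([p.tolist() for p in policies])

Each backward step fits one time-indexed policy and never revisits it, which is one way such a loop can terminate after a fixed, finite number of policy fits, as the abstract's finite-step claim suggests.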


reference text

[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Comp. Sci., 1998.

[2] C. Atkeson and J. Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In NIPS 15, 2003.

[3] F. Gomez. http://www.cs.utexas.edu/users/nn/pages/software/software.html.

[4] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.

[5] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning, 2002.

[6] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. (extended version of paper in NIPS 12), 1999.

[7] M. Littman. Memoryless policies: theoretical limitations and practical results. In Proc. 3rd Conference on Simulation of Adaptive Behavior, 1994.

[8] Andrew Y. Ng and Michael I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000.

[9] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Footnote 3 (from the paper): In our setting, we use weighted logistic regression and minimize $-\ell(\theta) = -\sum_i w^{(i)} \log p(y^{(i)} \mid s^{(i)}, \theta)$, where $p(y = 1 \mid s, \theta) = 1/(1 + \exp(-\theta^\top s))$. It is straightforward to show that this is a (convex) upper bound on the objective function $T(\theta)$.
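As a concrete illustration of that footnote, the sketch below fits a weighted logistic regression by plain gradient descent on the weighted negative log-likelihood. The synthetic data, the learning rate, and the helper names (fit_weighted_logistic, sigmoid) are hypothetical; the footnote does not prescribe a particular optimizer.

import numpy as np

# Minimal sketch of weighted logistic regression: minimize the weighted
# negative log-likelihood -l(theta) = -sum_i w_i * log p(y_i | s_i, theta),
# with p(y=1 | s, theta) = sigmoid(theta^T s).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logistic(S, y, w, lr=0.1, n_iters=2000):
    """Gradient descent on the weighted negative log-likelihood.

    S: (n, d) feature matrix; y: (n,) labels in {0, 1}; w: (n,) nonnegative weights.
    """
    theta = np.zeros(S.shape[1])
    for _ in range(n_iters):
        p = sigmoid(S @ theta)        # predicted P(y = 1 | s, theta)
        grad = S.T @ (w * (p - y))    # gradient of the weighted negative log-likelihood
        theta -= lr * grad / len(y)
    return theta

# Toy usage with synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 3))
y = (S[:, 0] + 0.5 * S[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(float)
w = rng.uniform(0.5, 1.5, size=200)   # per-example weights
theta_hat = fit_weighted_logistic(S, y, w)
print(theta_hat)

Because the weights are nonnegative, this objective is a nonnegative combination of convex per-example losses, which is what makes the convex upper-bound argument in the footnote straightforward.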