nips nips2001 nips2001-121 nips2001-121-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off-policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has heretofore lacked a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI), addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
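The loop the abstract describes can be made concrete with a short sketch: a least-squares evaluation step (an LSTDQ-style solve for the weights of a linear Q-function) alternating with greedy policy improvement. This is a minimal illustration and not the authors' implementation; the sample format (s, a, r, s', done), the user-supplied state-action feature map phi, and the small ridge term used to keep the matrix invertible are all assumptions made for the example.

```python
# Minimal sketch of least-squares policy iteration with an LSTDQ-style
# inner loop (NumPy only). Illustrative reconstruction, not the authors'
# code; `phi`, the sample format, and the `reg` term are assumptions.
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95, reg=1e-6):
    """Fit weights w so that Q(s, a) ~ phi(s, a) . w for the given policy,
    using off-policy samples (s, a, r, s_next, done)."""
    A = reg * np.eye(k)          # ridge term keeps A well conditioned
    b = np.zeros(k)
    for (s, a, r, s_next, done) in samples:
        f = phi(s, a)
        f_next = np.zeros(k) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, n_actions, k, gamma=0.95, n_iter=20, tol=1e-4):
    """Alternate LSTDQ evaluation with greedy policy improvement
    until the weight vector stops changing."""
    w = np.zeros(k)
    # Greedy policy with respect to the current weights w.
    greedy = lambda s: int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))
    for _ in range(n_iter):
        w_new = lstdq(samples, phi, greedy, k, gamma)
        done = np.linalg.norm(w_new - w) < tol
        w = w_new
        if done:
            break
    return w
```

Note that every call to lstdq rebuilds the A matrix and b vector from the same stored batch of samples, so one set of (possibly random) trajectories is reused across all policy-improvement iterations; this is the off-policy data reuse the abstract emphasizes.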
[1] J. Baxter and P. Bartlett. Reinforcement learning in POMDP's via direct gradient ascent. In Proc. 17th International Conf. on Machine Learning, pages 41–48. Morgan Kaufmann, San Francisco, CA, 2000.
[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[3] J. A. Boyan. Least-squares temporal difference learning. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 49–56. Morgan Kaufmann, San Francisco, CA, 1999.
[4] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1/2/3):33–57, 1996.
[5] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00). Morgan Kaufmann, 2000.
[6] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. MIT Press, 2000.
[7] M. G. Lagoudakis and R. Parr. Model-Free Least-Squares policy iteration. Technical Report CS-2001-05, Department of Computer Science, Duke University, December 2001.
[8] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00). Morgan Kaufmann, 2000.
[9] A. Ng, R. Parr, and D. Koller. Policy search via density estimation. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. MIT Press, 2000.
[10] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. 16th International Conf. on Machine Learning, pages 278–287. Morgan Kaufmann, San Francisco, CA, 1999.
[11] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, to appear, 2001.
[12] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.
[13] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. MIT Press, 2000.