
159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control


Source: pdf

Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári

Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally expensive. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on the Boyan chain problem shows that multi-step linear Dyna learns the value function faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts at extending LSTD to online control were unsuccessful because LSTD looks infinitely many steps into the future and therefore suffers from model errors in non-stationary (control) environments.
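The multi-step projection described in the abstract can be illustrated with a short sketch. The Python snippet below is an illustration only, not the paper's code: it iterates a learned one-step linear feature model F (next feature approximately F x) and reward model f (reward approximately f^T x) for k steps from a sampled feature, then applies a TD-style planning update. All names here (F, f, theta, x, k, gamma, alpha) are assumed notation for this sketch rather than the paper's exact formulation.

import numpy as np

def k_step_projection(F, f, x, k, gamma):
    # Iterate the one-step linear model k times from feature vector x.
    # Returns the projected feature F^k x and the discounted k-step reward
    # sum_{i=0}^{k-1} gamma^i * f^T (F^i x).
    reward = 0.0
    x_proj = x.copy()
    for i in range(k):
        reward += (gamma ** i) * float(f @ x_proj)  # accumulate predicted reward
        x_proj = F @ x_proj                         # project one more step ahead
    return x_proj, reward

def planning_update(theta, F, f, x, k, gamma, alpha):
    # One Dyna planning step: TD(0)-style update on a simulated k-step transition.
    x_k, r_k = k_step_projection(F, f, x, k, gamma)
    td_error = r_k + (gamma ** k) * float(theta @ x_k) - float(theta @ x)
    return theta + alpha * td_error * x

if __name__ == "__main__":
    # Toy usage with made-up model parameters.
    n = 4
    rng = np.random.default_rng(0)
    F = 0.9 * np.eye(n)          # toy one-step feature-transition model
    f = rng.normal(size=n)       # toy one-step reward model
    theta = np.zeros(n)
    x = rng.normal(size=n)       # a sampled feature to plan from
    theta = planning_update(theta, F, f, x, k=3, gamma=0.95, alpha=0.1)

Note that this sketch corresponds to the first (iterated) model and therefore pays for k matrix-vector products per planning step; the paper's second model instead interpolates between the one-step and infinite-step (LSTD) models so it can be learned and applied efficiently online.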


reference text

Bertsekas, D. P., Borkar, V., & Nedich, A. (2004). Improved temporal difference methods with linear function approximation. Learning and Approximate Dynamic Programming (pp. 231–255). IEEE Press.

Boyan, J. A. (1999). Least-squares temporal difference learning. ICML-16.

Bradtke, S., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Li, L., Littman, M. L., & Mansley, C. R. (2009). Online exploration in least-squares policy iteration. AAMAS-8.

Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71, 1180–1190.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML-7.

Sutton, R. S. (1995). TD models: modeling the world at a mixture of time scales. ICML-12.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Sutton, R. S., Szepesvári, C., Geramifard, A., & Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. UAI-24.

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.

Xu, X., He, H., & Hu, D. (2002). Efficient reinforcement learning using recursive least-squares methods. Journal of Artificial Intelligence Research, 16, 259–292.