nips2007-102
Source: pdf
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
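Below is a minimal sketch, in Python, of the kind of incremental actor-critic update the abstract describes: a TD-learning critic for the value-function parameters and a stochastic-gradient actor step along the compatible features (the gradient of the log-policy), on two timescales. This is not the paper's exact pseudocode: the small random MDP, tabular features, softmax policy, and the constant step sizes alpha, beta, and xi are illustrative assumptions, and the actor step shown is the plain (non-natural) gradient variant.

# A minimal sketch of an incremental average-reward actor-critic update,
# in the spirit of the algorithms described in the abstract (not the paper's
# exact pseudocode). The random MDP, tabular features, softmax policy, and
# constant step sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
# Hypothetical MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

phi = np.eye(n_states)                   # critic features (tabular for clarity)
v = np.zeros(n_states)                   # value-function (critic) weights
theta = np.zeros((n_states, n_actions))  # policy (actor) parameters
avg_reward = 0.0                         # running estimate of the average reward

alpha, beta, xi = 0.1, 0.01, 0.05        # critic faster than actor (two timescales)

def softmax_policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for t in range(100_000):
    pi = softmax_policy(s)
    a = rng.choice(n_actions, p=pi)
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # Average-reward TD error; it also serves as an estimate of the advantage.
    delta = r - avg_reward + phi[s_next] @ v - phi[s] @ v

    avg_reward += xi * (r - avg_reward)      # update average-reward estimate
    v += alpha * delta * phi[s]              # critic: temporal difference learning

    # Compatible features psi = grad_theta log pi(a|s); for a softmax policy
    # with one parameter per (state, action) pair this is (indicator(a) - pi).
    psi = -pi
    psi[a] += 1.0

    # Actor: stochastic gradient step along delta * psi.
    theta[s] += beta * delta * psi

    s = s_next

print("estimated average reward:", avg_reward)

The natural-gradient variants discussed in the abstract would replace the final actor step with a move along an estimate of the natural gradient, for example one obtained from a running estimate of the compatible-approximation weights, rather than using delta * psi directly.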
References:
Bagnell, J., & Schneider, J. (2003). Covariant policy search. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.
Barto, A. G., Sutton, R. S., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835–846.
Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2007). Natural actor-critic algorithms. Submitted to Automatica.
Greensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Kakade, S. (2002). A natural policy gradient. Proceedings of NIPS 14.
Konda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. Proceedings of NIPS 12 (pp. 1008–1014).
Marbach, P. (1998). Simulation-based methods for Markov decision processes. Doctoral dissertation, MIT.
Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics. Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots.
Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. Proceedings of the Sixteenth European Conference on Machine Learning (pp. 280–291).
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts Amherst.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Proceedings of NIPS 12 (pp. 1057–1063).