nips nips2009 nips2009-60 nips2009-60-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations of previous TD methods, as they converge to the same limit points when used with linear function approximation. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness.
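To give a concrete picture of the kind of update the abstract describes, the NumPy sketch below implements a simplified two-timescale, TDC-style gradient update for a small smooth value network. This is an illustrative assumption rather than the paper's pseudocode: the names (ValueNet, hessian_vector, tdc_style_update) are invented for this example, and the Hessian-vector product is approximated by finite differences instead of the exact method of Pearlmutter (1994) cited below.

import numpy as np

class ValueNet:
    """Tiny smooth value approximator V_theta(s) with one tanh hidden layer."""
    def __init__(self, n_inputs, n_hidden, rng):
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_inputs))
        self.w2 = 0.1 * rng.standard_normal(n_hidden)

    def params(self):
        return np.concatenate([self.W1.ravel(), self.w2])

    def set_params(self, theta):
        n = self.W1.size
        self.W1 = theta[:n].reshape(self.W1.shape)
        self.w2 = theta[n:]

    def value(self, s):
        return float(self.w2 @ np.tanh(self.W1 @ s))

    def grad(self, s):
        # Gradient of V_theta(s) with respect to all parameters.
        h = np.tanh(self.W1 @ s)
        dW1 = np.outer(self.w2 * (1.0 - h ** 2), s)
        return np.concatenate([dW1.ravel(), h])

def hessian_vector(net, s, v, eps=1e-5):
    # Approximate (d^2 V / d theta^2) v by central differences of the gradient
    # (a stand-in for Pearlmutter's exact Hessian-vector trick).
    theta = net.params()
    net.set_params(theta + eps * v); g_plus = net.grad(s)
    net.set_params(theta - eps * v); g_minus = net.grad(s)
    net.set_params(theta)
    return (g_plus - g_minus) / (2.0 * eps)

def tdc_style_update(net, w, s, r, s_next, gamma, alpha, beta):
    # One two-timescale update in the spirit of nonlinear TDC: the auxiliary
    # weights w track the TD error in the tangent space of V_theta, while the
    # slow weights theta follow an estimated gradient of the Bellman-error
    # objective, including a curvature (Hessian) correction term.
    phi, phi_next = net.grad(s), net.grad(s_next)
    delta = r + gamma * net.value(s_next) - net.value(s)
    h = (delta - phi @ w) * hessian_vector(net, s, w)
    theta = net.params() + alpha * (delta * phi - gamma * (phi @ w) * phi_next - h)
    net.set_params(theta)
    return w + beta * (delta - phi @ w) * phi

# Example usage on a single random transition; beta > alpha so that the
# auxiliary weights adapt on the faster timescale.
rng = np.random.default_rng(0)
net = ValueNet(n_inputs=4, n_hidden=8, rng=rng)
w = np.zeros(net.params().size)
s, s_next = rng.standard_normal(4), rng.standard_normal(4)
w = tdc_style_update(net, w, s, r=1.0, s_next=s_next, gamma=0.99, alpha=1e-3, beta=1e-2)

Per-update cost is linear in the number of parameters, matching the complexity claim in the abstract, since only gradients and Hessian-vector products (never full Hessians) are formed.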
Antos, A., Szepesvári, Cs. & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71: 89–129.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Borkar, V. S. & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2): 447–469.
Boyan, J. A. & Moore, A. W. (1995). Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Advances in Neural Information Processing Systems 7, pp. 369–376. MIT Press.
Crites, R. H. & Barto, A. G. (1995). Improving Elevator Performance Using Reinforcement Learning. In Advances in Neural Information Processing Systems 8, pp. 1017–1023. MIT Press.
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs. & Mannor, S. (2009). Regularized Policy Iteration. In Advances in Neural Information Processing Systems 21, pp. 441–448.
Kushner, H. J. & Yin, G. G. (2003). Stochastic Approximation Algorithms and Applications. Second edition, Springer-Verlag.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation 6(1): 147–160.
Silver, D. (2009). Reinforcement Learning and Simulation-Based Search in Computer Go. Ph.D. thesis, University of Alberta.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning 3: 9–44.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Szepesvári, Cs. & Maei, H. R. (2009a). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, pp. 1609–1616. MIT Press.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs. & Wiewiora, E. (2009b). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, pp. 993–1000. Omnipress.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning 8: 257–277.
Tsitsiklis, J. N. & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42: 674–690.
Zhang, W. & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1114–1120. AAAI Press.