Author: Eyal Even-Dar, Yishay Mansour
Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of ε-greedy Q-learning.
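Read as algorithms, the two variants described in the abstract are easy to sketch. Below is a minimal tabular sketch of both, assuming a hypothetical `env` with Gym-style `reset()` and `step(a)` that use integer states and return `(next_state, reward, done)`; the initial value `q_init`, the `1/visits` learning rate, and the promotion schedule are illustrative choices, not the paper's exact constants.

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, gamma=0.95,
                          q_init=100.0, episodes=500):
    """Tabular Q-learning with optimistic initialization and a greedy policy."""
    # Every Q-value starts at a large optimistic constant; per the abstract,
    # a sufficiently large choice guarantees an epsilon-optimal greedy policy.
    Q = np.full((n_states, n_actions), q_init)
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s]))           # purely greedy, no randomization
            s2, r, done = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]         # a simple decaying step size
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

def incremental_q_learning(env, n_states, n_actions, gamma=0.95,
                           episodes=500):
    """Tabular Q-learning where untaken actions are gradually promoted."""
    Q = np.zeros((n_states, n_actions))
    promo = np.zeros((n_states, n_actions))    # promotion bonus per (s, a)
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Greedy with respect to the promoted values: the bonus keeps
            # growing on skipped actions, so each is eventually retried.
            a = int(np.argmax(Q[s] + promo[s]))
            s2, r, done = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            promo[s] += 1.0 / (1.0 + visits[s].sum())  # promote all actions in s,
            promo[s, a] = 0.0                          # then reset the one taken
            s = s2
    return Q
```

The promotion update is what makes the second variant a derandomized counterpart of ε-greedy: instead of exploring with probability ε, an untaken action's bonus grows deterministically until it becomes the greedy choice.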
[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[2] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control, 38(2):447-469, 2000.
[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. In IJCAI, 2001.
[4] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. In COLT, 2001.
[5] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241-266, 1974.
[6] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1994.
[7] M. Kearns and S. Singh. Efficient reinforcement learning: theoretical framework and algorithms. In ICML, 1998.
[8] M. Littman and Cs. Szepesvári. A generalized reinforcement learning model: convergence and applications. In ICML, 1996.
[9] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[10] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[11] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
[12] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3/4):279-292, 1992.
[13] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.