nips2011-268 reference knowledge-graph by maker-knowledge-mining

268 nips-2011-Speedy Q-Learning


Source: pdf

Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos

Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n) / (ε^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 − γ), and thus is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered more efficient than incremental methods such as Q-learning.
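The abstract states only the PAC bound, not the update rule itself. As an illustration, the following is a minimal Python sketch of a synchronous speedy Q-learning iteration, assuming the update form Q_{k+1} = Q_k + α_k (T_k Q_{k−1} − Q_k) + (1 − α_k)(T_k Q_k − T_k Q_{k−1}) with step size α_k = 1/(k+1) and T_k the empirical Bellman operator, as described in reference [2]; all function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def speedy_q_learning(P, R, gamma, num_iters, rng=None):
    """Sketch of synchronous speedy Q-learning on a tabular MDP.

    P: (n_states, n_actions, n_states) transition probabilities
    R: (n_states, n_actions) expected rewards
    Assumes the SQL update from reference [2]:
      Q_{k+1} = Q_k + a_k (T_k Q_{k-1} - Q_k) + (1 - a_k)(T_k Q_k - T_k Q_{k-1}),
    with a_k = 1/(k+1) and T_k the empirical Bellman operator built from one
    sampled next state per (state, action) pair at iteration k.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions, _ = P.shape
    Q_prev = np.zeros((n_states, n_actions))  # Q_{k-1}
    Q = np.zeros((n_states, n_actions))       # Q_k

    for k in range(num_iters):
        alpha = 1.0 / (k + 1)
        TQ_prev = np.empty_like(Q)
        TQ = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # One sampled next state defines the empirical operator T_k
                # applied to both Q_{k-1} and Q_k.
                s_next = rng.choice(n_states, p=P[s, a])
                TQ_prev[s, a] = R[s, a] + gamma * Q_prev[s_next].max()
                TQ[s, a] = R[s, a] + gamma * Q[s_next].max()
        Q_new = Q + alpha * (TQ_prev - Q) + (1 - alpha) * (TQ - TQ_prev)
        Q_prev, Q = Q, Q_new
    return Q
```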


reference text

[1] A. Antos, R. Munos, and Cs. Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, 2007.

[2] M. Gheshlaghi Azar, R. Munos, M. Ghavamzadeh, and H.J. Kappen. Reinforcement learning with a near optimal rate of convergence. Technical Report inria-00636615, INRIA, 2011.

[3] P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[4] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, Belmont, Massachusetts, third edition, 2007.

[5] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.

[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[7] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In 15th Annual Conference on Computational Learning Theory, pages 255–270, 2002.

[8] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2003.

[9] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, 1968.

[10] T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming. Neural Computation, 6(6):1185–1201, 1994.

[11] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[12] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 12, pages 996–1002. MIT Press, 1999.

[13] R. Munos and Cs. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

[14] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22(1-3):283–290, 1996.

[15] A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.

[17] Cs. Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems 10, Denver, Colorado, USA, 1997.

[18] I. Szita and Cs. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.

[19] H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems 23, pages 2613–2621, 2010.

[20] C. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989.