
66 nips-2010-Double Q-learning


Source: pdf

Author: Hado van Hasselt

Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show that the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
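The following is a minimal tabular sketch of the double-estimator idea the abstract describes: two value tables are maintained, one selects the greedy action and the other evaluates it, so a single max is never used as both selector and evaluator. The function and variable names (double_q_update, QA, QB) and the learning-rate and discount defaults are illustrative assumptions for this summary, not code from the paper.

import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # One tabular Double Q-learning style update (sketch). With probability
    # 0.5 we update QA, using QB to evaluate the action QA considers best,
    # and vice versa; this avoids the positive bias introduced when the same
    # estimator both selects and evaluates the maximizing action.
    if random.random() < 0.5:
        a_star = max(actions, key=lambda b: QA[(s_next, b)])   # QA selects
        target = r + gamma * QB[(s_next, a_star)]              # QB evaluates
        QA[(s, a)] += alpha * (target - QA[(s, a)])
    else:
        b_star = max(actions, key=lambda b: QB[(s_next, b)])   # QB selects
        target = r + gamma * QA[(s_next, b_star)]              # QA evaluates
        QB[(s, a)] += alpha * (target - QB[(s, a)])

# Usage sketch: tables keyed by (state, action) pairs, initialised to zero.
QA, QB = defaultdict(float), defaultdict(float)

During learning, either table (or a combination of the two) can drive the behaviour policy; that choice is outside the scope of this sketch.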


reference text

[1] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, England, 1989.

[2] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

[3] R. Bellman. Dynamic Programming. Princeton University Press, 1957.

[4] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.

[5] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.

[6] M. L. Littman and C. Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning (ICML-96), pages 310–318, Bari, Italy, 1996. Morgan Kaufmann.

[7] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1017–1023, Cambridge MA, 1996. MIT Press.

[8] W. D. Smart and L. P. Kaelbling. Effective reinforcement learning for mobile robots. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA 2002), pages 3404–3410, Washington, DC, USA, 2002.

[9] M. A. Wiering and H. P. van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(4):930–936, 2008.

[10] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.

[11] M. J. Kearns and S. P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Neural Information Processing Systems 12, pages 996–1002. MIT Press, 1999.

[12] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556, 2005.

[13] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In NIPS ’97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 1064–1070, Cambridge, MA, USA, 1998. MIT Press.

[14] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2003.

[15] E. Van den Steen. Rational overoptimism (and other biases). American Economic Review, 94(4):1141–1151, September 2004.

[16] J. E. Smith and R. L. Winkler. The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

[17] E. Capen, R. Clapp, and T. Campbell. Bidding in high risk situations. Journal of Petroleum Technology, 23:641–653, 1971.

[18] R. H. Thaler. Anomalies: The winner’s curse. Journal of Economic Perspectives, 2(1):191–202, Winter 1988.

[19] J. L. W. V. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1):175–193, 1906.

[20] S. P. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.

[21] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT press, Cambridge MA, 1998.

[23] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. In Proceedings of the 24th international conference on Machine learning, pages 721– 728. ACM, 2007.

[24] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.