
nips2008-150: Near-optimal Regret Bounds for Reinforcement Learning


Source: pdf

Author: Peter Auer, Thomas Jaksch, Ronald Ortner

Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.
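
For concreteness, the quantities named in the abstract can be written out as follows. This is a sketch restating the abstract's definitions in standard notation; the symbols T(s' | π, s) for the expected hitting time and ρ* for the optimal average reward are notational assumptions of this sketch, not taken verbatim from the abstract.

% Diameter of an MDP M: the smallest D such that, for every ordered pair of
% distinct states s, s', some stationary policy reaches s' from s in at most
% D expected steps.
D(M) := \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\bigl[\, T(s' \mid \pi, s) \,\bigr]

% Total regret after T steps, measured against the optimal average reward \rho^*
% (r_t denotes the reward collected at step t):
\Delta(M, T) := T \rho^{*} - \sum_{t=1}^{T} r_t

% Bounds stated in the abstract:
\Delta(M, T) = \tilde{O}\bigl( D S \sqrt{A T} \bigr) \quad \text{(upper bound, with high probability)}
\Delta(M, T) = \Omega\bigl( \sqrt{D S A T} \bigr) \quad \text{(lower bound for any learning algorithm)}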


reference text

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[2] Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.

[3] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209–232, 2002.

[4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.

[5] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 19, pages 49–56. MIT Press, 2007.

[6] Ronen I. Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.

[7] Ambuj Tewari and Peter Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems 20, pages 1505–1512. MIT Press, 2008.

[8] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.

[9] Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 857–864, 2005.

[10] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci., 74(8):1309–1331, 2008.

[11] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222–255, 1997.

[12] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, 2005.

[13] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Technical Report CIT-2009-01, University of Leoben, Chair for Information Technology, 2009. http://institute.unileoben.ac.at/infotech/publications/TR/CIT-2009-01.pdf.