
jmlr-2010-79: Near-optimal Regret Bounds for Reinforcement Learning


Source: pdf

Author: Thomas Jaksch, Ronald Ortner, Peter Auer

Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number ℓ of times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ^{1/3} T^{2/3} D√(SA)). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
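
To make the diameter concrete, here is a minimal sketch (not taken from the paper) that computes D for a small MDP given as a transition array of shape (S, A, S). For each target state it solves the stochastic shortest-path problem "minimize the expected number of steps to reach the target" by value iteration with unit step costs, then takes the maximum over all ordered pairs of states. The function names, the fixed iteration budget, and the tolerance are assumptions of this illustration, and it presumes the diameter is finite (a communicating MDP).

    import numpy as np

    def expected_hitting_times(P, target, max_iter=10_000, tol=1e-10):
        # Minimum expected number of steps to reach `target` from every state,
        # optimizing over stationary deterministic policies.
        # P has shape (S, A, S) with P[s, a, s2] = probability of moving s -> s2
        # under action a. This is value iteration for a stochastic shortest-path
        # problem with unit step costs and the target treated as absorbing.
        S, A, _ = P.shape
        h = np.zeros(S)                           # h[s]: current estimate of the hitting time
        for _ in range(max_iter):
            h_next = 1.0 + np.min(P @ h, axis=1)  # one step plus best expected remaining time
            h_next[target] = 0.0                  # already at the target: zero remaining time
            if np.max(np.abs(h_next - h)) < tol:
                return h_next
            h = h_next
        return h                                  # may be inaccurate if D is very large or infinite

    def diameter(P):
        # Diameter D = max over ordered state pairs (s, s2) of the minimal expected
        # travel time from s to s2, as defined in the abstract above.
        S = P.shape[0]
        return max(expected_hitting_times(P, t).max() for t in range(S))

    # Toy example: two states, two actions; the "move" action succeeds with
    # probability 0.1, so the expected travel time between the states is 10 and D = 10.
    P = np.array([
        [[1.0, 0.0], [0.9, 0.1]],   # from state 0: action 0 stays, action 1 moves w.p. 0.1
        [[0.0, 1.0], [0.1, 0.9]],   # from state 1: action 0 stays, action 1 moves w.p. 0.1
    ])
    print(diameter(P))              # ≈ 10.0

For the toy MDP the best policy for reaching the other state repeatedly plays the "move" action, which succeeds with probability 0.1, so the expected travel time is 1/0.1 = 10 in both directions and the value iteration converges to D ≈ 10.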


Reference text

Peter Auer and Ronald Ortner. Logarithmic online regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 49–56. MIT Press, 2007.
Peter Auer and Ronald Ortner. Online regret bounds for a new reinforcement learning algorithm. In Proceedings 1st Austrian Cognitive Vision Workshop (ACVW 2005), pages 35–42. ÖCG, 2005.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Mach. Learn., 47:235–256, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM J. Comput., 32:48–77, 2002b.
Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2009), 2009.
Ronen I. Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.
Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222–255, 1997.
Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 401–408. MIT Press, 2005.
Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Math. Oper. Res., 34(3):726–736, 2009.
Claude-Nicolas Fiechter. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), pages 88–97. ACM, 1994.
David A. Freedman. On tail probabilities for martingales. Ann. Probab., 3:100–118, 1975.
Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. Preprint, 2008. URL http://arxiv.org/pdf/0805.3415.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209–232, 2002.
Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11 (NIPS 1998), pages 996–1002. MIT Press, 1999.
Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648, 2004.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), pages 857–864. ACM, 2005.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci., 74(8):1309–1331, 2008.
Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), pages 881–888. ACM, 2006.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems 20 (NIPS 2007), pages 1505–1512. MIT Press, 2008.
Ambuj Tewari and Peter L. Bartlett. Bounded parameter Markov decision processes with average reward criterion. In Learning Theory, 20th Annual Conference on Learning Theory (COLT 2007), pages 263–277, 2007.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Technical Report HPL-2003-97, HP Laboratories Palo Alto, 2003. URL www.hpl.hp.com/techreports/2003/HPL-2003-97R1.pdf.
Jia Yuan Yu and Shie Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 1177–1184, 2009.
Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Math. Oper. Res., 34(3):737–757, 2009.