jmlr jmlr2009 jmlr2009-79 jmlr2009-79-reference knowledge-graph by maker-knowledge-mining

79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis


Source: pdf

Author: Alexander L. Strehl, Lihong Li, Michael L. Littman

Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis of upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.

Keywords: reinforcement learning, Markov decision processes, PAC-MDP, exploration, sample complexity
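The abstract contrasts the model-based R-MAX algorithm with the model-free Delayed Q-learning. As a quick illustration of the model-based side, below is a minimal Python sketch of the R-MAX idea (Brafman and Tennenholtz, 2002): state-action pairs visited fewer than m_known times are modeled optimistically as absorbing self-loops that pay the maximum reward, and the agent acts greedily with respect to value iteration on that optimistic model. The environment interface, the class and parameter names (e.g., m_known), and the fixed planning horizon are assumptions made for this example, not code or notation from the paper.

# Minimal, illustrative sketch of the R-MAX idea (Brafman & Tennenholtz, 2002)
# on a finite MDP. Not code from the paper; names and constants are assumptions.
import numpy as np

class RMaxAgent:
    def __init__(self, n_states, n_actions, r_max, gamma=0.95, m_known=5):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.gamma, self.m = r_max, gamma, m_known
        self.counts = np.zeros((n_states, n_actions))                   # visit counts n(s,a)
        self.reward_sum = np.zeros((n_states, n_actions))               # accumulated rewards
        self.trans_counts = np.zeros((n_states, n_actions, n_states))   # counts n(s,a,s')

    def _optimistic_model(self):
        """Empirical model for 'known' (s,a) pairs; unknown pairs become
        absorbing self-loops paying r_max (optimism in the face of uncertainty)."""
        R = np.full((self.nS, self.nA), self.r_max, dtype=float)
        T = np.zeros((self.nS, self.nA, self.nS))
        for s in range(self.nS):
            for a in range(self.nA):
                if self.counts[s, a] >= self.m:                          # (s,a) is "known"
                    R[s, a] = self.reward_sum[s, a] / self.counts[s, a]
                    T[s, a] = self.trans_counts[s, a] / self.counts[s, a]
                else:                                                    # unknown: optimistic self-loop
                    T[s, a, s] = 1.0
        return R, T

    def act(self, s):
        """Plan by value iteration on the optimistic model, then act greedily."""
        R, T = self._optimistic_model()
        Q = np.zeros((self.nS, self.nA))
        for _ in range(200):                                             # fixed number of sweeps
            V = Q.max(axis=1)
            Q = R + self.gamma * (T @ V)                                 # Bellman backup
        return int(np.argmax(Q[s]))

    def update(self, s, a, r, s_next):
        """Record experience only while (s,a) is still unknown, as in R-MAX."""
        if self.counts[s, a] < self.m:
            self.counts[s, a] += 1
            self.reward_sum[s, a] += r
            self.trans_counts[s, a, s_next] += 1

# Typical interaction loop (a hypothetical env with a step() method is assumed):
# agent = RMaxAgent(n_states=10, n_actions=2, r_max=1.0)
# a = agent.act(s); s_next, r = env.step(a); agent.update(s, a, r, s_next); s = s_next

In the PAC-MDP analysis, the knownness threshold m is chosen as a polynomial in the numbers of states and actions, 1/epsilon, 1/delta, and 1/(1 - gamma), which is what yields the sample-complexity bounds the paper summarizes.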


reference text

Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
John Asmuth, Michael L. Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 604–609. AAAI Press, 2008.
John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
Christopher G. Atkeson and Geoffrey J. Gordon, editors. Proceedings of the ICML-97 Workshop on Modelling in Reinforcement Learning, 1997. URL http://www.cs.cmu.edu/~ggordon/ml97ws/.
Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 21, pages 89–96, 2009.
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.
Ronen I. Brafman and Moshe Tennenholtz. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, and Nicholas Roy. CORL: A continuous-state offset-dynamics reinforcement learner. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 53–61, 2008. An extended version appears in the Journal of Machine Learning Research, volume 10, pages 1955–1988, 2009.
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990.
Carlos Diuk, Lihong Li, and Bethany R. Leffler. The adaptive k-meteorologists problem and its application to structure discovery and feature selection in reinforcement learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pages 249–256, 2009.
Michael O. Duff and Andrew G. Barto. Local bandit approximation for optimal learning problems. In Advances in Neural Information Processing Systems, volume 9, pages 1019–1025. The MIT Press, 1997.
Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2003.
Claude-Nicolas Fiechter. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 88–97. Association for Computing Machinery, 1994.
John C. Gittins. Multi-armed Bandit Allocation Indices. Wiley Interscience Series in Systems and Optimization. John Wiley & Sons Inc, Chichester, NY, 1989.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
Michael J. Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 740–747, 1999.
Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11, pages 996–1002. The MIT Press, 1999.
Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2–3):209–232, 2002.
Sven Koenig and Reid G. Simmons. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22(1–3):227–250, 1996.
Bethany R. Leffler, Michael L. Littman, Alexander L. Strehl, and Thomas J. Walsh. Efficient exploration with latent structure. In Robotics: Science and Systems, pages 81–88, 2005.
Bethany R. Leffler, Michael L. Littman, and Timothy Edmunds. Efficient reinforcement learning with relocatable action models. In Proceedings of the Twenty-Second Conference on Artificial Intelligence, pages 572–577, 2007.
Lihong Li. A Unifying Framework for Computational Reinforcement Learning Theory. PhD thesis, Rutgers University, New Brunswick, NJ, 2009.
Lihong Li, Michael L. Littman, and Christopher R. Mansley. Online exploration in least-squares policy iteration. In Proceedings of the Eighteenth International Conference on Agents and Multiagent Systems, pages 733–739, 2009.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.
Colin McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, volume 141 of London Mathematical Society Lecture Notes, pages 148–188. Cambridge University Press, 1989.
Rémi Munos and Andrew W. Moore. Rates of convergence for variable resolution schemes in optimal control. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 647–654, 2000.
Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287, 1999.
Ali Nouri and Michael L. Littman. Multi-resolution exploration in continuous spaces. In Advances in Neural Information Processing Systems 21, pages 1209–1216, 2009.
Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 697–704, 2006.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1994. ISBN 0-13-103805-2.
Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 857–864, 2005.
Alexander L. Strehl and Michael L. Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems 20, pages 1417–1424, 2008a.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008b.
Alexander L. Strehl, Lihong Li, and Michael L. Littman. Incremental model-based learners with formal learning-time guarantees. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 485–493, 2006a.
Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 881–888, 2006b.
Alexander L. Strehl, Carlos Diuk, and Michael L. Littman. Efficient structure learning in factored-state MDPs. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pages 645–650, 2007.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems 10, pages 1064–1070, 1998.
L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.
Thomas J. Walsh, István Szita, Carlos Diuk, and Michael L. Littman. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Technical Report HPL-2003-97R1, Hewlett-Packard Labs, 2003.