jmlr jmlr2012 jmlr2012-34 jmlr2012-34-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error; these bounds depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than their supremum. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
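The abstract describes DPP only at a high level. As an informal illustration of what an incremental, soft-max-based policy update of this kind can look like, the following Python sketch runs a preference-based update on a small randomly generated MDP. The MDP (P, r), the problem sizes, the inverse temperature eta, and the exact form of the soft-max operator are illustrative assumptions, not the paper's specification.

import numpy as np

# Toy, randomly generated MDP; sizes, rewards and transitions are illustrative
# assumptions and do not come from the paper.
n_states, n_actions, gamma, eta = 5, 2, 0.95, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, y]
r = rng.uniform(size=(n_states, n_actions))                       # r[x, a]

def soft_max(psi, eta):
    """Boltzmann average of the action preferences psi, per state."""
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    pi = w / w.sum(axis=1, keepdims=True)
    return (pi * psi).sum(axis=1), pi   # aggregated value per state, soft-max policy

psi = np.zeros((n_states, n_actions))   # action preferences
for k in range(200):
    m_psi, policy = soft_max(psi, eta)
    # Incremental update: preferences accumulate advantage-like terms, so the
    # soft-max policy changes gradually instead of jumping to a greedy policy.
    psi = psi + r + gamma * (P @ m_psi) - m_psi[:, None]

print(policy)   # nearly deterministic for the better action in each state

Because the preferences accumulate advantage-like terms across iterations, the soft-max policy moves gradually rather than switching abruptly, which is in the spirit of the gradual policy change described in the abstract.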
References:
A. Antos, R. Munos, and Cs. Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, pages 9–16. MIT Press, 2008.
M. Gheshlaghi Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems 24, pages 2411–2419. MIT Press, 2012.
J. A. Bagnell and J. G. Schneider. Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1019–1024. Morgan Kaufmann, 2003.
P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.
J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(1):319–350, 2001.
D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, third edition, 2007.
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
C. Daniel, G. Neumann, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research - Proceedings Track, 22:273–281, 2012.
D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.
D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
A. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized fitted Q-iteration: Application to planning. In European Workshop on Reinforcement Learning, Lecture Notes in Computer Science, pages 55–68. Springer, 2008.
A. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized policy iteration. In Advances in Neural Information Processing Systems 21, pages 441–448. Curran Associates, Inc., 2009.
A. Farahmand, R. Munos, and Cs. Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems 23, pages 568–576. MIT Press, 2010.
J. Hoffmann-Jørgensen and G. Pisier. The law of large numbers and the central limit theorem in Banach spaces. The Annals of Probability, 4(4):587–599, 1976.
T. Jaakkola, M. I. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming. Neural Computation, 6(6):1185–1201, 1994.
T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14, pages 1531–1538. MIT Press, 2002.
H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.
M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 12, pages 996–1002. MIT Press, 1999.
J. Kober and J. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems 21, pages 849–856. MIT Press, 2009.
S. Koenig and R. G. Simmons. Complexity analysis of real-time reinforcement learning. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 99–105. AAAI Press, 1993.
V. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
H. Maei, Cs. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pages 719–726. Omnipress, 2010.
F. Melo, S. Meyn, and I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671. ACM Press, 2008.
R. Munos. Error bounds for approximate value iteration. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, pages 1006–1011. AAAI Press, 2005.
R. Munos and Cs. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pages 1595–1602. MIT Press, 2003.
J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008.
J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1607–1612. AAAI Press, 2010.
S. Singh, T. Jaakkola, M. L. Littman, and Cs. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
S. Still and D. Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
Cs. Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems 10, pages 1064–1070. MIT Press, 1998.
Cs. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
Cs. Szepesvári and W. Smart. Interpolation-based Q-learning. In Proceedings of the 21st International Conference on Machine Learning, pages 791–798. ACM Press, 2004.
I. Szita and Cs. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.
C. Thiery and B. Scherrer. Least-squares lambda policy iteration: Bias-variance trade-off in control problems. In Proceedings of the 27th International Conference on Machine Learning. Omnipress, 2010.
E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems 19, pages 1369–1376. MIT Press, 2007.
N. Vlassis and M. Toussaint. Model-free reinforcement learning as mixture learning. In Proceedings of the 26th International Conference on Machine Learning, pages 1081–1088. ACM Press, 2009.
T. Wang, M. Bowling, and D. Schuurmans. Dual representations for dynamic programming and reinforcement learning. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE Press, 2007.
T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Stable dual dynamic programming. In Advances in Neural Information Processing Systems 20, pages 1569–1576. MIT Press, 2008.
C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.