jmlr jmlr2005 jmlr2005-61 jmlr2005-61-reference knowledge-graph by maker-knowledge-mining

61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers


Source: pdf

Author: David Wingate, Kevin D. Seppi

Abstract: The performance of value and policy iteration can be dramatically improved by eliminating redundant or useless backups, and by backing up states in the right order. We study several methods designed to accelerate these iterative solvers, including prioritization, partitioning, and variable reordering. We generate a family of algorithms by combining several of the methods discussed, and present extensive empirical evidence demonstrating that performance can improve by several orders of magnitude for many problems, while preserving accuracy and convergence guarantees.

Keywords: Markov Decision Processes, value iteration, policy iteration, prioritized sweeping, dynamic programming
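The abstract's core idea, backing up states in order of expected usefulness rather than sweeping every state on every pass, can be illustrated with a small sketch. The Python below is not the authors' algorithm; it is a minimal, hedged example of Bellman-error prioritization in the spirit of prioritized sweeping (Moore and Atkeson, 1993), and every identifier (S, A, P, R, bellman_backup, and so on) is a hypothetical placeholder, not an interface from the paper.

import heapq

def prioritized_value_iteration(S, A, P, R, gamma=0.95, theta=1e-6):
    """Sketch of prioritized value iteration for a tabular MDP.

    Assumed (hypothetical) inputs:
      S            -- list of states
      A            -- list of actions
      P[s][a]      -- list of (next_state, probability) pairs
      R[s][a]      -- immediate reward for taking action a in state s
      gamma, theta -- discount factor and convergence threshold
    """
    V = {s: 0.0 for s in S}

    # Precompute predecessors: only they can gain Bellman error when s changes.
    predecessors = {s: set() for s in S}
    for s in S:
        for a in A:
            for s2, p in P[s][a]:
                if p > 0:
                    predecessors[s2].add(s)

    def bellman_backup(s):
        # One-step lookahead: best action value under the current estimate V.
        return max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                   for a in A)

    # Max-priority queue keyed on Bellman error (negated, since heapq is a min-heap).
    pq = []
    for s in S:
        err = abs(bellman_backup(s) - V[s])
        if err > theta:
            heapq.heappush(pq, (-err, s))

    while pq:
        _, s = heapq.heappop(pq)
        V[s] = bellman_backup(s)
        # Re-score only the predecessors of s; stale queue entries are harmless
        # because a repeated backup of an already-converged state is idempotent.
        for sp in predecessors[s]:
            err = abs(bellman_backup(sp) - V[sp])
            if err > theta:
                heapq.heappush(pq, (-err, sp))
    return V

The design choice to drive the queue from predecessors is what eliminates the redundant backups the abstract refers to: a backup of s can only change the Bellman error of states that can transition into s, so unaffected states are never re-queued.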


reference text

Charles J. Alpert. Multi-way graph and hypergraph partitioning. PhD thesis, University of California Los Angeles, Los Angeles, CA, 1996.

David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. Advances in Neural Information Processing Systems, 10:1001–1007, 1998.

Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.

Dimitri P. Bertsekas. Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610–616, 1982.

Dimitri P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107–120, 1983.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.

Thomas Dean and Robert Givan. Model minimization in Markov Decision Processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 106–111, 1997.

Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 1999.

Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.

Vijaykumar Gullapalli and Andrew G. Barto. Convergence of indirect adaptive asynchronous value iteration algorithms. Advances in Neural Information Processing Systems, 6:695–702, 1994.

George Karypis and Vipin Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48:96–129, 1998.

Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209–232, 2002.

Donald E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. ACM Press, New York, NY, 1993.

Harold J. Kushner and Paul Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time, Second Edition. Springer-Verlag, New York, NY, 2001.

Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. On the complexity of solving Markov Decision Problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, pages 394–402, 1995.

Raymond E. Miller and James W. Thatcher. Complexity of Computer Computations. Plenum Press, New York, NY, 1972.

Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.

Andrew W. Moore and Christopher G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state space. Machine Learning, 21:199–233, 1995.

Remi Munos and Andrew W. Moore. Variable resolution discretization in optimal control. Machine Learning, 49:291–323, 2002.

Jing Peng and Ronald J. Williams. Efficient learning and planning within the Dyna framework. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior, pages 437–454, 1993.

Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 226–232, 1994.

Philippe Preux. Propagation of Q-values in tabular TD(lambda). In Proceedings of the Thirteenth European Conference on Machine Learning, pages 369–380, 2002.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, NY, 1994.

Martin L. Puterman and Moon C. Shin. Modified policy iteration algorithms for discounted Markov Decision Problems. Management Science, 24:1127–1137, 1978.

Stuart I. Reynolds. Reinforcement Learning with Exploration. PhD thesis, University of Birmingham, Birmingham, United Kingdom, 2002.

Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University, Cambridge, United Kingdom, 1994.

Yousef Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing, Boston, MA, 1996.

Hermann A. Schwarz. Gesammelte Mathematische Abhandlungen, volume 2. Springer-Verlag, 1890.

Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8:1038–1044, 1996.

Ray S. Tuminaro, Mike Heroux, S. A. Hutchinson, and John N. Shadid. Official Aztec User's Guide: Version 2.1. Sandia National Laboratory, Albuquerque, NM, 1999.

Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, Boston, MA, 1993.

David Wingate and Kevin D. Seppi. Cache efficiency of priority metrics for MDP solvers. In AAAI Workshop on Learning and Planning in Markov Processes, pages 103–106, 2004a.

David Wingate and Kevin D. Seppi. P3VI: A partitioned, prioritized, parallel value iterator. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 863–870, 2004b.

Nevin L. Zhang, Stephen S. Lee, and Weihong Zhang. A method for speeding up value iteration in partially observable Markov Decision Processes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 696–703, 1999.

Nevin L. Zhang and Weihong Zhang. Speeding up the convergence of value iteration in partially observable Markov Decision Processes. Journal of Artificial Intelligence Research, 14:29–51, 2001.

Weiyu Zhu and Stephen Levinson. PQ-learning: An efficient robot learning method for intelligent behavior acquisition. In Proceedings of the Seventh International Conference on Intelligent Autonomous Systems, 2002.