
168 nips-2010-Monte-Carlo Planning in Large POMDPs


Source: pdf

Author: David Silver, Joel Veness

Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.
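
The abstract describes two Monte-Carlo components: a particle-based belief-state update and a UCT-style tree search over histories, both driven only by a black-box simulator. The sketch below illustrates that combination in Python under stated assumptions; the class and parameter names (Simulator, POMCPPlanner, n_simulations, the exploration constant c) are illustrative, and the sketch omits parts of the full POMCP algorithm (e.g. the separate rollout policy and particle reinvigoration), so it should not be read as the authors' reference implementation.

import math
import random
from collections import defaultdict

class Simulator:
    """Black-box generative model G(s, a) -> (s', o, r, done).
    Subclass this for a concrete POMDP; the planner never needs explicit
    transition or observation probabilities."""
    def actions(self):
        raise NotImplementedError
    def sample_start_state(self):
        raise NotImplementedError
    def step(self, state, action):
        raise NotImplementedError

class POMCPPlanner:
    def __init__(self, sim, n_simulations=1000, depth=20, gamma=0.95, c=1.0):
        self.sim = sim
        self.n_simulations = n_simulations
        self.depth = depth
        self.gamma = gamma
        self.c = c                      # UCB exploration constant
        self.N = defaultdict(int)       # visit count per (history, action)
        self.Q = defaultdict(float)     # mean return per (history, action)
        self.Nh = defaultdict(int)      # visit count per history
        # Belief state approximated by a set of particles (sampled states).
        self.particles = [sim.sample_start_state() for _ in range(n_simulations)]

    def plan(self, history=()):
        """Run Monte-Carlo tree search from the current belief and
        return the action with the highest estimated value."""
        for _ in range(self.n_simulations):
            state = random.choice(self.particles)   # sample a start state from the belief
            self._simulate(state, history, 0)
        return max(self.sim.actions(), key=lambda a: self.Q[(history, a)])

    def _simulate(self, state, history, d):
        if d >= self.depth:
            return 0.0
        self.Nh[history] += 1
        # UCB1 action selection; untried actions get infinite priority.
        def ucb(a):
            n = self.N[(history, a)]
            if n == 0:
                return float("inf")
            return self.Q[(history, a)] + self.c * math.sqrt(
                math.log(self.Nh[history]) / n)
        action = max(self.sim.actions(), key=ucb)
        next_state, obs, reward, done = self.sim.step(state, action)
        ret = reward
        if not done:
            ret += self.gamma * self._simulate(next_state, history + (action, obs), d + 1)
        # Incremental update of the mean return for this (history, action) node.
        self.N[(history, action)] += 1
        self.Q[(history, action)] += (ret - self.Q[(history, action)]) / self.N[(history, action)]
        return ret

    def update_belief(self, action, observation, max_tries=100000):
        """Monte-Carlo belief update: pass particles through the simulator
        and keep those whose sampled observation matches the real one."""
        new_particles, tries = [], 0
        while len(new_particles) < len(self.particles) and tries < max_tries:
            tries += 1
            s = random.choice(self.particles)
            s2, o, _, _ = self.sim.step(s, action)
            if o == observation:
                new_particles.append(s2)
        if new_particles:                # particle deprivation is not handled here
            self.particles = new_particles

In use, one would implement a concrete Simulator for the POMDP at hand and then alternate planner.plan(history) with planner.update_belief(action, observation) as real actions are executed and real observations arrive.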


reference text

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] D. Bertsekas and D. Castañon. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5(1):89–108, 1999.

[3] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In 5th International Conference on Computers and Games, pages 72–83, 2006.

[4] H. Finnsson and Y. Björnsson. Simulation-based approach to general game playing. In 23rd Conference on Artificial Intelligence, pages 259–264, 2008.

[5] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In 24th International Conference on Machine Learning, pages 273–280, 2007.

[6] L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[7] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

[8] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In 17th European Conference on Machine Learning, pages 282–293, 2006.

[9] H. Kurniawati, D. Hsu, and W. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.

[10] R. Lorentz. Amazons discover Monte-Carlo. In Computers and Games, pages 13–24, 2008.

[11] S. Ong, S. Png, D. Hsu, and W. Lee. POMDPs for robotic tasks with mixed observability. In Robotics: Science and Systems, 2009.

[12] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.

[13] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.

[14] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In 20th Conference on Uncertainty in Artificial Intelligence, 2004.

[15] G. Tesauro and G. Galperin. Online policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems 9, pages 1068–1074, 1996.