
33 nips-2003-Approximate Planning in POMDPs with Macro-Actions


Source: pdf

Author: Georgios Theocharous, Leslie P. Kaelbling

Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.
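To make the abstract's algorithmic idea concrete, the following is a minimal Python sketch, assuming access to a generative POMDP simulator, of a sampled (Monte Carlo style) Q-value backup over a fixed grid of belief points with temporally extended macro-actions. All names here (simulate_macro, nearest_grid_point, the step sizes) are hypothetical placeholders, not the authors' implementation; the paper's actual update rule and grid construction should be taken from the full text.

import random
from collections import defaultdict

GAMMA = 0.95   # discount factor (assumed, not from the paper)
ALPHA = 0.10   # step size for the running Q estimate (assumed)

Q = defaultdict(float)   # Q[(grid_index, macro_name)] -> estimated value

def nearest_grid_point(belief, grid):
    """Index of the grid belief closest (squared distance) to `belief`.
    Beliefs are dicts mapping state -> probability."""
    return min(range(len(grid)),
               key=lambda i: sum((belief.get(s, 0.0) - grid[i].get(s, 0.0)) ** 2
                                 for s in set(belief) | set(grid[i])))

def mc_macro_update(belief, grid, macros, simulate_macro, epsilon=0.1):
    """One sampled backup: choose a macro epsilon-greedily at the grid point
    nearest the current belief, roll the macro out in the simulator, and move
    Q toward the sampled discounted return plus the discounted value of the
    successor grid point (an SMDP-style update)."""
    g = nearest_grid_point(belief, grid)
    if random.random() < epsilon:
        macro = random.choice(macros)
    else:
        macro = max(macros, key=lambda m: Q[(g, m)])
    # simulate_macro is assumed to return the discounted reward accumulated
    # while the macro executed, the number of primitive steps it took, and
    # the belief at termination.
    reward, steps, next_belief = simulate_macro(belief, macro)
    g_next = nearest_grid_point(next_belief, grid)
    target = reward + (GAMMA ** steps) * max(Q[(g_next, m)] for m in macros)
    Q[(g, macro)] += ALPHA * (target - Q[(g, macro)])
    return next_belief

Backing up only when a macro terminates, rather than at every primitive step, is what makes the temporal abstraction pay off: each update spans many primitive actions, and only the belief grid points visited at macro boundaries need Q-values.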


reference text

[1] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.

[2] W. S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162–175, January-February 1991.

[3] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision processes. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 409–416, 1999.

Figure 7: The actual floor as it was designed in the Nomad 200 simulator, with junctions J1–J5, START, and GOAL. For the QMDP approach, the robot starts from START with a uniform initial belief. After reaching J2 the belief becomes bimodal, concentrating on J1 and J2, and the robot then keeps turning left and right. With our planning algorithm, the robot again starts from START with a uniform initial belief. Upon reaching J2 the belief becomes bimodal over J1 and J2. The agent resolves its uncertainty by deciding that the best action is the go-down-the-corridor macro, at which point it encounters J3 and localizes. The robot is then able to reach the goal by traveling from J3 to J2, J1, J4, and J5.
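For context on the caption above: the standard QMDP heuristic scores each primitive action by assuming the state becomes fully observable after one step,

    a_QMDP(b) = argmax_a  sum_s b(s) * Q_MDP(s, a),

so it assigns no value to actions whose only effect is to reduce uncertainty. That is why the QMDP-controlled robot oscillates at J2 once its belief is split between J1 and J2, while the macro-action policy can choose the corridor macro purely in order to localize.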

[4] S. Mahadevan, G. Theocharous, and N. Khaleeli. Fast concept learning for mobile robots. Machine Learning and Autonomous Robots Journal (joint issue), 31/5:239–251, 1998.

[5] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.

[6] C. Papadimitriou and J. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 1987.

[7] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, 2003.

[8] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

[9] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. In Advances in Neural Information Processing Systems, 2003.

[10] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.

[11] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[12] R. Zhou and E. A. Hansen. An improved grid-based approximation algorithm for POMDPs. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, WA, August 2001.