
174 nips-2011-Monte Carlo Value Iteration with Macro-Actions


Source: pdf

Author: Zhan Wei Lim, David Hsu, Wee Sun Lee

Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions.
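To make the abstract's point about generative models concrete, the following is a minimal illustrative sketch in Python, under assumptions of our own: a macro-action is treated as a black-box simulator that runs a toy POMDP forward until its own termination condition, and its value at a particle belief is estimated by Monte Carlo averaging. All names (GenerativeModel, MoveRightUntilGoal, mc_value_of_macro_action) and the toy corridor model are hypothetical and not from the paper, and the zero tail value omits the successor-belief term of the full MCVI backup.

# Illustrative sketch only; it mirrors the idea that Macro-MCVI needs only a
# generative (black-box) simulator for each macro-action, not an explicit
# probabilistic model. All names and the toy model are hypothetical.
import random

class GenerativeModel:
    """Toy 1-D corridor POMDP: state is an integer position, goal at position 10."""
    GOAL = 10

    def step(self, state, action):
        # action in {-1, +1}: noisy move; reward 1 on reaching the goal, else 0.
        next_state = state + action if random.random() < 0.9 else state
        reward = 1.0 if next_state == self.GOAL else 0.0
        observation = next_state + random.choice([-1, 0, 1])  # noisy position sensor
        return next_state, observation, reward

class MoveRightUntilGoal:
    """A macro-action: keep moving right until the goal is reached or a step cap is hit."""
    def simulate(self, model, state, discount, max_steps=20):
        total, gamma = 0.0, 1.0
        obs = None
        for _ in range(max_steps):
            state, obs, reward = model.step(state, +1)
            total += gamma * reward
            gamma *= discount
            if reward > 0.0:  # termination condition of the macro-action
                break
        return state, obs, total, gamma

def mc_value_of_macro_action(macro, model, belief_particles, discount, n_samples=100):
    """Monte Carlo estimate of the reward collected by running `macro` from a
    particle belief, with a zero tail value (the successor-belief term is omitted)."""
    total = 0.0
    for _ in range(n_samples):
        s = random.choice(belief_particles)
        _, _, reward, _ = macro.simulate(model, s, discount)
        total += reward
    return total / n_samples

if __name__ == "__main__":
    model = GenerativeModel()
    belief = [0, 1, 2, 3]  # particle approximation of the initial belief
    macro = MoveRightUntilGoal()
    print(mc_value_of_macro_action(macro, model, belief, discount=0.95))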


reference text

[1] H. Bai, D. Hsu, M. J. Kochenderfer, and W. S. Lee. Unmanned aircraft collision avoidance using continuous-state POMDPs. In Proc. Robotics: Science & Systems, 2011.

[2] H. Bai, D. Hsu, W. S. Lee, and V. Ngo. Monte Carlo value iteration for continuous-state POMDPs. In Algorithmic Foundations of Robotics IX: Proc. Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), pages 175–191. Springer, 2011.

[3] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13, 2003.

[4] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Research, 13:227–303, 2000.

[5] E. Hansen and R. Zhou. Synthesis of hierarchical finite-state controllers for POMDPs. In Proc. Int. Conf. on Automated Planning and Scheduling, 2003.

[6] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 220–229, 1998.

[7] R. He, E. Brunskill, and N. Roy. PUMA: Planning under uncertainty with macro-actions. In Proc. AAAI Conf. on Artificial Intelligence, 2010.

[8] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. Int. J. Robotics Research, 30(3):308–323, 2010.

[9] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science & Systems, 2008.

[10] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, pages 1025–1032, 2003.

[11] J. Pineau, N. Roy, and S. Thrun. A hierarchical approach to POMDP planning and execution. In Proc. ICML Workshop on Hierarchy & Memory in Reinforcement Learning, 2001.

[12] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa. Online planning algorithms for POMDPs. J. Artificial Intelligence Research, 32(1):663–704, 2008.

[13] A. Sivakumar and C. K. Y. Tan. UAV swarm coordination using cooperative control for establishing a wireless communications backbone. In Proc. Int. Conf. on Autonomous Agents & Multiagent Systems, pages 1157–1164, 2010.

[14] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 520–527. AUAI Press, 2004.

[15] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[16] G. Theocharous and L. P. Kaelbling. Approximate planning in POMDPs with macro-actions. Advances in Neural Information Processing Systems, 17, 2003.

[17] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Proc. Conf. on Uncertainty in Artificial Intelligence, 2008.

[18] C. C. White. Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem. Operations Research, 24(2):348–358, 1976.