Source: pdf
Author: Douglas Aberdeen
Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning, in the form of a policy-gradient method, to these domains. Our emphasis is on large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use.
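To make the factored approach concrete, the following is a minimal sketch, not the authors' implementation, of how per-task agents could be trained with a REINFORCE-style policy-gradient update: each task owns a small logistic policy deciding whether to start that task, and every agent is updated from the return of a shared simulated episode. The environment interface (reset/step), the feature vector, and the learning rate are hypothetical placeholders.

```python
import numpy as np

# Sketch only: one tiny "agent" per planning task, trained with a
# REINFORCE-style gradient on a shared episode return.  The planning
# environment (reset/step) is a hypothetical placeholder, not the
# FPG-Planner's actual simulator.

class TaskAgent:
    """One small policy per planning task: P(start task | observation)."""

    def __init__(self, n_features, lr=0.01, rng=None):
        self.theta = np.zeros(n_features)   # weights of the logistic policy
        self.lr = lr
        self.rng = rng or np.random.default_rng()

    def act(self, obs):
        p_start = 1.0 / (1.0 + np.exp(-obs @ self.theta))
        a = int(self.rng.random() < p_start)      # 1 = start task, 0 = wait
        # gradient of log pi(a | obs) for a Bernoulli-logistic policy
        grad_log = (a - p_start) * obs
        return a, grad_log

    def update(self, grad_sum, episode_return):
        # REINFORCE: step along (return x accumulated log-policy gradient)
        self.theta += self.lr * episode_return * grad_sum


def run_episode(env, agents):
    """Roll out one plan; each agent accumulates its own eligibility."""
    obs = env.reset()
    grads = [np.zeros_like(agent.theta) for agent in agents]
    total_reward, done = 0.0, False
    while not done:
        actions = []
        for i, agent in enumerate(agents):
            a, g = agent.act(obs)
            actions.append(a)
            grads[i] += g
        obs, reward, done = env.step(actions)    # hypothetical env API
        total_reward += reward
    for agent, g in zip(agents, grads):
        agent.update(g, total_reward)
    return total_reward
```

Because each agent conditions only on a shared observation and carries its own small parameter vector, the parameter count grows roughly linearly with the number of tasks, which is what lets a factored policy-gradient approach scale to the hundreds of tasks mentioned in the abstract.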
[1] L. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling. Learning to cooperate via policy search. In UAI, 2000.
[2] N. Tao, J. Baxter, and L. Weaver. A multi-agent, policy-gradient approach to network routing. In Proc. ICML'01. Morgan Kaufmann, 2001.
[3] J. Baxter, P. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. JAIR, 15:351–381, 2001.
[4] Mausam and D. S. Weld. Concurrent probabilistic temporal planning. In Proc. ICAPS'05, Monterey, CA, June 2005. AAAI.
[5] I. Little, D. Aberdeen, and S. Thiébaux. Prottle: A probabilistic temporal planner. In Proc. AAAI'05, 2005.
[6] H. L. S. Younes and R. G. Simmons. Policy generation for continuous-time stochastic domains with concurrency. In Proc. ICAPS'04, volume 14, 2004.
[7] D. Aberdeen, S. Thiébaux, and L. Zhang. Decision-theoretic military operations planning. In Proc. ICAPS, volume 14, pages 402–411. AAAI, June 2004.
[8] A.G. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 1995.
[9] A.Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. ICML’99, 1999.
[10] D. Aberdeen. The factored policy-gradient planner. Technical report, NICTA, 2005.