nips nips2001 nips2001-59 nips2001-59-reference knowledge-graph by maker-knowledge-mining

59 nips-2001-Direct value-approximation for factored MDPs


Source: pdf

Author: Dale Schuurmans, Relu Patrascu

Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker, usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time.
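The abstract describes a direct linear-programming approach combined with constraint generation. The following is a minimal, hypothetical Python sketch of approximate linear programming with constraint generation on a small explicit (non-factored) MDP, assuming a standard ALP objective (minimize the sum of approximate values subject to Bellman constraints). The paper's exact LP formulation and its factored constraint search are not reproduced here; all names, basis functions, and parameters below are illustrative only.

```python
# Hypothetical sketch: approximate LP with constraint generation on a
# small tabular MDP (not the paper's factored algorithm).
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 20, 3, 0.95
rng = np.random.default_rng(0)
# P[s, a, s'] = transition probabilities; R[s, a] = rewards in [0, 1]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))
# Two illustrative basis functions: a constant and the state index
phi = np.column_stack([np.ones(n_states), np.arange(n_states, dtype=float)])
k = phi.shape[1]
c = phi.sum(axis=0)          # objective: minimize sum_s V_w(s) = c . w

def violation(w, s, a):
    """Bellman-constraint violation at (s, a); positive means violated."""
    backup = R[s, a] + gamma * P[s, a] @ (phi @ w)
    return backup - phi[s] @ w

active = [(0, 0)]            # start from a single constraint
for _ in range(200):
    # Each constraint phi[s].w >= R[s,a] + gamma * P[s,a].(phi w)
    # is rewritten as (gamma * P[s,a] @ phi - phi[s]) . w <= -R[s,a].
    A = np.array([gamma * P[s, a] @ phi - phi[s] for s, a in active])
    b = np.array([-R[s, a] for s, a in active])
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(-50.0, 50.0)] * k)
    w = res.x
    # Constraint generation: add the most violated Bellman constraint.
    viols = np.array([[violation(w, s, a) for a in range(n_actions)]
                      for s in range(n_states)])
    s_star, a_star = np.unravel_index(np.argmax(viols), viols.shape)
    if viols[s_star, a_star] <= 1e-6:
        break                # all constraints (approximately) satisfied
    active.append((s_star, a_star))

print("fitted weights:", w)
```

Each iteration solves a small LP over the currently active constraints and then adds the most violated Bellman constraint; in the factored setting described in the paper, the search for a violated constraint would exploit the compact model structure rather than enumerate all state-action pairs as done here.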


reference text

[1] D. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.

[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 2000.

[4] J. Boyan. Least-squares temporal difference learning. In Proceedings ICML, 1999.

[5] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227-303, 2000.

[6] C. Guestrin, D. Koller, and R. Parr. Max-norm projection for factored MDPs. In Proceedings IJCAI, 2001.

[7] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proceedings IJCAI, 1999.

[8] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proceedings UAI, 2000.

[9] R. Martin. Large Scale Linear and Integer Optimization. Kluwer, 1999.

[10] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[11] B. Sallans and G. Hinton. Using free energies to represent Q-values in a multiagent reinforcement learning task. In Proceedings NIPS, 2000.

[12] R. St-Aubin, J. Hoey, and C. Boutilier. APRICODD: Approximating policy construction using decision diagrams. In Proceedings NIPS, 2000.

[13] B. Van Roy. Learning and value function approximation in complex decision processes. PhD thesis, MIT, EECS, 1998.

[14] R. Williams and L. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.