
68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning


Source: pdf

Author: Damien Ernst, Pierre Geurts, Louis Wehenkel

Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt, ut, rt, xt+1) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.

Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control

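To make the fitted value iteration scheme sketched in the abstract concrete, the following is a minimal Python sketch of fitted Q iteration on a set of four-tuples (xt, ut, rt, xt+1) with a small discrete action set. It uses scikit-learn's ExtraTreesRegressor as a stand-in for the paper's tree-based regressors; the library choice, function names, and parameter values (discount factor, number of iterations, ensemble size) are illustrative assumptions, not the authors' implementation.

# Minimal fitted Q iteration sketch: states x are numpy arrays, actions u are
# scalars drawn from a small discrete set. Hypothetical helper names;
# scikit-learn's ExtraTreesRegressor stands in for the paper's tree methods.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(four_tuples, actions, gamma=0.98, n_iterations=50):
    """four_tuples: list of (x_t, u_t, r_t, x_t+1); returns the final Q-model."""
    X = np.array([np.append(x, u) for x, u, r, xp in four_tuples])  # inputs (x, u)
    rewards = np.array([r for x, u, r, xp in four_tuples])
    next_states = np.array([xp for x, u, r, xp in four_tuples])

    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            targets = rewards                     # Q_1(x, u) = immediate reward
        else:
            # Q_N(x, u) = r + gamma * max over u' of Q_{N-1}(x', u')
            q_next = np.column_stack([
                q_model.predict(np.column_stack(
                    [next_states, np.full((len(next_states), 1), a)]))
                for a in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q_model

def greedy_action(q_model, x, actions):
    """Control policy derived from the approximated Q-function."""
    q_values = [q_model.predict(np.append(x, a).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(q_values))]

Each pass through the loop corresponds to computing Q_N from Q_{N-1} by solving one batch-mode supervised learning problem; substituting a single regression tree (CART) or bagged trees for the Extra-Trees ensemble gives the other tree-based variants mentioned in the abstract.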

reference text

D. Bagnell, S. Kakade, A. Y. Ng, and J. Schneider. Policy search by dynamic programming. In Proceedings of Neural Information Processing Systems, 2003.

L. C. Baird. Residual algorithms: reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 9–12, San Francisco, CA, July 1995. Morgan Kaufmann.

R. Bellman. Dynamic Programming. Princeton University Press, 1957.

R. Bellman, R. Kalaba, and B. Kotkin. Polynomial approximation - a new computational technique in dynamic programming: allocation processes. Mathematical Computation, 17:155–161, 1973.

J. A. Boyan. Technical update: least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.

J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating the value function. Advances in Neural Information Processing Systems, 7:369–376, 1995.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Some infinity theory for predictor ensembles. Technical Report 577, University of California, Department of Statistics, 2000.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International (California), 1984.

D. Ernst. Near Optimal Closed-Loop Control. Application to Electric Power Systems. PhD thesis, University of Liège, Belgium, March 2003.

D. Ernst, P. Geurts, and L. Wehenkel. Iteratively extending time horizon reinforcement learning. In N. Lavrač, L. Gamberger, and L. Todorovski, editors, Proceedings of the 14th European Conference on Machine Learning, pages 96–107, Dubrovnik, Croatia, September 2003. Springer-Verlag Heidelberg.

D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. To appear in Intelligent Automation and Soft Computing, 2005.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Submitted, 2004.

G. J. Gordon. Online fitted reinforcement learning. In VFA workshop at ML-95, 1995a.

G. J. Gordon. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pages 261–268, San Francisco, CA, 1995b. Morgan Kaufmann.

G. J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, June 1999.

O. Hernández-Lerma and B. Lasserre. Discrete-Time Markov Control Processes. Springer, New York, 1996.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267–274, 2002.

M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003a.

M. G. Lagoudakis and R. Parr. Reinforcement learning as classification: leveraging modern classifiers. In Proceedings of ICML 2003, pages 424–431, 2003b.

J. Langford and B. Zadrozny. Reducing T-step reinforcement learning to classification. Submitted, 2004.

L. J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, USA, 1993.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Technical Report 1005, Department of Statistics, University of Wisconsin, 2002.

D. G. Luenberger. Optimization by Vector Space Methods. Wiley, N.Y., 1969.

A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, Rochester, New York, 1996.

A. W. Moore and C. G. Atkeson. Prioritized sweeping: reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993.

A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3):199–233, 1995.

A. Y. Ng and M. Jordan. PEGASUS: a policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 406–415, 1999.

D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624–1636, 2002.

D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.

J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 463–471, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487–516, 1997.

S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 359–368, Cambridge, MA, 1995. MIT Press.

W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 903–910, 2000.

M. W. Spong. Swing up control of the Acrobot. In 1994 IEEE International Conference on Robotics and Automation, pages 2356–2361, San Diego, CA, May 1994.

R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44, 1988.

R. S. Sutton. Generalization in reinforcement learning: successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8:1038–1044, 1996.

R. S. Sutton and A. G. Barto. Reinforcement Learning, an Introduction. MIT Press, 1998.

J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine Learning, 22:59–94, 1996.

W. T. B. Uther and M. M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of AAAI-98, pages 769–774, 1998.

X. Wang and T. G. Dietterich. Efficient value function approximation using regression trees. In Proceedings of IJCAI-99 Workshop on Statistical Machine Learning for Large-Scale Optimization, Stockholm, Sweden, 1999.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

J. Yoshimoto, S. Ishii, and M. Sato. Application of reinforcement learning to balancing Acrobot. In Proceedings of the 1999 IEEE International Conference on Systems, Man and Cybernetics, pages 516–521, 1999.