jmlr jmlr2007 jmlr2007-41 jmlr2007-41-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Ghavamzadeh, Sridhar Mahadevan
Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior HRL work, including HAMs, options, MAXQ, and PHAMs, has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized as more appropriate than the discounted framework for a wide class of continuing tasks. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, in both discrete time and continuous time. These formulations correspond to two notions of optimality previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first is a relatively simple AGV scheduling task in which the hierarchically and recursively optimal policies differ. On this problem, we compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm. The second is a larger AGV scheduling task, which we model using both discrete-time and continuous-time formulations, with a hierarchical task decomposition under which the hierarchically and recursively optimal policies coincide. We compare the performance of the proposed algorithms on this task.
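The abstract invokes the average reward (gain) optimality criterion without stating it formally. As a sketch for orientation, the standard definitions from the average reward RL literature cited below (e.g., Puterman, 1994; Mahadevan, 1996) are given here; the symbols g^pi, r, and tau are not defined in this entry and follow common usage:

\[
g^{\pi}(s) \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\!\left[\, \sum_{t=0}^{N-1} r(s_t, a_t) \;\middle|\; s_0 = s,\ \pi \right]
\qquad \text{(discrete-time MDP)}
\]

\[
g^{\pi}(s) \;=\; \lim_{N \to \infty}
\frac{\mathbb{E}\!\left[\, \sum_{i=0}^{N-1} r_i \;\middle|\; s_0 = s,\ \pi \right]}
     {\mathbb{E}\!\left[\, \sum_{i=0}^{N-1} \tau_i \;\middle|\; s_0 = s,\ \pi \right]}
\qquad \text{(continuous-time SMDP, } \tau_i \text{ the sojourn time of transition } i\text{)}
\]

A gain-optimal policy maximizes g^pi over all policies. In the hierarchical setting described in the abstract, a hierarchically optimal policy achieves the best gain among all policies consistent with the task hierarchy as a whole, while a recursively optimal policy makes each subtask optimal given the policies of its children, which can yield a lower overall gain.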
Reference:
J. Abounadi, D. P. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40:681–698, 2001.
D. Andre. Programmable Reinforcement Learning Agents. PhD thesis, University of California at Berkeley, 2003.
D. Andre and S. J. Russell. Programmable reinforcement learning agents. In Proceedings of Advances in Neural Information Processing Systems 13, pages 1019–1025. MIT Press, 2001.
D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 119–125, 2002.
A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Systems (Special Issue on Reinforcement Learning), 13:41–77, 2003.
R. Bellman. Dynamic Programming. Princeton University Press, 1957.
S. Bradtke and M. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Proceedings of Advances in Neural Information Processing Systems 7, pages 393–400. MIT Press, 1995.
R. Crites and A. Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33:235–262, 1998.
P. Dayan and G. Hinton. Feudal reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems 5, pages 271–278, 1993.
T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
M. Ghavamzadeh and S. Mahadevan. Continuous-time hierarchical reinforcement learning. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 186–193, 2001.
M. Ghavamzadeh and S. Mahadevan. Hierarchically optimal average reward reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 195–202, 2002.
R. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
R. Howard. Dynamic Probabilistic Systems: Semi-Markov and Decision Processes. John Wiley and Sons, 1971.
L. Kaelbling. Hierarchical reinforcement learning: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167–173, 1993a.
L. Kaelbling. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1094–1098, 1993b.
S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159–196, 1996.
S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311–365, 1992.
S. Mahadevan, N. Khaleeli, and N. Marchalleck. Designing agent controllers using discrete-event Markov models. In Proceedings of the AAAI Fall Symposium on Model-Directed Autonomous Systems, 1997a.
S. Mahadevan, N. Marchalleck, T. Das, and A. Gosavi. Self-improving factory simulation using continuous-time average reward reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 182–190, 1997b.
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998.
A. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287, 1999.
A. Ng, H. Kim, M. Jordan, and S. Sastry. Autonomous helicopter flight via reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems 16. MIT Press, 2004.
R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, University of California at Berkeley, 1998.
D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000.
M. Puterman. Markov Decision Processes. Wiley Interscience, 1994.
A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298–305, 1993.
S. Seri and P. Tadepalli. Model-based hierarchical average-reward reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 562–569, 2002.
S. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8:323–339, 1992.
S. Singh and D. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Proceedings of Advances in Neural Information Processing Systems 9, pages 974–980, 1996.
R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
P. Tadepalli and D. Ok. Scaling up average reward reinforcement learning by approximating the domain models and the value function. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 471–479, 1996a.
P. Tadepalli and D. Ok. Auto-exploratory average reward reinforcement learning. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 881–887, 1996b.
P. Tadepalli and D. Ok. Model-based average reward reinforcement learning. Artificial Intelligence, 100:177–224, 1998.
G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215–219, 1994.
B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998.
G. Wang and S. Mahadevan. Hierarchical optimization of policy-coupled semi-Markov decision processes. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 464–473, 1999.
W. Zhang and T. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114–1120, 1995.