jmlr jmlr2006 jmlr2006-30 jmlr2006-30-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shimon Whiteson, Peter Stone
Abstract: Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly, and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice. Keywords: reinforcement learning, temporal difference methods, evolutionary computation, neuroevolution, on-line learning
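The abstract's notion of on-line evolutionary computation, borrowing selection mechanisms from TD methods to decide which member of an evolving population to evaluate next, can be illustrated with a small sketch. The Python snippet below is a minimal illustration, not code from the paper: the function name select_policy_epsilon_greedy, the use of running average returns, and the sample values are assumptions, and epsilon-greedy is only one of the TD-style selection mechanisms that could be borrowed this way.

import random

def select_policy_epsilon_greedy(avg_returns, epsilon=0.1):
    # avg_returns: running average episode return observed so far for each
    # candidate policy in the current generation (illustrative bookkeeping,
    # not the paper's exact data structures).
    # With probability epsilon, evaluate a randomly chosen candidate
    # (exploration); otherwise evaluate the candidate with the highest
    # average so far (exploitation), mirroring epsilon-greedy action
    # selection in TD methods.
    if random.random() < epsilon:
        return random.randrange(len(avg_returns))
    return max(range(len(avg_returns)), key=lambda i: avg_returns[i])

# Hypothetical usage: three candidates whose averages came from earlier
# episodes (e.g., negated step counts on mountain car, higher is better).
avg_returns = [-120.0, -95.0, -110.0]
print("Next episode evaluates candidate", select_policy_epsilon_greedy(avg_returns))

Under such a scheme, most evaluation episodes are spent on the strongest candidates, so reward accrues on-line while weaker candidates still receive enough episodes for their fitness estimates to be meaningful.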
D. Ackley and M. Littman. Interactions between learning and evolution. Artificial Life II, SFI Studies in the Sciences of Complexity, 10:487–509, 1991.
T. Arita and R. Suzuki. Interactions between learning and evolution: The outstanding strategy generated by the Baldwin Effect. Artificial Life, 7:196–205, 2000.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.
J. M. Baldwin. A new factor in evolution. The American Naturalist, 30:441–451, 1896.
T. Beielstein and S. Markon. Threshold selection, hypothesis tests and DOE methods. In 2002 Congress on Evolutionary Computation, pages 777–782, 2002.
R. E. Bellman. A problem in the sequential design of experiments. Sankhya, 16:221–229, 1956.
E. J. W. Boers, M. V. Borst, and I. G. Sprinkhuizen-Kuyper. Evolving artificial neural networks using the “Baldwin effect”. In Artificial Neural Nets and Genetic Algorithms, Proceedings of the International Conference in Ales, France, 1995.
J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 671–678. Morgan Kaufmann Publishers, Inc., 1994.
J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, 1995.
M. V. Butz and S. W. Wilson. An algorithmic description of XCS. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 6(3-4):144–153, 2002.
R. H. Crites and A. G. Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235–262, 1998.
D. Whitley, S. Gordon, and K. Mathias. Lamarckian evolution, the Baldwin effect and function optimization. In Parallel Problem Solving from Nature - PPSN III, pages 6–15, 1994.
N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108, 2004.
K. L. Downing. Reinforced genetic programming. Genetic Programming and Evolvable Machines, 2(3):259–288, 2001.
L. Ertoz, A. Lazarevic, E. Eilertson, P. Tan, P. Dokas, V. Kumar, and J. Srivastava. The MINDS - Minnesota Intrusion Detection System, chapter 3. MIT Press, 2004.
T. C. Fogarty. An incremental genetic algorithm for real-time learning. In Proceedings of the Sixth International Workshop on Machine Learning, pages 416–419, 1989.
R. French and A. Messinger. Genes, phenes and the Baldwin effect: Learning and evolution in a simulated population. Artificial Life, 4:277–282, 1994.
C. Giraud-Carrier. Unifying learning with evolution through Baldwinian evolution and Lamarckism: A case study. In Proceedings of the Symposium on Computational Intelligence and Learning (CoIL-2000), pages 36–41, 2000.
D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
D. E. Goldberg and J. Richardson. Genetic algorithms with sharing for multimodal function optimization. In Proceedings of the Second International Conference on Genetic Algorithms, pages 148–154, 1987.
F. Gomez, D. Burger, and R. Miikkulainen. A neuroevolution method for dynamic resource allocation on a chip multiprocessor. In Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, pages 2355–2361, 2001.
F. Gruau and D. Whitley. Adding learning to the cellular development of neural networks: Evolution and the Baldwin effect. Evolutionary Computation, 1:213–233, 1993.
F. Gruau, D. Whitley, and L. Pyeatt. A comparison between cellular encoding and direct encoding for genetic neural networks. In Genetic Programming 1996: Proceedings of the First Annual Conference, pages 81–89, 1996.
G. E. Hinton and S. J. Nowlan. How learning can guide evolution. Complex Systems, 1:495–502, 1987.
J. H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. University of Michigan Press, Ann Arbor, MI, 1975.
J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, January 2003.
N. Kohl and P. Stone. Machine learning for fast quadrupedal locomotion. In The Nineteenth National Conference on Artificial Intelligence, pages 611–616, July 2004.
V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 11, pages 1008–1014, 1999.
R. M. Kretchmar and C. W. Anderson. Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning. In International Conference on Neural Networks, 1997.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
P. L. Lanzi, W. Stolzmann, and S. Wilson. Learning Classifier Systems: From Foundations to Applications. Springer, 2000.
L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3-4):293–321, 1992.
C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the Association for Computing Machinery, 20(1):46–61, January 1973.
W. G. Macready and D. H. Wolpert. Bandit problems and the exploration/exploitation tradeoff. IEEE Transactions on Evolutionary Computation, 2(1):2–22, 1998.
S. Mahadevan. Samuel meets Amarel: Automating value function approximation using global state space analysis. In Proceedings of the Twentieth National Conference on Artificial Intelligence, 2005.
S. Mannor, R. Rubinstein, and Y. Gat. The cross-entropy method for fast policy search. In Proceedings of the Twentieth International Conference on Machine Learning, pages 512–519, 2003.
A. R. McCallum. Instance-based utile distinctions for reinforcement learning. In Proceedings of the Twelfth International Machine Learning Conference, 1995.
A. McGovern, E. Moss, and A. G. Barto. Building a block scheduler using reinforcement learning and rollouts. Machine Learning, 49(2-3):141–160, 2002.
P. McQuesten and R. Miikkulainen. Culling and teaching in neuro-evolution. In Thomas Bäck, editor, Proceedings of the Seventh International Conference on Genetic Algorithms, pages 760–767, 1997.
D. McWherter, B. Schroeder, N. Ailamaki, and M. Harchol-Balter. Priority mechanisms for OLTP and transactional web applications. In Proceedings of the Twentieth International Conference on Data Engineering, 2004.
A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 406–415. Morgan Kaufmann Publishers Inc., 2000.
Y. Niv, D. Joel, I. Meilijson, and E. Ruppin. Evolution of reinforcement learning in foraging bees: A simple explanation for risk averse behavior. Neurocomputing, 44(1):951–956, 2002.
S. Nolfi, J. L. Elman, and D. Parisi. Learning and evolution in neural networks. Adaptive Behavior, 2:5–28, 1994.
F. B. Pereira and E. Costa. Understanding the role of learning in the evolution of busy beaver: A comparison between the Baldwin Effect and a Lamarckian strategy. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001.
L. D. Pyeatt and A. E. Howe. Decision tree function approximation in reinforcement learning. In Proceedings of the Third International Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models, pages 70–77, 2001.
N. J. Radcliffe. Genetic set recombination and its application to neural network topology optimization. Neural Computing and Applications, 1(1):67–90, 1993.
M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 317–328, 2005.
F. Rivest and D. Precup. Combining TD-learning with cascade-correlation networks. In Proceedings of the Twentieth International Conference on Machine Learning, pages 632–639. AAAI Press, 2003.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, pages 318–362. 1986.
J. Santamaria, R. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998.
T. Sasaki and M. Tokoro. Evolving learnable neural networks under changing environments with various rates of inheritance of acquired characters: Comparison between Darwinian and Lamarckian evolution. Artificial Life, 5(3):203–223, 1999.
A. A. Sherstov and P. Stone. Function approximation via tile coding: Automating parameter choice. In J.-D. Zucker and L. Saitta, editors, SARA 2005, volume 3607 of Lecture Notes in Artificial Intelligence, pages 194–205. Springer-Verlag, Berlin, 2005.
W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 903–910, 2000.
A. J. Smith. Applications of the self-organizing map to reinforcement learning. Neural Networks, 15:1107–1124, 2002.
P. Stagge. Averaging efficiently in the presence of noise. In Parallel Problem Solving from Nature, volume 5, pages 188–197, 1998.
K. O. Stanley, B. D. Bryant, and R. Miikkulainen. Evolving adaptive neural networks with and without adaptive synapses. In Proceedings of the 2003 Congress on Evolutionary Computation (CEC 2003), volume 4, pages 2557–2564, 2003.
K. O. Stanley, B. D. Bryant, and R. Miikkulainen. Evolving neural network agents in the NERO video game. In Proceedings of the IEEE 2005 Symposium on Computational Intelligence and Games, 2005.
K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
K. O. Stanley and R. Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004a.
K. O. Stanley and R. Miikkulainen. Evolving a roving eye for Go. In Proceedings of the Genetic and Evolutionary Computation Conference, 2004b.
R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 1038–1044, 1996.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
J. A. van Mieghem. Dynamic scheduling with convex delay costs: The generalized cµ rule. The Annals of Applied Probability, 5(3):809–833, August 1995.
W. E. Walsh, G. Tesauro, J. O. Kephart, and R. Das. Utility functions in autonomic systems. In Proceedings of the International Conference on Autonomic Computing, pages 70–77, 2004.
C. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
S. Whiteson and P. Stone. Adaptive job routing and scheduling. Engineering Applications of Artificial Intelligence, 17(7):855–869, 2004. Corrected version.
J. Wildstrom, P. Stone, E. Witchel, R. J. Mooney, and M. Dahlin. Towards self-configuring hardware for distributed computer systems. In The Second International Conference on Autonomic Computing, pages 241–249, June 2005.
K. Yamasaki and M. Sekiguchi. Clear explanation of different adaptive behaviors between Darwinian population and Lamarckian population in changing environment. In Proceedings of the Fifth International Symposium on Artificial Life and Robotics, volume 1, pages 120–123, 2000.
X. Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence, pages 1114–1120, 1995.
M. Zweben and M. Fox, editors. Intelligent Scheduling. Morgan Kaufmann, 1998.