nips nips2009 nips2009-48 nips2009-48-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joel Veness, David Silver, Alan Blair, William Uther
Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by adjusting the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in the chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high-quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.
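The learning signal the abstract describes is simple enough to sketch: at every node of the search tree, move a linear heuristic h(s) = w · φ(s) towards the value returned by a deep alpha-beta search, via a squared-error gradient step. The Python below is a minimal illustrative sketch under that assumption; the names `features`, `tree_values`, and `update_from_search` are hypothetical placeholders, not the paper's implementation.

```python
def dot(w, phi):
    """Inner product of weight vector and feature vector."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def update_from_search(w, tree_values, features, alpha=0.01):
    """Nudge the linear heuristic at every node of the search tree
    towards that node's deep alpha-beta search value, using a
    squared-error gradient step on the weights (illustrative sketch)."""
    for state, v_search in tree_values:
        phi = features(state)           # phi(s): feature vector of the position
        delta = v_search - dot(w, phi)  # error against the deep search value
        w = [wi + alpha * delta * pi for wi, pi in zip(w, phi)]
    return w

# Toy usage: two "positions" with three features each, paired with the
# deep search values the heuristic should move towards.
if __name__ == "__main__":
    feats = {"s1": [1.0, 0.0, 1.0], "s2": [0.0, 1.0, 1.0]}
    w = [0.01, -0.02, 0.005]                     # small random-ish init
    nodes = [("s1", 0.3), ("s2", -0.1)]          # (state, search value) pairs
    w = update_from_search(w, nodes, feats.get)  # dict lookup as features(state)
    print(w)
```

Note that every tree node contributes an update, which is what distinguishes this scheme from single-node approaches such as TD-Leaf.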
Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: A chess program that learns by combining TD(lambda) with game-tree search. Proceedings of the 15th International Conference on Machine Learning (pp. 28–36). Morgan Kaufmann, San Francisco, CA.
Beal, D. F. (1990). A generalised quiescence search algorithm. Artificial Intelligence, 43, 85–98.
Beal, D. F., & Smith, M. C. (1997). Learning piece values using temporal differences. Journal of the International Computer Chess Association.
Buro, M. (1999). From simple features to sophisticated evaluation functions. First International Conference on Computers and Games (pp. 126–145).
Campbell, M., Hoane, A., & Hsu, F. (2002). Deep Blue. Artificial Intelligence, 134, 57–83.
Gelly, S., & Silver, D. (2007). Combining online and offline knowledge in UCT. Proceedings of the 24th International Conference on Machine Learning (pp. 273–280).
Hauk, T., Buro, M., & Schaeffer, J. (2004). Rediscovering *-minimax search. Computers and Games (pp. 35–50).
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3.
Schaeffer, J. (1989). The history heuristic and alpha-beta search enhancements in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11, 1203–1212.
Schaeffer, J., Hlynka, M., & Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. IJCAI (pp. 529–534).
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Veness, J., & Blair, A. (2007). Effective use of transposition tables in stochastic game tree search. IEEE Symposium on Computational Intelligence and Games (pp. 112–116).