
108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search


Source: pdf

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems, because it avoids expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
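The abstract's key algorithmic idea is planning by Monte-Carlo tree search in which, instead of applying Bayes' rule at every node of the search tree, a complete MDP model is lazily sampled from the current posterior and reused for an entire simulation. The following toy Python sketch illustrates that root-sampling idea under assumptions that are not taken from the paper (a small discrete MDP, a Dirichlet posterior over transitions, a known reward table, a uniform rollout policy, and plain Monte-Carlo action evaluation rather than a full UCT tree); it is a minimal illustration, not the authors' BAMCP implementation.

import random
from collections import defaultdict

# Toy sketch of root sampling for Bayes-adaptive planning.
# All sizes and the reward function below are hypothetical.
N_STATES, N_ACTIONS, GAMMA, DEPTH, SIMULATIONS = 3, 2, 0.95, 15, 500

# Dirichlet counts of observed transitions (uniform prior of 1 per next state).
counts = defaultdict(lambda: [1.0] * N_STATES)  # key: (state, action)

def reward(s, a):
    # Hypothetical known reward: payoff only for action 0 in the last state.
    return 1.0 if (s, a) == (N_STATES - 1, 0) else 0.0

def sample_mdp():
    """Draw one full transition model from the current Dirichlet posterior."""
    model = {}
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            draws = [random.gammavariate(c, 1.0) for c in counts[(s, a)]]
            total = sum(draws)
            model[(s, a)] = [d / total for d in draws]
    return model

def rollout(model, s, depth):
    """Uniform-random rollout inside the sampled MDP; no belief updates here."""
    if depth == 0:
        return 0.0
    a = random.randrange(N_ACTIONS)
    s2 = random.choices(range(N_STATES), weights=model[(s, a)])[0]
    return reward(s, a) + GAMMA * rollout(model, s2, depth - 1)

def plan(root_state):
    """Evaluate root actions with one posterior sample per simulation (root sampling)."""
    value = [0.0] * N_ACTIONS
    visits = [0] * N_ACTIONS
    for _ in range(SIMULATIONS):
        model = sample_mdp()                  # lazily sample a concrete MDP
        a = random.randrange(N_ACTIONS)
        s2 = random.choices(range(N_STATES), weights=model[(root_state, a)])[0]
        ret = reward(root_state, a) + GAMMA * rollout(model, s2, DEPTH)
        visits[a] += 1
        value[a] += (ret - value[a]) / visits[a]  # incremental mean return
    return max(range(N_ACTIONS), key=lambda a: value[a])

if __name__ == "__main__":
    print("greedy root action:", plan(0))

In this sketch the posterior would only be updated after a real environment transition (by incrementing the Dirichlet count for the observed state-action-next-state triple), so Bayes' rule is applied once per actual step rather than once per simulated node, which is the saving the abstract refers to.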


reference text

[1] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 19–26, 2009.

[2] J. Asmuth and M. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 19–26, 2011.

[3] R. Bellman and R. Kalaba. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1–9, 1959.

[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 23–37. Springer-Verlag, 2009.

[5] P. Castro and D. Precup. Smarter sampling in model-based Bayesian reinforcement learning. Machine Learning and Knowledge Discovery in Databases, pages 200–214, 2010.

[6] P.S. Castro. Bayesian exploration in Markov decision processes. PhD thesis, McGill University, 2007.

[7] P.A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 67–74, 2007.

[8] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence, pages 761–768, 1998.

[9] M.O.G. Duff. Optimal Learning: Computational Procedures For Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.

[10] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–1039, 1960.

[11] N. Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems (NIPS), pages 417–423, 1999.

[12] S. Gelly, L. Kocsis, M. Schoenauer, M. Sebag, D. Silver, C. Szepesvári, and O. Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.

[13] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pages 273–280, 2007.

[14] J.C. Gittins, R. Weber, and K.D. Glazebrook. Multi-armed bandit allocation indices. Wiley, 1989.

[15] M. Kearns, Y. Mansour, and A.Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Volume 2, pages 1324–1331, 1999.

[16] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. Machine Learning: ECML 2006, pages 282–293, 2006.

[17] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520, 2009.

[18] J.J. Martin. Bayesian decision problems and Markov chains. Wiley, 1967.

[19] N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and backpropagation of uncertainty. Machine Learning, 35(2):117–154, 1999.

[20] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems (NIPS), pages 2164–2172, 2010.

[21] J. Sorg, S. Singh, and R.L. Lewis. Variance-based rewards for approximate Bayesian reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[22] M. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, 2000.

[23] C. Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[24] T.J. Walsh, S. Goschin, and M.L. Littman. Integrating sample-based planning and model-based reinforcement learning. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI), 2010.

[25] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine Learning, pages 956–963, 2005.