
104 jmlr-2011-X-Armed Bandits


Source: pdf

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.

Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
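To make the HOO strategy described in the abstract more concrete, below is a minimal Python sketch of hierarchical optimistic optimization on X = [0, 1]. It follows only the high-level description (a binary covering tree, optimistic B-values, and a smoothness term nu * rho**depth): the function names, the uniform draw of an arm inside the selected region, and the exact confidence term are illustrative assumptions, not the authors' implementation.

import math
import random

class Node:
    """Node of the covering tree; it is responsible for the interval [lo, hi)."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count = 0              # times an arm inside this region was played
        self.mean = 0.0             # empirical mean reward of those plays
        self.B = float("inf")       # optimistic B-value; +infinity until sampled
        self.children = None        # None while the node is a leaf

def hoo(payoff, n_rounds, nu=1.0, rho=0.5):
    """Run a HOO-style strategy on X = [0, 1] for n_rounds pulls.

    payoff(x) must return a stochastic reward in [0, 1] for the arm x.
    nu and rho encode the assumed smoothness: the confidence bound of a
    node at depth h is inflated by nu * rho**h.
    """
    root = Node(0.0, 1.0, 0)
    total = 0.0
    for t in range(1, n_rounds + 1):
        # 1. Follow the path of maximal B-values from the root down to a leaf.
        path, node = [root], root
        while node.children is not None:
            node = max(node.children, key=lambda c: c.B)
            path.append(node)
        # 2. Expand that leaf and play an arm drawn from its region.
        mid = (node.lo + node.hi) / 2.0
        node.children = [Node(node.lo, mid, node.depth + 1),
                         Node(mid, node.hi, node.depth + 1)]
        reward = payoff(random.uniform(node.lo, node.hi))
        total += reward
        # 3. The observed reward counts for the played node and all its ancestors.
        for v in path:
            v.count += 1
            v.mean += (reward - v.mean) / v.count
        # 4. Recompute U- and B-values bottom-up along the visited path
        #    (unsampled children keep B = +infinity, so min(U, max B) = U there).
        for v in reversed(path):
            U = v.mean + math.sqrt(2.0 * math.log(t) / v.count) + nu * rho ** v.depth
            v.B = min(U, max(c.B for c in v.children))
    return total

# Toy usage: a noisy mean-payoff with a single maximum at x = 0.3.
if __name__ == "__main__":
    def f(x):
        return min(1.0, max(0.0, 1.0 - abs(x - 0.3) + random.gauss(0.0, 0.1)))
    print(hoo(f, 2000))

The sketch keeps the tree explicit, so a single pull costs time proportional to the depth of the selected path; the paper's basic strategy is quadratic in the number of time steps overall, and its doubling-trick variant brings this down to linearithmic time.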


reference text

J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: an efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory. Omnipress, 2008.

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054–1078, 1995a.

R. Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33:1926–1951, 1995b.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory, pages 454–468, 2007.

S. Bubeck and R. Munos. Open loop optimistic planning. In Proceedings of the 23rd Annual Conference on Learning Theory. Omnipress, 2010.

S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. Online optimization in X-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 201–208, 2009.

S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. Theoretical Computer Science, 412:1832–1852, 2011.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

G.M.J. Chaslot, M.H.M. Winands, H.J. van den Herik, J. Uiterwijk, and B. Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343–357, 2008.

E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. IEEE Transactions on Automatic Control, 54(6):1243–1253, 2009.

P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 67–74, 2007.

J. L. Doob. Stochastic Processes. John Wiley & Sons, 1953.

H. Finnsson and Y. Björnsson. Simulation-based approach to general game playing. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 259–264, 2008.

S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pages 273–280. ACM, 2007.

S. Gelly and D. Silver. Achieving master level play in 9×9 computer Go. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 1537–1540, 2008.

S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.

J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. Wiley, Chichester, NY, 1989.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, 2004.

R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008a.

R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces, September 2008b. URL http://arxiv.org/abs/0809.4882.
L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, pages 282–293, 2006.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

M.P.D. Schadd, M.H.M. Winands, H.J. van den Herik, and H. Aldewereld. Addressing NP-complete puzzles with Monte-Carlo methods. In Proceedings of the AISB 2008 Symposium on Logic and the Simulation of Interaction and Reasoning, volume 9, pages 55–61. The Society for the Study of Artificial Intelligence and Simulation of Behaviour, 2008.

Y. Yang. How powerful can any regression learning procedure be? In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, volume 2, pages 636–643, 2007.