nips nips2013 nips2013-330 nips2013-330-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models; however, the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allows us to give a finite-time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover, our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.
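The scheme analysed in the abstract is easiest to see in the Bernoulli special case, where the Jeffreys prior is the conjugate Beta(1/2, 1/2) and the posterior after S successes and F failures on an arm is Beta(1/2 + S, 1/2 + F). The sketch below is a minimal illustration of Thompson Sampling under that prior, not the authors' implementation; the function name thompson_jeffreys_bernoulli and the toy arm means are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_jeffreys_bernoulli(reward_fns, horizon):
    """Thompson Sampling for Bernoulli arms under the Jeffreys prior Beta(1/2, 1/2).

    reward_fns: list of callables, each returning a reward in {0, 1} when pulled.
    horizon:    number of rounds to play.
    Returns the sequence of chosen arms and observed rewards.
    """
    n_arms = len(reward_fns)
    successes = np.zeros(n_arms)  # count of 1-rewards per arm
    failures = np.zeros(n_arms)   # count of 0-rewards per arm
    choices, rewards = [], []
    for _ in range(horizon):
        # Draw one sample from each arm's posterior Beta(1/2 + S_a, 1/2 + F_a)
        samples = rng.beta(0.5 + successes, 0.5 + failures)
        arm = int(np.argmax(samples))   # play the arm with the largest posterior sample
        r = reward_fns[arm]()           # observe a Bernoulli reward
        successes[arm] += r
        failures[arm] += 1 - r
        choices.append(arm)
        rewards.append(r)
    return choices, rewards

# Hypothetical example: three Bernoulli arms with means 0.3, 0.5 and 0.7.
arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.7)]
choices, rewards = thompson_jeffreys_bernoulli(arms, horizon=1000)
print("empirical mean reward:", np.mean(rewards))
```

For a general 1-dimensional exponential family the same sample-then-argmax step applies, but the Jeffreys posterior is typically not conjugate and has to be sampled by other means.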
[1] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference On Learning Theory (COLT), 2012.
[2] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson Sampling. In Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
[3] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In 30th International Conference on Machine Learning (ICML), 2013.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University Press, 2013.
[6] S. Bubeck and Che-Yu Liu. A note on the Bayesian regret of Thompson Sampling with an arbitrary prior. arXiv:1304.5758, 2013.
[7] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.
[8] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conference On Learning Theory (COLT), 2011.
[9] J. Honda and A. Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Conference On Learning Theory (COLT), 2010.
[10] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, 186:453–461, 1946.
[11] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, Lecture Notes in Computer Science, pages 199–213. Springer, 2012.
[12] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[13] B.C. May, N. Korda, A. Lee, and D. Leslie. Optimistic Bayesian sampling in contextual bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.
[14] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. arXiv:1301.2609, 2013.
[15] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[16] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[17] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2010.