nips2011-32-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Title: An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, yet it is surprisingly unpopular in the literature. We present empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is very easy to implement, we argue that it should be part of the standard set of baselines to compare against.
[1] D. Agarwal, R. Agrawal, R. Khanna, and N. Kota. Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–222, 2010.
[2] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems 21, pages 17–24, 2008.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[4] J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. John Wiley & Sons, 1989.
[5] T. Graepel, J. Quinonero Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
[6] O.-C. Granmo. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207–234, 2010.
[7] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[8] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1):273–306, 2005.
[9] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[10] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.
[11] B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011. Submitted to the Annals of Applied Probability.
[12] B. C. May and D. S. Leslie. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.
[13] J. Sarkar. One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978–2002, 1991.
[14] S. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
[15] A. L. Strehl, J. Langford, L. Li, and S. M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 2217–2225, 2011.
[16] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
[17] K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. Smola. Feature hashing for large scale multitask learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning (ICML-09), 2009.