jmlr jmlr2012 jmlr2012-86 jmlr2012-86-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout.
Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
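The abstract describes the sampling mechanism only at a high level, so the following is a minimal illustrative sketch (not the authors' code) of Thompson sampling and an OBS-style variant for a Bernoulli bandit. It assumes independent Beta(1, 1) priors and interprets the OBS idea as taking, for each action, the larger of its posterior sample and its posterior-mean estimate, so that greater posterior uncertainty can only increase the chance of an action being played; the success probabilities in true_means are made up for the demonstration.

# Minimal sketch of Thompson sampling and an OBS-style rule for a Bernoulli
# bandit with Beta(1, 1) priors. Assumption: OBS scores each action by the
# larger of its posterior sample and its posterior mean, so extra uncertainty
# never lowers an action's chance of being selected.
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(true_means, horizon, optimistic=False):
    """Play a Bernoulli bandit for `horizon` steps; return total reward."""
    n_actions = len(true_means)
    successes = np.ones(n_actions)   # Beta posterior alpha parameters
    failures = np.ones(n_actions)    # Beta posterior beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(successes, failures)        # posterior samples
        means = successes / (successes + failures)     # exploitative estimates
        values = np.maximum(samples, means) if optimistic else samples
        a = int(np.argmax(values))
        reward = float(rng.random() < true_means[a])   # Bernoulli reward
        successes[a] += reward
        failures[a] += 1.0 - reward
        total_reward += reward
    return total_reward

true_means = [0.45, 0.55, 0.60]   # hypothetical arm success probabilities
print("Thompson sampling reward:", run_bandit(true_means, 10000))
print("OBS-style reward:        ", run_bandit(true_means, 10000, optimistic=True))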
References:
R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054–1078, 1995.
S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. arXiv:1111.1797v1, 2011.
J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proc. 25th Conf. on Uncertainty in Artificial Intelligence, pages 19–26, Corvallis, Oregon, 2009. AUAI Press.
J.-Y. Audibert and S. Bubeck. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876–1902, 2009.
J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010.
J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory (ALT), pages 150–165. Springer, 2007.
P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
L. Breiman. Probability. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
M. Brezzi and T. L. Lai. Incomplete learning from endogenous data in dynamic allocation. Econometrica, 68:1511–1516, 2000.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In 25th Annu. Conf. on Neural Information Processing Systems, NIPS 2011, Granada, Spain, 2011.
W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In 14th Internat. Conf. on Artificial Intelligence and Statistics, AISTATS, Fort Lauderdale, Florida, USA, 2011.
F. Eicker. Asymptotic normality and consistency of the least squares estimators for families of linear regressions. Annals of Mathematical Statistics, 34:447–456, 1963.
S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In 24th Annu. Conf. on Neural Information Processing Systems, pages 586–594. Curran Associates, Inc., 2010.
A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 24th Annu. Conf. on Learning Theory, COLT 2011, Budapest, Hungary, 2011.
J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41:148–177, 1979.
T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In Proc. of the 27th Internat. Conf. on Machine Learning, pages 13–20, Haifa, Israel, 2010. Omnipress.
O.-C. Granmo. A Bayesian learning automaton for solving two-armed Bernoulli bandit problems. In 7th Internat. Conf. on Machine Learning and Applications, San Diego, California, USA, 2008. IEEE Computer Society.
L. P. Kaelbling. Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15:279–298, 1994.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. 19th Internat. Conf. on World Wide Web, WWW 2010, pages 661–670, Raleigh, North Carolina, USA, 2010. ACM.
L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM Internat. Conf. on Web Search and Data Mining, WSDM ’11, pages 297–306, 2011.
M. L. Littman. A generalized reinforcement-learning model: Convergence and applications. In Proc. 13th Internat. Conf. on Machine Learning, pages 310–318. Morgan Kaufmann, 1996.
N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35:117–154, 1999.
N. G. Pavlidis, D. K. Tasoulis, and D. J. Hand. Simulation studies of multi-armed bandits with covariates. In Proc. 10th Internat. Conf. on Computer Modelling, Cambridge, UK, 2008.
S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 39:287–308, 2000.
A. Slivkins. Contextual bandits with similarity information. arXiv:0907.3986v3, 2011.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proc. 22nd Internat. Conf. on Machine Learning, pages 956–963, New York, NY, USA, 2005. ACM.
Yahoo! Academic Relations. Yahoo! Front Page Today Module user click log dataset, version 1.0, 2011. URL http://webscope.sandbox.yahoo.com.
Y. Yang and D. Zhu. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, 30:100–121, 2002.
D. Zwillinger. CRC Standard Probability and Statistics Tables and Formulae. CRC Press, 2000.