
128 nips-2011-Improved Algorithms for Linear Stochastic Bandits


Source: pdf

Author: Yasin Abbasi-Yadkori, Dávid Pál, Csaba Szepesvári

Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves, with high probability, constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.
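
To make the abstract's idea concrete, below is a minimal sketch of an optimism-in-the-face-of-uncertainty linear bandit algorithm of the kind the paper analyzes, written in Python with numpy. The function name oful_sketch, the arm set, the noise scale R, the regularizer lam, the norm bound S, and the confidence level delta are illustrative choices, not values or code from the paper; the confidence radius beta follows the self-normalized martingale bound the abstract refers to.

import numpy as np

def oful_sketch(arms, theta_star, T=1000, R=0.1, lam=1.0, S=1.0, delta=0.01, seed=0):
    """Optimistic linear bandit (sketch).

    arms: (n, d) array of feature vectors, one row per action.
    theta_star: (d,) unknown parameter, used here only to simulate rewards.
    """
    rng = np.random.default_rng(seed)
    n, d = arms.shape
    V = lam * np.eye(d)        # regularized design matrix V_t = lam*I + sum_s x_s x_s^T
    b = np.zeros(d)            # sum_s y_s x_s
    opt = (arms @ theta_star).max()
    regret = 0.0
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b  # regularized least-squares estimate
        # Confidence radius from the self-normalized tail inequality:
        #   beta_t = R * sqrt(log(det(V_t) / (lam^d * delta^2))) + sqrt(lam) * S
        logdet = np.linalg.slogdet(V)[1]
        beta = R * np.sqrt(logdet - d * np.log(lam) + 2.0 * np.log(1.0 / delta)) + np.sqrt(lam) * S
        # Optimistic index: <x, theta_hat> + beta * ||x||_{V^{-1}}
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + beta * widths)]
        y = x @ theta_star + R * rng.standard_normal()  # R-sub-Gaussian noise
        V += np.outer(x, x)
        b += y * x
        regret += opt - x @ theta_star
    return regret

# Example: 20 random unit-norm arms in dimension 5.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    arms = rng.standard_normal((20, 5))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    theta = rng.standard_normal(5)
    theta /= np.linalg.norm(theta)
    print("cumulative regret:", oful_sketch(arms, theta))

A smaller beta directly tightens the optimistic index, which is how the smaller confidence sets described in the abstract translate into lower regret.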


reference text

Y. Abbasi-Yadkori, A. Antos, and Cs. Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37:263–293, 2003.
A. Antos, V. Grover, and Cs. Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.
J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
P. Auer. Using upper confidence bounds for online learning. In FOCS, pages 270–279, 2000.
P. Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 2002.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. Online optimization in X-armed bandits. In NIPS, pages 201–208, 2008.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.
P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In UAI, 2007.
V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, COLT, pages 355–366, 2008.
V. H. de la Peña, M. J. Klass, and T. L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902–1933, 2004.
V. H. de la Peña, T. L. Lai, and Q.-M. Shao. Self-normalized Processes: Limit Theory and Statistical Applications. Springer, 2009.
O. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple teachers. In COLT, 2010.
D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. Technical report, LTCI, 2008.
R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, pages 1–28, 2008.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
T. L. Lai, H. Robbins, and C. Z. Wei. Strong consistency of least squares estimates in multiple regression. Proceedings of the National Academy of Sciences, 75(7):3034–3036, 1979.
L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010), pages 661–670. ACM, 2010.
V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML, pages 672–679, 2008.
H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
T. J. Walsh, I. Szita, C. Diuk, and M. L. Littman. Exploring compact reinforcement-learning representations with linear regression. In UAI, pages 591–598. AUAI Press, 2009.