nips nips2010 nips2010-203 nips2010-203-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite-time, high-probability bounds on the regret of the algorithm, extending previous analyses developed for linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization.
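The abstract describes an index policy that combines a quasi-likelihood estimate of the GLM parameter with an exploration bonus computed in the reward space. Below is a minimal, illustrative sketch of one selection round in that spirit, assuming a logistic inverse link, a simple Newton solver for the quasi-likelihood equations, and a generic exploration coefficient rho(t); the paper's actual confidence-bound constants and asymptotic tuning method are not reproduced here, and all names in the sketch are hypothetical.

import numpy as np

def mu(x):
    # Logistic inverse link (an assumption for this sketch; any GLM link could be used).
    return 1.0 / (1.0 + np.exp(-x))

def glm_ucb_step(played_features, rewards, arm_features, rho):
    # One GLM-UCB-style selection round (illustrative sketch, not the paper's exact algorithm).
    # played_features: (t, d) feature vectors of previously played arms
    # rewards:         (t,)   observed rewards
    # arm_features:    (K, d) feature vectors of the candidate arms
    # rho:             exploration coefficient rho(t) (a tuning choice)
    d = arm_features.shape[1]

    # Design matrix M_t = sum_s m_{a_s} m_{a_s}^T, lightly regularized for invertibility.
    M = played_features.T @ played_features + 1e-6 * np.eye(d)

    # Quasi-likelihood estimate: solve sum_s (R_s - mu(m_{a_s}^T theta)) m_{a_s} = 0
    # with a few Newton iterations (a simple stand-in for the estimation step).
    theta = np.zeros(d)
    for _ in range(25):
        p = mu(played_features @ theta)
        grad = played_features.T @ (rewards - p)
        W = p * (1.0 - p)
        H = played_features.T @ (played_features * W[:, None]) + 1e-6 * np.eye(d)
        theta += np.linalg.solve(H, grad)

    # Upper confidence index in the reward space:
    #   mu(m_a^T theta_hat) + rho * ||m_a||_{M^{-1}}
    M_inv = np.linalg.inv(M)
    bonus = rho * np.sqrt(np.einsum('kd,de,ke->k', arm_features, M_inv, arm_features))
    index = mu(arm_features @ theta) + bonus
    return int(np.argmax(index))

# Toy usage (hypothetical data): K = 5 arms in dimension d = 3.
# rng = np.random.default_rng(0)
# arms = rng.normal(size=(5, 3))
# hist_feats = arms[rng.integers(0, 5, size=20)]
# hist_rewards = rng.binomial(1, mu(hist_feats @ np.array([0.5, -0.2, 0.1])))
# next_arm = glm_ucb_step(hist_feats, hist_rewards, arms, rho=np.sqrt(2 * np.log(21)))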
[1] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] J. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. Lecture Notes in Computer Science, 4754:150, 2007.
[5] C.C. Wang, S.R. Kulkarni, and H.V. Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, 2005.
[6] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, pages 817–824, 2008.
[7] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. International Conference on Machine Learning, pages 721–728, 2007.
[8] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. Conference on Learning Theory, 2008.
[9] S.M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440–447. ACM, 2008.
[10] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[11] Y. Abbasi-Yadkori, A. Antos, and Cs. Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
[12] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[13] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[14] K. Chen, I. Hu, and Z. Ying. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Annals of Statistics, 27(4):1155–1163, 1999.
[15] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Advances in Neural Information Processing Systems, 14:601–608, 2002.
[16] V.H. de la Peña, M.J. Klass, and T.L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902–1933, 2004.
[17] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. arXiv preprint arXiv:0812.3465v2, 2008.