nips nips2011 nips2011-175 nips2011-175-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck
Abstract: We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., arms with a small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variants of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.
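The abstract only sketches the allocation rule, so the following is a minimal Python sketch under stated assumptions, not the authors' implementation: it assumes a GapE-style index of the form -gap_hat + sqrt(a / T) for each arm (the exact index, and the Bernstein-type variance term used by GapE-V, are given in the paper and in the technical report [8]); the function name gape_allocation, the exploration parameter a, and the toy bandits are hypothetical.

import math
import random


def gape_allocation(bandits, budget, a):
    """Allocate a budget of pulls across several bandits with a gap-based index.

    bandits : list of bandits, each a list of callables returning a reward sample
              (every bandit is assumed to have at least two arms).
    budget  : total number of pulls n.
    a       : exploration parameter (assumed; in the paper it depends on the
              problem complexity, which is typically unknown in advance).
    """
    counts = [[0] * len(b) for b in bandits]
    sums = [[0.0] * len(b) for b in bandits]

    # Initialisation: pull every arm once so all empirical means are defined.
    for m, bandit in enumerate(bandits):
        for k, arm in enumerate(bandit):
            sums[m][k] += arm()
            counts[m][k] += 1

    for _ in range(budget - sum(len(b) for b in bandits)):
        best_score, best_arm = -math.inf, None
        for m, bandit in enumerate(bandits):
            means = [sums[m][k] / counts[m][k] for k in range(len(bandit))]
            for k in range(len(bandit)):
                # Estimated gap of arm k: distance between its mean and the best
                # mean among the *other* arms of the same bandit.
                gap = abs(means[k] - max(means[j] for j in range(len(bandit)) if j != k))
                # Assumed GapE-style index: small gap and few pulls -> high priority.
                score = -gap + math.sqrt(a / counts[m][k])
                if score > best_score:
                    best_score, best_arm = score, (m, k)
        m, k = best_arm
        sums[m][k] += bandits[m][k]()
        counts[m][k] += 1

    # Recommend the empirically best arm in each bandit.
    return [max(range(len(bandit)), key=lambda k: sums[m][k] / counts[m][k])
            for m, bandit in enumerate(bandits)]


if __name__ == "__main__":
    random.seed(0)
    # Two toy bandits with Bernoulli arms of the given means.
    toy = [[lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.6)],
           [lambda p=p: float(random.random() < p) for p in (0.2, 0.8)]]
    print(gape_allocation(toy, budget=2000, a=4.0))  # likely prints [2, 1]

The sketch only illustrates the gap-based allocation idea from the abstract; GapE-V would replace the sqrt(a / T) term with a variance-dependent (empirical Bernstein) exploration term, and the adaptive variants estimate the complexity term online instead of taking a as given.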
[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory, pages 41–53, 2010.
[2] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. In Marcus Hutter, Rocco Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer Berlin / Heidelberg, 2007.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandit problems. In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pages 23–37, 2009.
[5] K. Deng, J. Pineau, and S. Murphy. Active learning for personalizing treatment. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011.
[6] C. Dimitrakakis and M. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3):157–171, 2008.
[7] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.
[8] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. Technical Report 00632523, INRIA, 2011.
[9] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pages 424–431, 2003.
[10] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6, 1993.
[11] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, 2009.
[12] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 672–679, 2008.
[13] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.