
56 nips-2011-Committing Bandits


Source: pdf

Author: Loc X. Bui, Ramesh Johari, Shie Mannor

Abstract: We consider a multi-armed bandit problem with two phases. The first is an experimentation phase, in which the decision maker is free to explore multiple options. In the second phase, the decision maker must commit to one of the arms and stick with it. Cost is incurred during both phases, with a higher cost during the experimentation phase. We analyze the regret in this setup, proposing algorithms and providing upper and lower bounds that depend on the ratio of the duration of the experimentation phase to that of the commitment phase. Our analysis reveals that, if given the choice, it is optimal to experiment for Θ(ln T) steps and then commit, where T is the time horizon.
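The following is a minimal Python sketch of the two-phase structure described in the abstract, not the paper's algorithm: a round-robin experimentation phase of roughly Θ(ln T) pulls per arm, followed by commitment to the empirically best arm for the rest of the horizon. The `pull` callback, the constant `c`, and the Bernoulli arm means in the usage example are hypothetical stand-ins.

```python
import math
import random


def committing_bandit(pull, num_arms, horizon, c=1.0):
    """Illustrative two-phase strategy: experiment, then commit.

    `pull(arm)` is assumed to return a stochastic reward in [0, 1].
    The exploration budget of about c * num_arms * ln(horizon) pulls and
    the uniform round-robin exploration are illustrative choices only.
    Assumes horizon >= num_arms so every arm is pulled at least once.
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    total_reward = 0.0

    # Phase 1: experimentation for Theta(ln T) rounds (round-robin over arms).
    explore_budget = min(horizon, max(num_arms, int(c * num_arms * math.log(horizon))))
    for t in range(explore_budget):
        arm = t % num_arms
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total_reward += r

    # Phase 2: commit to the empirically best arm and stick with it.
    best = max(range(num_arms), key=lambda a: sums[a] / counts[a])
    for _ in range(horizon - explore_budget):
        total_reward += pull(best)

    return best, total_reward


# Usage: two Bernoulli arms with hypothetical means 0.4 and 0.6.
means = [0.4, 0.6]
arm, reward = committing_bandit(lambda a: float(random.random() < means[a]),
                                num_arms=2, horizon=10_000)
print(arm, reward)
```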


reference text

[1] R. Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, 1995.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[3] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.

[5] P. A. Coquelin and R. Munos. Bandit algorithms for tree search. CoRR, abs/cs/0703062, 2007.

[6] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.

[8] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, pages 282–293, 2006.

[9] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[10] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2008.

[11] L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

[12] S. Mannor. k-armed bandit. In Encyclopedia of Machine Learning, pages 561–563. 2010.

[13] S. Mannor and J. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

[14] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.