NIPS 2008, Paper 140 (nips2008-140): Reference Knowledge Graph
Source: pdf
Author: Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal
Abstract: We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting, an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. The main motivation for our setting is online advertising, where ads have a limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings.
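To make the setting concrete, the Python sketch below simulates the state-oblivious mortal-bandit environment with a plain epsilon-greedy baseline. It is not the paper's algorithm, and the constants (number of live arms, geometric lifetime parameter, horizon, epsilon) and the uniform payoff prior are illustrative assumptions only.

```python
import random

# Minimal sketch of the state-oblivious mortal-bandit setting: each arm has an
# unknown payoff probability and a geometric lifetime; dead arms are replaced by
# fresh, unexplored arms. The policy is a simple epsilon-greedy baseline, not the
# paper's algorithm. All constants below are illustrative assumptions.

K = 50            # number of live arms at any time (assumption)
DEATH_P = 0.01    # per-step death probability, i.e. expected lifetime ~ 100 steps
HORIZON = 10_000
EPSILON = 0.1

def new_arm():
    # payoff probability drawn uniformly; (pulls, wins) track empirical statistics
    return {"p": random.random(), "pulls": 0, "wins": 0}

def empirical_mean(arm):
    # optimistic value for unexplored arms so the greedy step tries them first
    return arm["wins"] / arm["pulls"] if arm["pulls"] > 0 else float("inf")

random.seed(0)
arms = [new_arm() for _ in range(K)]
total_reward = 0

for t in range(HORIZON):
    # epsilon-greedy choice over the currently live arms
    if random.random() < EPSILON:
        chosen = random.randrange(K)
    else:
        chosen = max(range(K), key=lambda i: empirical_mean(arms[i]))

    arm = arms[chosen]
    reward = 1 if random.random() < arm["p"] else 0
    arm["pulls"] += 1
    arm["wins"] += reward
    total_reward += reward

    # every live arm dies independently with probability DEATH_P each step,
    # and dead arms are replaced by fresh, unexplored arms
    arms = [a if random.random() >= DEATH_P else new_arm() for a in arms]

print(f"average reward per step: {total_reward / HORIZON:.3f}")
```

Because dead arms are continually replaced by fresh ones, the learner keeps losing its accumulated statistics and must explore indefinitely, which is exactly the contrast with the standard k-armed model noted in the abstract.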
[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learning Research, 3:397–422, 2002.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[4] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997.
[5] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK, 1985.
[6] D. Bertsimas and J. Nino-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.
[7] A. Blum and Y. Mansour. From external to internal regret. In 18th COLT, pages 621–636, 2005.
[8] W. Feller. An Introduction to Probability Theory and Its Applications, Volume 2. Wiley, 1971.
[9] Y. Freund, R. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In 29th STOC, pages 334–343, 1997.
[10] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. In J. G. et al., editor, Progress in Statistics, pages 241–266. North-Holland, 1974.
[11] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
[12] R. Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, MIT, 2005.
[13] R. D. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. In 21st COLT, pages 425–436, 2008.
[14] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach. In 24th ICML, pages 449–456, 2007.
[15] J. Nino-Mora. Restless bandits, partial conservation laws and indexability. Adv. Appl. Prob., 33:76–98, 2001.
[16] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SDM, pages 216–227, 2007.
[17] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. In ICML, pages 721–728, 2007.
[18] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queueing network control. In 9th CCC, pages 318–322, 1994.
[19] A. Slivkins and E. Upfal. Adapting to a changing environment: The Brownian restless bandits. In 21st COLT, pages 343–354, 2008.
[20] O. Teytaud, S. Gelly, and M. Sebag. Anytime many-armed bandits. In CAP, 2007.
[21] T. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6:4–22, 1985.
[22] P. Whittle. Arm-acquiring bandits. The Annals of Probability, 9(2):284–292, 1981.
[23] P. Whittle. Restless bandits: Activity allocation in a changing world. J. Appl. Prob., 25A:287–298, 1988.