nips2011-177
Author: Aleksandrs Slivkins
Abstract: The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives ("arms"), and in each round it selects one arm and then observes the corresponding reward. While the case of a small number of arms is by now well understood, much recent work has focused on multi-armed bandits with (infinitely) many arms, where one needs to assume extra structure in order to make the problem tractable. In particular, in the Lipschitz MAB problem there is an underlying similarity metric space, known to the algorithm, such that any two arms that are close in this metric space have similar payoffs. In this paper we consider the more realistic scenario in which the metric space is implicit: it is defined by the available structure but not revealed to the algorithm directly. Specifically, we assume that the algorithm is given a tree-based classification of arms. For any given problem instance such a classification implicitly defines a similarity metric space, but the numerical similarity information is not available to the algorithm. We provide an algorithm for this setting whose performance guarantees (almost) match the best known guarantees for the corresponding instance of the Lipschitz MAB problem.
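To make the tree-induced similarity concrete, here is a minimal Python sketch (not the paper's algorithm): a taxonomy tree whose internal nodes carry a hypothetical "width" (the maximal spread in expected payoff among the arms in that subtree) induces a distance between two arms equal to the width of their least common ancestor. The Node class, the implicit_distance helper, and the numeric widths are illustrative assumptions; in the paper's setting such quantities are determined by the problem instance and are not revealed to the algorithm.

# Illustrative sketch only: how a taxonomy tree over arms implicitly defines
# a similarity metric. The "width" of a subtree stands for the maximal
# difference in expected payoff among its arms; these numbers are made up
# and, in the paper's setting, unknown to the algorithm.

class Node:
    def __init__(self, name, children=None, width=0.0):
        self.name = name
        self.children = children or []   # empty list means the node is a leaf (an arm)
        self.width = width               # hypothetical payoff spread within this subtree

def path_to(node, target, path):
    """Depth-first search; fills `path` with the nodes from `node` down to the leaf named `target`."""
    path.append(node)
    if node.name == target:
        return True
    for child in node.children:
        if path_to(child, target, path):
            return True
    path.pop()
    return False

def implicit_distance(root, arm_a, arm_b):
    """Distance between two arms = width of the subtree rooted at their least common ancestor."""
    pa, pb = [], []
    path_to(root, arm_a, pa)
    path_to(root, arm_b, pb)
    lca = root
    for x, y in zip(pa, pb):
        if x is y:
            lca = x
        else:
            break
    return lca.width

# A hypothetical two-level taxonomy over four arms (e.g., ad categories).
root = Node("all", width=0.9, children=[
    Node("sports", width=0.2, children=[Node("tennis"), Node("soccer")]),
    Node("tech",   width=0.3, children=[Node("phones"), Node("laptops")]),
])

print(implicit_distance(root, "tennis", "soccer"))   # 0.2: arms in the same small subtree
print(implicit_distance(root, "tennis", "laptops"))  # 0.9: arms related only at the root

Under this view, closeness in the tree bounds the payoff difference between arms, which is exactly the Lipschitz-style structure an algorithm must exploit without ever seeing the widths.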
[1] Rajeev Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33(6):1926–1951, 1995.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002. Preliminary version in 15th ICML, 1998.
[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002. Preliminary version in 36th IEEE FOCS, 1995.
[4] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved Rates for the Stochastic Continuum-Armed Bandit Problem. In 20th COLT, pages 454–468, 2007.
[5] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. of Computer and System Sciences, 74(1):97–114, February 2008. Preliminary version in 36th ACM STOC, 2004.
[6] Andrei Broder, Marcus Fontoura, Vanja Josifovski, and Lance Riedel. A semantic approach to contextual advertising. In 30th SIGIR, pages 559–566, 2007.
[7] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online Optimization in X-Armed Bandits. J. of Machine Learning Research (JMLR), 12:1587–1627, 2011. Preliminary version in NIPS 2008.
[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge Univ. Press, 2006.
[9] Eric Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. IEEE Trans. on Automatic Control, 54(6):1243–1253, 2009. A manuscript from 2004.
[10] Varsha Dani and Thomas P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In 17th ACM-SIAM SODA, pages 937–943, 2006.
[11] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The Price of Bandit Information for Online Optimization. In 20th NIPS, 2007.
[12] Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. In 16th ACM-SIAM SODA, pages 385–394, 2005.
[13] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In 24th ICML, 2007.
[14] Sylvain Gelly and David Silver. Achieving master level play in 9x9 computer go. In 23rd AAAI, 2008.
[15] Anupam Gupta, Robert Krauthgamer, and James R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In 44th IEEE FOCS, pages 534–543, 2003.
[16] Sham M. Kakade, Adam T. Kalai, and Katrina Ligett. Playing Games with Approximation Algorithms. In 39th ACM STOC, 2007.
[17] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th NIPS, 2004.
[18] Robert Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, MIT, 2005.
[19] Robert Kleinberg and Aleksandrs Slivkins. Sharp Dichotomies for Regret Minimization in Metric Spaces. In 21st ACM-SIAM SODA, 2010.
[20] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-Armed Bandits in Metric Spaces. In 40th ACM STOC, pages 681–690, 2008.
[21] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In 17th ECML, pages 282–293, 2006.
[22] T.L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4–22, 1985.
[23] H. Brendan McMahan and Avrim Blum. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary. In 17th COLT, pages 109–123, 2004.
[24] Rémi Munos and Pierre-Arnaud Coquelin. Bandit algorithms for tree search. In 23rd UAI, 2007.
[25] Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for Taxonomies: A Model-based Approach. In SDM, 2007.
[26] Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. Multi-armed Bandit Problems with Dependent Arms. In 24th ICML, 2007.
[27] Paul N. Bennett, Krysta Marie Svore, and Susan T. Dumais. Classification-enhanced ranking. In 19th WWW, pages 111–120, 2010.
[28] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In 25th ICML, pages 784–791, 2008.
[29] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Learning optimally diverse rankings over large document collections. In 27th ICML, pages 983–990, 2010.