jmlr2011-14: Better Algorithms for Benign Bandits
Authors: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision-making problems, where the goal is to select one of several possible decisions in every round and incur the cost associated with that decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference between these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only observes the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space and the cost functions are linear. Only recently was an efficient algorithm attaining Õ(√T) regret discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment, even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization, such as reservoir sampling.

Keywords: multi-armed bandit, regret minimization, online learning
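For concreteness, the regret and total variation the abstract refers to can be written out. The following is a standard formulation for bandit linear optimization; the symbols K, f_t, x_t and the exact normalization of Q are conventional choices for this line of work, not copied from the paper:

\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t^{\top} x_t \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t^{\top} x,
\qquad
Q \;=\; \sum_{t=1}^{T} \left\| f_t - \mu \right\|^2, \quad \mu = \frac{1}{T} \sum_{t=1}^{T} f_t,
\]

where K is the convex decision set, x_t ∈ K is the point played in round t, and f_t is the linear cost vector, observed only through the scalar cost f_t^T x_t. In a slowly changing environment Q can be much smaller than T, which is why an Õ(√Q) bound is stronger than the worst-case Õ(√T) bound.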
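The reservoir sampling mentioned in the abstract is Vitter's one-pass streaming technique (see the Vitter reference below). Here is a minimal Python sketch of the classic Algorithm R, given only as an illustration of the primitive, not of the paper's bandit-specific use of it:

import random

def reservoir_sample(stream, k):
    # Algorithm R (Vitter, 1985): maintain a uniform random sample of
    # size k from a stream of unknown length, using one pass and O(k) memory.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)    # uniform over {0, ..., i}
            if j < k:
                reservoir[j] = item     # item i survives with probability k/(i+1)
    return reservoir

# Example: sample 5 of the integers 0..999 uniformly at random.
print(reservoir_sample(range(1000), 5))

Each stream element ends up in the reservoir with probability exactly k/n, the property needed to maintain unbiased estimates from a stream of past observations without storing the whole history.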
References:

J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In The 21st Annual Conference on Learning Theory (COLT), 2008.
J. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res., 11:2785–2836, December 2010.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2003.
B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In STOC, pages 45–53, 2004.
A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2001.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Mach. Learn., 66(2-3):321–352, 2007.
V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937–943, 2006.
V. Dani, T. Hayes, and S. Kakade. The price of bandit information for online optimization. In NIPS, 2008.
A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA, pages 385–394, 2005.
J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97–139, 1957.
E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In The 21st Annual Conference on Learning Theory (COLT), 2008.
E. Hazan and S. Kale. Better algorithms for benign bandits. In SODA '09: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 38–47, Philadelphia, PA, USA, 2009a. Society for Industrial and Applied Mathematics.
E. Hazan and S. Kale. On stochastic and worst-case models for investing. In Advances in Neural Information Processing Systems (NIPS) 22, 2009b.
A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, March 1985.
H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In COLT, pages 109–123, 2004.
A. S. Nemirovskii. Interior point polynomial time methods in convex programming, 2004. Lecture notes.
Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, April 2001.
H. Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.
J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.