nips nips2004 nips2004-126 nips2004-126-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Robert D. Kleinberg
Abstract: In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of Rd , and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem. 1
[1] R. AGRAWAL . The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926-1951, 1995.
[2] P. AUER , N. C ESA -B IANCHI , AND P. F ISCHER . Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235-256, 2002.
[3] P. AUER , N. C ESA -B IANCHI , Y. F REUND , AND R. S CHAPIRE . Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of FOCS 1995.
[4] B. AWERBUCH AND R. K LEINBERG . Near-Optimal Adaptive Routing: Shortest Paths and Geometric Generalizations. In Proceedings of STOC 2004.
[5] N. BANSAL , A. B LUM , S. C HAWLA , AND A. M EYERSON . Online oblivious routing. In Proceedings of SPAA 2003: 44-49.
[6] A. B LUM , C. B URCH , AND A. K ALAI . Finely-competitive paging. In Proceedings of FOCS 1999.
[7] A. B LUM , S. C HAWLA , AND A. K ALAI . Static Optimality and Dynamic Search-Optimality in Lists and Trees. Algorithmica 36(3): 249-260 (2003).
[8] A. B LUM , V. K UMAR , A. RUDRA , AND F. W U . Online learning in online auctions. In Proceedings of SODA 2003.
[9] A. B LUM AND H. B. M C M AHAN . Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of COLT 2004.
[10] D. B ERRY AND L. P EARSON . Optimal Designs for Two-Stage Clinical Trials with Dichotomous Responses. Statistics in Medicine 4:487 - 508, 1985.
[11] E. C OPE . Regret and Convergence Bounds for Immediate-Reward Reinforcement Learning with Continuous Action Spaces. Preprint, 2004.
[12] A. F LAXMAN , A. K ALAI , AND H. B. M C M AHAN . Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient. To appear in Proceedings of SODA 2005.
[13] Y. F REUND AND R. S CHAPIRE . Adaptive Game Playing Using Multiplicative Weights. Games and Economic Behavior 29:79-103, 1999.
[14] R. G RAMACY, M. WARMUTH , S. B RANDT, AND I. A RI . Adaptive Caching by Refetching. In Advances in Neural Information Processing Systems 15, 2003.
[15] R. K LEINBERG AND T. L EIGHTON . The Value of Knowing a Demand Curve: Bounds on Regret for On-Line Posted-Price Auctions. In Proceedings of FOCS 2003.
[16] A. K ALAI AND S. V EMPALA . Efficient algorithms for the online decision problem. In Proceedings of COLT 2003.
[17] J. K IEFER AND J. W OLFOWITZ . Stochastic Estimation of the Maximum of a Regression Function. Annals of Mathematical Statistics 23:462-466, 1952.
[18] T. L. L AI AND H. ROBBINS . Asymptotically efficient adaptive allocations rules. Adv. in Appl. Math. 6:4-22, 1985.
[19] C. M ONTELEONI AND T. JAAKKOLA . Online Learning of Non-stationary Sequences. In Advances in Neural Information Processing Systems 16, 2004.
[20] M. ROTHSCHILD . A Two-Armed Bandit Theory of Market Pricing. Journal of Economic Theory 9:185-202, 1974.
[21] M. Z INKEVICH . Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of ICML 2003, 928-936.