jmlr jmlr2010 jmlr2010-97 jmlr2010-97-reference knowledge-graph by maker-knowledge-mining

97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

Source: pdf

Author: Jean-Yves Audibert, Sébastien Bubeck

Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efﬁcient and bandit label efﬁcient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a uniﬁed γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which deﬁnes a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus ﬁll in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modiﬁcation of the upper conﬁdence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efﬁcient, upper conﬁdence bound (UCB) policy, online learning, prediction with limited feedback.

reference text

C. Allenberg, P. Auer, L. Gy¨ rﬁ, and G. Ottucs´ k. Hannan consistency in on-line learning in case of o a unbounded losses under partial monitoring. In ALT, volume 4264 of Lecture Notes in Computer Science, pages 229–243. Springer, 2006. J.-Y. Audibert, R. Munos, and Cs. Szepesv´ ri. Exploration-exploitation trade-off using variance a estimates in multi-armed bandits. Theoretical Computer Science, 410:1876–1902, 2009. P. Auer. Using conﬁdence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE Computer Society Press, 1995. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235–256, 2002a. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b. K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967. N. Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences, 59(3):392–411, 1999. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997. N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efﬁcient prediction. IEEE: Transactions on Information Theory, 51:2152–2162, 2005. D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3:100–118, 1975. A. Gy¨ rgy and G. Ottucs´ k. Adaptive routing using expert advice. Computer Journal-Oxford, 49 o a (2):180–189, 2006. D. Helmbold and S. Panizza. Some label efﬁcient learning results. In Proceedings of the 10th annual conference on Computational learning theory, pages 218–230. ACM New York, NY, USA, 1997. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963. H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society, 58:527–535, 1952. 2836