nips nips2013 nips2013-231 nips2013-231-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir
Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $\widetilde{\Theta}(T^{2/3})$ regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.
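As a point of reference, the following is a brief sketch of the standard definition of policy regret in this setting (in the style of Arora, Dekel, and Tewari [3]); the notation below is illustrative and is not quoted from the paper itself:

    R_T \;=\; \sum_{t=1}^{T} f_t(X_1,\dots,X_t) \;-\; \min_{x} \sum_{t=1}^{T} f_t(x,\dots,x),

where $X_t$ is the player's action at round $t$ and $f_t$ is the loss function chosen by the (adaptive) adversary, which may depend on the player's past actions. A bounded memory adversary with memory $m$ is restricted to loss functions that depend only on the last $m+1$ actions, and a switching-cost adversary charges an extra unit of loss at every round where $X_t \neq X_{t-1}$. Standard (external) regret would instead compare against $\min_{x} \sum_{t} f_t(X_1,\dots,X_{t-1},x)$, which ignores how consistently playing $x$ would have changed the adversary's subsequent losses.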
[1] J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In COLT, 2009.
[2] R. Agrawal, M. V. Hegde, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899–906, 1988.
[3] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[5] A. Borodin and R. El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 1998.
[6] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
[7] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
[8] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[9] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3):321–352, 2007.
[10] V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.
[11] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[12] A. György and G. Neu. Near-optimal rates for limited-delay universal lossy source coding. In IEEE International Symposium on Information Theory, pages 2218–2222, 2011.
[13] T. Jun. A survey on the bandit problem with switching costs. De Economist, 152:513–541, 2004.
[14] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.
[15] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[16] O. Maillard and R. Munos. Adaptive bandits: Towards the best history-dependent strategy. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[17] H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the Seventeenth Annual Conference on Learning Theory, 2004.
[18] N. Merhav, E. Ordentlich, G. Seroussi, and M.J. Weinberger. Sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947–1958, 2002.
[19] C. Mesterharm. Online learning with delayed label feedback. In Proceedings of the Sixteenth International Conference on Algorithmic Learning Theory, 2005.
[20] R. Ortner. Online regret bounds for Markov decision processes with deterministic transitions. Theoretical Computer Science, 411(29–30):2684–2695, 2010.
[21] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. CoRR, abs/1209.2388, 2012.