nips nips2011 nips2011-210 nips2011-210-reference knowledge-graph by maker-knowledge-mining

210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

Source: pdf

Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette

Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The p scaling of our regret bound with the number of states (contexts) N goes as N I⇢t (S; A), where I⇢t (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm p uses all the side information, the regret bound scales as N ln K, where K is the number of actions (arms). However, if the side information I⇢t (S; A) is not fully used, the regret bound is signiﬁcantly tighter. In the extreme case, when I⇢t (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1

reference text

Peter Auer, Nicol` Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. o Machine Learning, 47, 2002a. Peter Auer, Nicol` Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit o problem. SIAM Journal of Computing, 32(1), 2002b. Arindam Banerjee. On Bayesian bounds. In Proceedings of the International Conference on Machine Learning (ICML), 2006. Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings on the International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 2011. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991. Leslie Pack Kaelbling. Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15, 1994. John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2007. David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1998. Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classiﬁcation. Journal of Machine Learning Research, 2002. Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 2010. Yevgeny Seldin, Nicol` Cesa-Bianchi, Peter Auer, Francois Laviolette, and John Shawe-Taylor. PAC-Bayeso ¸ Bernstein inequality for martingales and its application to multiarmed bandits. 2011a. In review. Preprint available at http://arxiv.org/abs/1110.6755. Yevgeny Seldin, Francois Laviolette, Nicol` Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian ¸ o inequalities for martingales. 2011b. In review. Preprint available at http://arxiv.org/abs/1110.6886. John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1997. John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998. Alexander L. Strehl, Chris Mesterharm, Michael L. Littman, and Haym Hirsh. Experience-efﬁcient learning in associative bandit problems. In Proceedings of the International Conference on Machine Learning (ICML), 2006. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Vassilis Cutsuridis, Amir Hussain, John G. Taylor, and Daniel Polani, editors, Perception-Reason-Action Cycle: Models, Algorithms and Systems. Springer, 2010. Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Allerton Conference on Communication, Control and Computation, 1999. Leslie G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27 (11), 1984. 9