nips nips2010 nips2010-183 nips2010-183-reference knowledge-graph by maker-knowledge-mining

183 nips-2010-Non-Stochastic Bandit Slate Problems

Source: pdf

Author: Satyen Kale, Lev Reyzin, Robert E. Schapire

Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efﬁcient algorithms which have regret O( T ), where the constant depends on the speciﬁc nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1

reference text

[1] A BERNETHY, J., H AZAN , E., AND R AKHLIN , A. Competing in the dark: An efﬁcient algorithm for bandit linear optimization. In COLT (2008), pp. 263–274. 8

[2] AUER , P., C ESA -B IANCHI , N., AND F ISCHER , P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.

[3] AUER , P., C ESA -B IANCHI , N., F REUND , Y., AND S CHAPIRE , R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 1 (2002), 48–77.

[4] AWERBUCH , B., AND K LEINBERG , R. Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74, 1 (2008), 97–114.

[5] BARTLETT, P. L., DANI , V., H AYES , T. P., K AKADE , S., R AKHLIN , A., AND T EWARI , A. High-probability regret bounds for bandit online linear optimization. In COLT (2008), pp. 335–342.

[6] B REGMAN , L. The relaxation method of ﬁnding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics 7 (1967), 200–217.

[7] B RUALDI , R. A., AND L EE , G. M. On the truncated assignment polytope. Linear Algebra and its Applications 19 (1978), 33–62.

[8] C ENSOR , Y., AND Z ENIOS , S. Parallel optimization. Oxford University Press, 1997.

[9] C ESA -B IANCHI , N., AND L UGOSI , G. Combinatorial bandits. In COLT (2009). ¨ ´

[10] G Y ORGY, A., L INDER , T., L UGOSI , G., AND OTTUCS AK , G. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (2007), 2369– 2403.

[11] H AZAN , E., AND K ALE , S. Better algorithms for benign bandits. In SODA (2009), pp. 38–47.

[12] H ELMBOLD , D. P., AND WARMUTH , M. K. Learning permutations with exponential weights. In COLT (2007), pp. 469–483.

[13] H ERBSTER , M., AND WARMUTH , M. K. Tracking the best linear predictor. Journal of Machine Learning Research 1 (2001), 281–309.

[14] KOOLEN , W. M., WARMUTH , M. K., AND K IVINEN , J. Hedging structured concepts. In COLT (2010).

[15] L AI , T., AND ROBBINS , H. Asymptotically efﬁcient adaptive allocation rules. Advances in Applied Mathematics 6 (1985), 4–22.

[16] M ENDELSOHN , N. S., AND D ULMAGE , A. L. The convex hull of sub-permutation matrices. Proceedings of the American Mathematical Society 9, 2 (Apr 1958), 253–254.

[17] U CHIYA , T., NAKAMURA , A., AND K UDO , M. Algorithms for adversarial bandit problems with multiple plays. In ALT (2010), pp. 375–389.

[18] WARMUTH , M. K., AND K UZMIN , D. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In In Proc. of NIPS (2006). 9