nips nips2010 nips2010-152 nips2010-152-reference knowledge-graph by maker-knowledge-mining

152 nips-2010-Learning from Logged Implicit Exploration Data

Source: pdf

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “ofﬂine” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

reference text

[1] Peter Auer, Nicol` C. Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed o bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[2] D. Horvitz and D. Thompson. A generalization of sampling without replacement from a ﬁnite universe. Journal of the American Statistical Association, 47, 1952.

[3] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large pomdps via reusable trajectories. In NIPS, 2000.

[4] Diane Lambert and Daryl Pregibon. More bang for their bucks: Assessing new features for online advertisers. In ADKDD 2007, 2007.

[5] John Langford, Alexander L. Strehl, and Jenn Wortman. Exploration scavenging. In ICML-08: Proceedings of the 25rd international conference on Machine learning, 2008.

[6] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, pages 817–824, 2008. 8

[7] Robert A. Lew. Bounds on negative moments. SIAM Journal on Applied Mathematics, 30(4):728–731, 1976.

[8] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web (WWW-10), pages 661–670, 2010.

[9] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased ofﬂine evaluation of contextualbandit-based news article recommendation algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM-11), 2011.

[10] Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95:135–143, 1998.

[11] Doina Precup, Rich Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In ICML, 2000. 9