nips nips2011 nips2011-61 nips2011-61-reference knowledge-graph by maker-knowledge-mining

61 nips-2011-Contextual Gaussian Process Bandit Optimization


Source: pdf

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize the performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round we receive context (about the experimental conditions, the query) and have to choose an action (parameters, documents). The key challenge is to trade off exploration, gathering data to estimate the mean payoff function over the context-action space, against exploitation, choosing the action deemed optimal based on the data gathered so far. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence-style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms ignoring context or using it naively.
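To make the upper-confidence selection rule concrete, below is a minimal sketch of a CGP-UCB-style decision step, assuming squared-exponential kernels on contexts and actions combined by a product (one way of mixing and matching composite kernels). The function names, length scales, noise level, and the fixed exploration weight beta are illustrative assumptions, not the paper's exact choices.

import numpy as np

def rbf(X, Y, lengthscale=1.0):
    # Squared-exponential kernel matrix between the row vectors of X and Y.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def joint_kernel(Z1, Z2, d_ctx):
    # Product kernel on the joint context-action space:
    # k((s, x), (s', x')) = k_S(s, s') * k_X(x, x'),
    # where the first d_ctx columns of each row hold the context s and the rest hold the action x.
    return rbf(Z1[:, :d_ctx], Z2[:, :d_ctx]) * rbf(Z1[:, d_ctx:], Z2[:, d_ctx:])

def gp_posterior(Z_train, y_train, Z_test, d_ctx, noise=1e-2):
    # Standard GP posterior mean and standard deviation at the test points.
    K = joint_kernel(Z_train, Z_train, d_ctx) + noise * np.eye(len(Z_train))
    Ks = joint_kernel(Z_train, Z_test, d_ctx)
    Kss = joint_kernel(Z_test, Z_test, d_ctx)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def cgp_ucb_select(context, actions, Z_train, y_train, beta=2.0):
    # Given the observed context, pick the action maximizing the upper
    # confidence bound mu + sqrt(beta) * sigma of the GP posterior.
    d_ctx = len(context)
    Z_test = np.hstack([np.tile(context, (len(actions), 1)), actions])
    if len(Z_train) == 0:  # no observations yet: any action is a valid first pick
        return actions[0]
    mu, sigma = gp_posterior(np.asarray(Z_train), np.asarray(y_train), Z_test, d_ctx)
    return actions[np.argmax(mu + np.sqrt(beta) * sigma)]

The product combination above is just one composite kernel; the paper also analyzes other combinations of context and action kernels (for example, additive ones), and replacing the product in joint_kernel with a sum would correspond to that variant.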


reference text

[1] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3, 2002.

[2] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2008.

[3] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.

[4] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[5] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.

[6] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.

[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Interscience, 1991.

[8] G. Wahba. Spline Models for Observational Data. SIAM, 1990.

[9] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.

[10] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, pages 944–949, 2007.

[11] C. Widmer, N. Toussaint, Y. Altun, and G. Rätsch. Inferring latent task structure for multitask learning by multiple kernel learning. BMC Bioinformatics, 11(Suppl 8:S5), 2010.

[12] B. Peters et al. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Computational Biology, 2(6):e65, 2006.

[13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6:4, 1985.

[14] V. Dani, T. P. Hayes, and S. Kakade. The price of bandit information for online optimization. In NIPS, 2007.

[15] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.

[16] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.

[17] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.

[18] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

[19] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. Online optimization in X-armed bandits. In NIPS, 2008.

[20] S. Grünewälder, J-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In AISTATS, 2010.

[21] Aleksandrs Slivkins. Contextual bandits with similarity information. Technical Report 0907.3986, arXiv, 2009.

[22] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010.

[23] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML, 2005.

[24] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In NIPS, 2008.

[25] Mauricio A. Álvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, 2010.

[26] E. Brochu, M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, UBC, 2009.

[27] Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Eurographics, 2010.