nips nips2012 nips2012-295 nips2012-295-reference knowledge-graph by maker-knowledge-mining

295 nips-2012-Risk-Aversion in Multi-armed Bandits

Source: pdf

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difﬁcult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we deﬁne two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

reference text

[1] András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411:2712–2728, June 2010.

[2] P Artzner, F Delbaen, JM Eber, and D Heath. Coherent measures of risk. Mathematical ﬁnance, (June 1996):1–24, 1999.

[3] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010.

[4] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identiﬁcation in multiarmed bandits. In Proceedings of the Twenty-third Conference on Learning Theory (COLT’10), 2010.

[5] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876– 1902, 2009.

[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.

[7] David B. Brown. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35:722–730, 2007.

[8] Eyal Even-Dar, Michael Kearns, and Jennifer Wortman. Risk-sensitive online learning. In Proceedings of the 17th international conference on Algorithmic Learning Theory (ALT’06), pages 199–213, 2006.

[9] Christian Gollier. The Economics of Risk and Time. The MIT Press, 2001.

[10] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[11] Pascal Massart. The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The Annals of Probability, 18(3):pp. 1269–1283, 1990.

[12] J Neumann and O Morgenstern. Theory of games and economic behavior. Princeton University, Princeton, 1947.

[13] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the AMS, 58:527–535, 1952.

[14] Antoine Salomon and Jean-Yves Audibert. Deviations of stochastic bandit regret. In Proceedings of the 22nd international conference on Algorithmic learning theory (ALT’11), pages 159–173, 2011.

[15] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-arm bandit. Technical Report hal-00750298, INRIA, 2012.

[16] Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. In Proceedings of the 19th Annual Conference on Learning Theory (COLT’06), pages 514–528, 2006. 9