
348 nips-2012-Tractable Objectives for Robust Policy Optimization


Source: pdf

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.
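The percentile-measure family described in the abstract generalizes common risk-sensitive objectives such as conditional value-at-risk (CVaR) [13]. As a minimal illustration of why such an objective differs from expected utility, and not a reproduction of the paper's algorithm, the Python sketch below scores two hypothetical policies by the mean of their worst alpha-fraction of utility samples drawn under sampled model parameters; the policy names and distributions here are invented for the example.

    import numpy as np

    def cvar(utilities, alpha):
        # Conditional value-at-risk: mean utility over the worst alpha-fraction
        # of sampled outcomes. Smaller alpha means stronger risk-aversion.
        u = np.sort(np.asarray(utilities))        # ascending: worst realizations first
        k = max(1, int(np.ceil(alpha * len(u))))  # size of the worst alpha-tail
        return u[:k].mean()

    rng = np.random.default_rng(0)
    risky = rng.normal(10.0, 6.0, size=10_000)  # utility samples under sampled parameters
    safe = rng.normal(8.0, 1.0, size=10_000)

    print(risky.mean() > safe.mean())           # True: risky wins in expectation
    print(cvar(risky, 0.1) < cvar(safe, 0.1))   # True: safe wins on the worst 10%

On the worst 10% of parameter realizations the safer policy dominates even though it loses in expectation, which is exactly the trade-off between expected utility and robustness that the abstract describes.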
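The abstract's final step frames the robust objective as a two-player, zero-sum game solved with a no-regret algorithm, in the spirit of counterfactual regret minimization [21]. As a toy illustration of the no-regret building block only (regret matching in a normal-form zero-sum game, not the paper's extensive-form construction), the sketch below shows both players' average strategies converging toward a minimax equilibrium.

    import numpy as np

    # Rock-paper-scissors payoff matrix for the row player (zero-sum game).
    A = np.array([[0., -1., 1.],
                  [1., 0., -1.],
                  [-1., 1., 0.]])

    def regret_matching(regrets):
        # Mix over actions in proportion to positive cumulative regret;
        # play uniformly when no action has positive regret.
        pos = np.maximum(regrets, 0.0)
        if pos.sum() > 0:
            return pos / pos.sum()
        return np.full(len(regrets), 1.0 / len(regrets))

    r1 = np.zeros(3); r2 = np.zeros(3)   # cumulative regrets
    s1 = np.zeros(3); s2 = np.zeros(3)   # summed strategies, for averaging
    for _ in range(20000):
        p1, p2 = regret_matching(r1), regret_matching(r2)
        s1 += p1; s2 += p2
        u1 = A @ p2                      # row player's expected payoff per action
        u2 = -(A.T @ p1)                 # column player's payoffs (zero-sum)
        r1 += u1 - p1 @ u1               # regret of each pure action vs. current mix
        r2 += u2 - p2 @ u2

    print(s1 / s1.sum(), s2 / s2.sum())  # both approach the equilibrium (1/3, 1/3, 1/3)

The same style of no-regret guarantee is what lets the paper's game-theoretic formulation approximate a robust policy with computation polynomial in the number of states and actions of the MDP.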


reference text

[1] Carlo Acerbi. Spectral Measures of Risk: a Coherent Representation of Subjective Risk Aversion. Journal of Banking and Finance, 2002.

[2] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[3] Erick Delage. Distributionally Robust Optimization in context of Data-driven Problems. PhD thesis, Stanford University, 2009.

[4] Erick Delage and Shie Mannor. Percentile Optimization in Uncertain Markov Decision Processes with Application to Efficient Exploration. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

[5] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.

[6] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. Proceedings of the 26th Conference on Artificial Intelligence (AAAI), 2012.

[7] Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall. Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[8] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo Sampling for Regret Minimization in Extensive Games. Advances in Neural Information Processing Systems (NIPS), 2009.

[9] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty. Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[10] Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance in value function estimation. Management Science, 53(2):308–322, February 2007.

[11] Susan A. Murphy and James R. McKay. Adaptive treatment strategies: an emerging approach for improving treatment effectiveness. Newsletter of the American Psychological Association Division 12, Section III: The Society for the Science of Clinical Psychology, 2003.

[12] Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53(5):780–798, 2005.

[13] R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26:1443–1471, 2002.

[14] A. J. Rush, M. Fava, S. R. Wisniewski, P. W. Lavori, M. H. Trivedi, H. A. Sackeim, M. E. Thase, A. A. Nierenberg, F. M. Quitkin, T. M. Kashner, D. J. Kupfer, J. F. Rosenbaum, J. Alpert, J. W. Stewart, P. J. McGrath, M. M. Biggs, K. Shores-Wilson, B. D. Lebowitz, L. Ritz, and G. Niederehe. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials, 25(1):119–142, 2004.

[15] Susan A. Shortreed, Eric Laber, Daniel J. Lizotte, T. Scott Stroup, Joelle Pineau, and Susan A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1-2):109–136, July 2011.

[16] T. Scott Stroup, J. P. McEvoy, M. S. Swartz, M. J. Byerly, I. D. Glick, J. M. Canive, M. McGee, G. M. Simpson, M. D. Stevens, and J. A. Lieberman. The National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE) project: schizophrenia trial design and protocol development. Schizophrenia Bulletin, 29(1):15–31, 2003.

[17] M. S. Swartz, D. O. Perkins, T. S. Stroup, J. P. McEvoy, J. M. Nieri, and D. D. Haal. Assessing clinical and functional outcomes in the clinical antipsychotic trials of intervention effectiveness (CATIE) schizophrenia trial. Schizophrenia Bulletin, 29(1):33–43, 2003.

[18] Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael Bowling. A practical use of imperfect recall. Proceedings of the Eighth Symposium on Abstraction, Reformulation and Approximation (SARA), 2009.

[19] Huan Xu and Shie Mannor. On robustness/performance tradeoffs in linear programming and Markov decision processes. Operations Research, 2005.

[20] Huan Xu and Shie Mannor. Distributionally Robust Markov decision processes. Advances in Neural Information Processing Systems (NIPS), 2010.

[21] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret Minimization in Games with Incomplete Information. Advances in Neural Information Processing Systems (NIPS), 2008.