
273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes


Source: pdf

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves regret bounds similar to those of the purely stochastic case.
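To make the setting concrete, the following is a minimal, illustrative sketch of a robust Bellman backup in the spirit of robust dynamic programming [Iyengar, 2005; Nilim and El Ghaoui, 2005]; it is not the paper's algorithm. States marked as adversarial are backed up with a worst-case choice over their reachable successors, while the remaining, purely stochastic states use an ordinary expected backup. All names (P, R, adversarial, gamma, n_iter) are illustrative assumptions.

import numpy as np

def robust_value_iteration(P, R, adversarial, gamma=0.95, n_iter=500):
    # P: (S, A, S) transition probabilities of the nominal stochastic dynamics.
    # R: (S, A) rewards.
    # adversarial: boolean mask of length S; True marks states whose outgoing
    # transitions may be chosen adversarially.
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                if adversarial[s]:
                    # Adversarial state: assume the transition lands in the least
                    # favorable successor reachable under P (worst-case backup).
                    reachable = P[s, a] > 0
                    worst_next = V[reachable].min() if reachable.any() else V.min()
                    Q[s, a] = R[s, a] + gamma * worst_next
                else:
                    # Purely stochastic state: standard expected Bellman backup.
                    Q[s, a] = R[s, a] + gamma * P[s, a] @ V
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)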


Reference text

[Brafman and Tennenholtz, 2002] Brafman, R. I. and Tennenholtz, M. (2002). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231.
[Bubeck and Slivkins, 2012] Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. Journal of Machine Learning Research - Proceedings Track, 23:42.1–42.23.
[Even-Dar et al., 2005] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a Markov decision process. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, Cambridge, MA.
[Even-Dar et al., 2009] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online Markov decision processes. Math. Oper. Res., 34(3):726–736.
[Iyengar, 2005] Iyengar, G. N. (2005). Robust dynamic programming. Math. Oper. Res., 30(2):257–280.
[Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 99:1563–1600.
[Mannor et al., 2012] Mannor, S., Mebel, O., and Xu, H. (2012). Lightning does not strike twice: Robust MDPs with coupled uncertainty. In ICML.
[Mannor et al., 2007] Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. (2007). Bias and variance approximation in value function estimates. Manage. Sci., 53(2):308–322.
[McDiarmid, 1989] McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, number 141 in London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press.
[Neu et al., 2012] Neu, G., György, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. Journal of Machine Learning Research - Proceedings Track, 22:805–813.
[Neu et al., 2010] Neu, G., György, A., Szepesvári, C., and Antos, A. (2010). Online Markov decision processes under bandit feedback. In NIPS, pages 1804–1812.
[Nilim and El Ghaoui, 2005] Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Oper. Res., 53(5):780–798.
[Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.
[Strens, 2000] Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950.
[Tewari and Bartlett, 2007] Tewari, A. and Bartlett, P. (2007). Bounded parameter Markov decision processes with average reward criterion. In Learning Theory, pages 263–277.
[Weissman et al., 2003] Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., and Weinberger, M. J. (2003). Inequalities for the L1 deviation of the empirical distribution. Technical report, Information Theory Research Group, HP Laboratories.
[Xu and Mannor, 2012] Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Math. Oper. Res., 37(2):288–300.
[Yu and Mannor, 2009] Yu, J. Y. and Mannor, S. (2009). Arbitrarily modulated Markov decision processes. In CDC, pages 2946–2953.
[Yu et al., 2009] Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Math. Oper. Res., 34(3):737–757.