
24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs


Source: pdf

Authors: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
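
To make the high-level description above concrete, the listing below is a minimal sketch (Python/NumPy) of a variance-penalized policy-gradient update on a toy tabular MDP. It is not the paper's algorithm: the paper devises actor-critic methods with critics estimating the required quantities, whereas this sketch uses crude single-episode Monte-Carlo, likelihood-ratio estimates, and every name and hyperparameter here (n_states, lam, actor_step, ...) is an illustrative assumption. The objective assumed is J(theta) = E[G] - lam * Var[G], where G is the discounted return of an episode.

    import numpy as np

    # Minimal sketch, NOT the paper's actor-critic algorithm: a Monte-Carlo,
    # variance-penalized policy-gradient update for a toy tabular MDP with a
    # softmax policy. Objective assumed: J(theta) = E[G] - lam * Var[G].
    # All dynamics, rewards, and hyperparameters below are illustrative.

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma, horizon = 5, 3, 0.95, 30
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over next states
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # immediate rewards

    def softmax(prefs):
        z = np.exp(prefs - prefs.max())
        return z / z.sum()

    def rollout(theta):
        """Sample one episode; return the discounted return G and the summed score function."""
        s, G, disc = 0, 0.0, 1.0
        score = np.zeros_like(theta)        # sum_t grad_theta log pi(a_t | s_t)
        for _ in range(horizon):
            pi = softmax(theta[s])
            a = rng.choice(n_actions, p=pi)
            g = -pi
            g[a] += 1.0                     # grad of log-softmax w.r.t. the preferences of state s
            score[s] += g
            G += disc * R[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=P[s, a])
        return G, score

    theta = np.zeros((n_states, n_actions))       # softmax policy parameters (actor)
    m = 0.0                                       # running estimate of E[G]
    lam, actor_step, mean_step = 0.5, 0.05, 0.1   # variance penalty and step-sizes (assumed values)
    for _ in range(3000):
        G, score = rollout(theta)
        m += mean_step * (G - m)                  # track the mean return on the faster timescale
        # grad J approx grad E[G] - lam * grad Var[G], using the single-sample estimates
        # grad E[G] approx G*score, grad E[G^2] approx G^2*score, grad (E[G])^2 approx 2*m*G*score.
        grad_J = (G - lam * (G * G - 2.0 * m * G)) * score
        theta += actor_step * grad_J              # ascend the risk-sensitive objective

    print("estimated mean discounted return:", round(m, 3))

The two step-sizes (mean_step larger than actor_step) loosely mirror the two-timescale structure typical of actor-critic schemes, where the averaged quantities are tracked faster than the policy parameters are updated.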


reference text

[1] D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.

[2] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45 (11):2471–2482, 2009.

[3] S. Bhatnagar, H. Prasad, and L.A. Prashanth. Stochastic Recursive Algorithms for Optimization, volume 434. Springer, 2013.

[4] V. Borkar. A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44:339–346, 2001.

[5] V. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27:294–311, 2002.

[6] H. Chen, T. Duncan, and B. Pasik-Duncan. A Kiefer-Wolfowitz algorithm with randomized differences. IEEE Transactions on Automatic Control, 44(3):442–453, 1999.

[7] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.

[8] J. Filar, L. Kallenberg, and H. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161, 1989.

[9] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.

[10] R. Howard and J. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

[11] V. Katkovnik and Y. Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 8:81–87, 1972.

[12] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[13] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 280–291, 2005.

[14] L.A. Prashanth and S. Bhatnagar. Reinforcement learning with average cost for adaptive control of traffic lights at intersections. In Proceedings of the Fourteenth International IEEE Conference on Intelligent Transportation Systems, pages 1640–1645. IEEE, 2011.

[15] L.A. Prashanth and S. Bhatnagar. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2):412–421, June 2011.

[16] L.A. Prashanth and S. Bhatnagar. Threshold tuning using stochastic optimization for graded signal control. IEEE Transactions on Vehicular Technology, 61(9):3865–3880, November 2012.

[17] L.A. Prashanth and M. Ghavamzadeh. Actor-Critic Algorithms for Risk-Sensitive MDPs. Technical report inria-00794721, INRIA, 2013.

[18] W. Sharpe. Mutual fund performance. Journal of Business, 39(1):119–138, 1966.

[19] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, pages 794–802, 1982.

[20] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

[21] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.

[22] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387–396, 2012.

[23] A. Tamar, D. Di Castro, and S. Mannor. Temporal difference methods for the variance of the reward to go. In Proceedings of the Thirtieth International Conference on Machine Learning, pages 495–503, 2013.

[24] H. Xu and S. Mannor. Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2):288–300, 2012.