nips nips2013 nips2013-257 nips2013-257-reference knowledge-graph by maker-knowledge-mining

257 nips-2013-Projected Natural Actor-Critic


Source: pdf

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability: their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.
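
To make the abstract's core idea concrete, the following is a minimal sketch (in Python, assuming NumPy) of a single projected natural-gradient update: precondition the ordinary gradient with the inverse Fisher information, take a step, and map the result back into the constraint set. The function names, step size, regularizer, and the Euclidean ball projection are illustrative assumptions only; the paper's PNAC algorithms instead derive the projection through the connection between natural gradient descent and mirror descent.

    import numpy as np

    def projected_natural_gradient_step(theta, grad, fisher, project,
                                        step_size=0.05, reg=1e-6):
        # Natural gradient: solve F x = grad rather than forming the inverse
        # explicitly; the small regularizer keeps the solve well conditioned.
        nat_grad = np.linalg.solve(fisher + reg * np.eye(theta.size), grad)
        # Unconstrained natural-gradient step, then project back onto the
        # constraint set (e.g., a known safe region of policy space).
        return project(theta - step_size * nat_grad)

    def project_to_ball(theta, radius=1.0):
        # Toy constraint set: Euclidean ball of the given radius. Note that
        # PNAC's projection is taken with respect to the Fisher metric, not
        # this plain Euclidean one.
        norm = np.linalg.norm(theta)
        return theta if norm <= radius else theta * (radius / norm)

Calling projected_natural_gradient_step(theta, grad, fisher, project_to_ball) returns an updated parameter vector that, by construction, stays inside the toy constraint set.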


reference text

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.

[2] K. J. Åström and T. Hägglund. PID Controllers: Theory, Design, and Tuning. ISA: The Instrumentation, Systems, and Automation Society, 1995.

[3] M. T. Söylemez, N. Munro, and H. Baki. Fast calculation of stabilizing PID controllers. Automatica, 39(1):121–126, 2003.

[4] C. L. Lynch and M. R. Popovic. Functional electrical stimulation. IEEE Control Systems Magazine, 28:40–50, 2008.

[5] E. K. Chadwick, D. Blana, A. J. van den Bogert, and R. F. Kirsch. A real-time 3-D musculoskeletal model for dynamic simulation of arm movements. IEEE Transactions on Biomedical Engineering, 56:941–948, 2009.

[6] K. Jagodnik and A. van den Bogert. A proportional derivative FES controller for planar arm movement. In 12th Annual Conference of the International FES Society, Philadelphia, PA, 2007.

[7] P. S. Thomas, M. S. Branicky, A. J. van den Bogert, and K. M. Jagodnik. Application of the actor-critic architecture to functional electrical stimulation control of a human arm. In Proceedings of the Twenty-First Innovative Applications of Artificial Intelligence, 2009.

[8] T. J. Perkins and A. G. Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3:803–832, 2003.

[9] H. Bendrahim and J. A. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22:283–302, 1997.

[10] A. Arapostathis, R. Kumar, and S. P. Hsu. Control of Markov chains with safety bounds. IEEE Transactions on Automation Science and Engineering, 2:333–343, October 2005.

[11] E. Arvelo and N. C. Martins. Control design for Markov chains under safety constraints: A convex approach. CoRR, abs/1209.2883, 2012.

[12] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.

[13] S. Kuindersma, R. Grupen, and A. G. Barto. Variational Bayesian optimization for runtime risk-sensitive control. In Robotics: Science and Systems VIII, 2012.

[14] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, Electrical Engineering and Computer Sciences, University of California at Berkeley, March 2010.

[16] S. Amari and S. Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 1213–1216, 1998.

[17] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

[18] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[19] S. Mahadevan and B. Liu. Sparse Q-learning with mirror descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2012.

[20] S. Mahadevan, S. Giguere, and N. Jacek. Basis adaptation for sparse nonlinear reinforcement learning. In Proceedings of the Conference on Artificial Intelligence, 2013.

[21] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.

[22] J. Nocedal and S. Wright. Numerical Optimization. Springer, second edition, 2006.

[23] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pages 1531–1538, 2002.

[24] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.

[25] T. Morimura, E. Uchibe, and K. Doya. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, 2005.

[26] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71:1180–1190, 2008.

[27] P. S. Thomas and A. G. Barto. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, 2012.

[28] T. Degris, P. M. Pilarski, and R. S. Sutton. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.

[29] P. S. Thomas. Bias in natural actor-critic algorithms. Technical Report UM-CS-2012-018, Department of Computer Science, University of Massachusetts at Amherst, 2012.

[30] D. Blana, R. F. Kirsch, and E. K. Chadwick. Combined feedforward and feedback control of a redundant, nonlinear, dynamic musculoskeletal system. Medical and Biological Engineering and Computing, 47:533–542, 2009.

[31] P. Deegan. Whole-Body Strategies for Mobility and Manipulation. PhD thesis, University of Massachusetts Amherst, 2010.

[32] S. R. Kuindersma, E. Hannigan, D. Ruiken, and R. A. Grupen. Dexterous mobility with the uBot-5 mobile manipulator. In Proceedings of the 14th International Conference on Advanced Robotics, 2009.