NIPS 2013, paper 348: abstract and references
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.
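To make the variational decomposition concrete, the following is a minimal sketch of the kind of bound the abstract refers to, written in our own notation rather than the paper's: r(\tau) denotes the total reward along a trajectory \tau, \pi_\theta(\tau) the trajectory distribution induced by the policy with parameters \theta, and q(\tau) a variational trajectory distribution. Under the common interpretation of exponentiated reward as a likelihood, Jensen's inequality gives

\[
\log \mathbb{E}_{\pi_\theta}\!\big[ e^{r(\tau)} \big]
= \log \int \pi_\theta(\tau)\, e^{r(\tau)}\, d\tau
\;\geq\; \mathbb{E}_{q(\tau)}\!\big[ r(\tau) \big]
+ \mathbb{E}_{q(\tau)}\!\big[ \log \pi_\theta(\tau) \big]
+ \mathcal{H}\!\big[ q \big].
\]

With \theta held fixed, maximizing this bound over q is a trajectory optimization problem, which is where DDP-style solvers [4, 17] can be used; with q held fixed, maximizing over \theta reduces to supervised learning of the policy on trajectories drawn from q, matching the alternation described in the abstract. This is a generic evidence-lower-bound sketch, not necessarily the paper's exact derivation.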
[1] A. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS), 2003.
[2] M. Deisenroth and C. Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.
[3] T. Furmston and D. Barber. Variational methods for reinforcement learning. In Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 9:241–248, 2010.
[4] D. Jacobson and D. Mayne. Differential Dynamic Programming. Elsevier, 1970.
[5] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In International Symposium on Aerospace/Defense Sensing, Simulation, and Control, 1997.
[6] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.
[7] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal. STOMP: stochastic trajectory optimization for motion planning. In International Conference on Robotics and Automation (ICRA), 2011.
[8] J. Kober and J. Peters. Learning motor primitives for robotics. In International Conference on Robotics and Automation (ICRA), 2009.
[9] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[10] S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.
[11] G. Neumann. Variational inference for policy search in changing situations. In International Conference on Machine Learning (ICML), 2011.
[12] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[13] K. Rawlik, M. Toussaint, and S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems, 2012.
[14] S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 15:627–635, 2011.
[15] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
[16] E. Todorov. Policy gradients in linearly-solvable MDPs. In Advances in Neural Information Processing Systems (NIPS), 2010.
[17] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005.
[18] E. Todorov and Y. Tassa. Iterative local dynamic programming. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2009.
[19] M. Toussaint. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), 2009.
[20] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Uncertainty in Artificial Intelligence (UAI), 2008.
[21] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123–130, 2009.
[22] K. Yin, K. Loken, and M. van de Panne. SIMBICON: simple biped locomotion control. ACM Transactions on Graphics, 26(3), 2007.
[23] B. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.