
165 nips-2013-Learning from Limited Demonstrations


Source: pdf

Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup

Abstract: We propose a Learning from Demonstration (LfD) algorithm that leverages expert data, even when the demonstrations are few or inaccurate. We achieve this by using both expert data and reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that the expert's suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task.
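
The abstract describes APID as approximate policy iteration whose optimization is guided by linear (large-margin) constraints built from expert state-action pairs. The sketch below illustrates that idea under several assumptions of ours, not the paper's exact formulation: a linear Q-function over made-up features, synthetic random data in place of real transitions and demonstrations, a unit margin, a trade-off weight `alpha`, and cvxpy as a stand-in for the CVX solver cited in [22].

```python
# Hedged sketch of the APID idea summarized in the abstract: one approximate
# policy evaluation step whose convex objective is augmented with large-margin
# constraints derived from expert demonstrations. All features, margins, and
# weights below are illustrative assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

d = 5            # feature dimension (assumed)
n_rl = 50        # trial-and-error (RL) transitions
n_exp = 10       # expert demonstration pairs
gamma = 0.95     # discount factor

# RL data: features of (s, a), features of (s', pi(s')), and observed rewards.
phi_sa = rng.normal(size=(n_rl, d))
phi_next = rng.normal(size=(n_rl, d))
rewards = rng.normal(size=n_rl)

# Expert data: features of (s, a_expert) and of a competing action (s, a).
phi_expert = rng.normal(size=(n_exp, d))
phi_other = rng.normal(size=(n_exp, d))

w = cp.Variable(d)       # linear Q-function parameters, Q(s, a) = w @ phi(s, a)
xi = cp.Variable(n_exp)  # slack variables for possibly inaccurate demonstrations

# Squared Bellman-residual objective for one evaluation step, plus slack penalty.
bellman_residual = phi_sa @ w - (rewards + gamma * (phi_next @ w))
alpha = 1.0              # trade-off between RL signal and expert constraints (assumed)
objective = cp.Minimize(cp.sum_squares(bellman_residual) + alpha * cp.sum(xi))

# Large-margin constraints: in each demonstrated state, the expert's action
# should score at least a unit margin higher than the competing action, up to slack.
constraints = [phi_expert @ w >= phi_other @ w + 1 - xi, xi >= 0]

cp.Problem(objective, constraints).solve()
print("learned Q-function weights:", w.value)
```

The slack variables are what allow a demonstration to be overridden when the reinforcement signal disagrees with it, which is consistent with the abstract's claim that APID tolerates few or suboptimal demonstrations.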


reference text

[1] S. Ross, G. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011. 1, 2, 6, 7, 8

[2] S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 2009. 1, 8

[3] B. Argall, M. Veloso, and B. Browning. Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robotics and Autonomous Systems, 59(3-4), 2011. 1, 8

[4] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. 1

[5] Cs. Szepesvári. Algorithms for Reinforcement Learning. Morgan Claypool Publishers, 2010. 1, 2

[6] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. 2, 6

[7] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011. 2

[8] A.-m. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized policy iteration. In NIPS 21, 2009. 2, 3

[9] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML, 2009. 2

[10] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In ICML, 2009. 2

[11] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-sample analysis of Lasso-TD. In ICML, 2011. 2

[12] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008. 3, 5

[13] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453–1484, 2006. 3

[14] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008. 3

[15] A.-m. Farahmand and Cs. Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011. 4

[16] R. Munos. Error bounds for approximate policy iteration. In ICML, 2003. 4

[17] A.-m. Farahmand, R. Munos, and Cs. Szepesvári. Error propagation for approximate policy and value iteration. In NIPS 23, 2010. 4

[18] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994. 4

[19] P.-M. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000. 4

[20] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005. 5

[21] T. Hester, M. Quinlan, and P. Stone. RTMBA: A real-time model-based reinforcement learning architecture for robot control. In ICRA, 2012. 5

[22] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012. 5

[23] S. Ross and J. A. Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010. 8

[24] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In AAMAS, 2012. 8

[25] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In IJCAI, 2007. 8

[26] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008. 8

[27] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In NIPS 15, 2002. 8

[28] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171– 203, 2011. 8

[29] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML, 2008. 8