nips nips2013 nips2013-165 nips2013-165-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Beomjoon Kim, Amir-massoud Farahmand, Joelle Pineau, Doina Precup
Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data even when the demonstrations are few or inaccurate. We achieve this by using both expert data and reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that the expert's suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task.
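As a rough illustration of the idea described in the abstract (not the authors' exact formulation, which is posed as a convex program and solved with CVX [22]), the sketch below shows one APID-style policy-evaluation step with a linear Q-function: an empirical Bellman-residual objective augmented with hinge penalties that push the expert's demonstrated action to dominate competing actions by a margin. All names (phi_sa, phi_expert, alpha, lam, ...) and the plain subgradient-descent solver are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of one APID-style iteration: regularized Bellman-residual
# minimization plus hinge-relaxed margin constraints on expert (state, action)
# pairs. Linear Q(s, a) = w . phi(s, a); solved here by subgradient descent
# rather than the convex solver used in the paper.
import numpy as np

def apid_iteration(phi_sa, phi_next_pi, rewards,
                   phi_expert, phi_expert_other,
                   gamma=0.99, lam=1e-2, alpha=0.5,
                   lr=1e-2, n_steps=2000):
    """One approximate policy-evaluation step with demonstration constraints.

    phi_sa          : (n, d) features of sampled (s, a) pairs from RL data
    phi_next_pi     : (n, d) features of (s', pi(s')) under the current policy
    rewards         : (n,)   immediate rewards
    phi_expert      : (m, d) features of expert pairs (s_i, a_i^E)
    phi_expert_other: (m, d) features of the best competing action at s_i
    alpha           : weight on the hinge penalty for violated expert margins
    """
    d = phi_sa.shape[1]
    w = np.zeros(d)
    for _ in range(n_steps):
        # Bellman residual: Q(s, a) - (r + gamma * Q(s', pi(s')))
        residual = phi_sa @ w - (rewards + gamma * (phi_next_pi @ w))
        grad = 2.0 * (phi_sa - gamma * phi_next_pi).T @ residual / len(rewards)
        # Hinge penalty encouraging Q(s_i, a_i^E) >= Q(s_i, a) + 1 (unit margin)
        margins = (phi_expert - phi_expert_other) @ w
        violated = margins < 1.0
        grad -= alpha * (phi_expert - phi_expert_other)[violated].sum(axis=0) \
                / max(len(margins), 1)
        grad += 2.0 * lam * w  # L2 regularization
        w -= lr * grad
    return w
```

With w in hand, the greedy policy acts by argmax over a of w . phi(s, a), and an outer loop alternates this evaluation step with policy improvement, in the spirit of least-squares policy iteration [6].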
[1] S. Ross, G. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011. 1, 2, 6, 7, 8
[2] S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 2009. 1, 8
[3] B. Argall, M. Veloso, and B. Browning. Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robotics and Autonomous Systems, 59(3-4), 2011. 1, 8
[4] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. 1
[5] Cs. Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010. 1, 2
[6] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4: 1107–1149, 2003. 2, 6
[7] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011. 2
[8] A.-m. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized policy iteration. In NIPS 21, 2009. 2, 3
[9] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML, 2009. 2
[10] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In ICML, 2009. 2
[11] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-sample analysis of Lasso-TD. In ICML, 2011. 2
[12] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008. 3, 5
[13] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453–1484, 2006. 3
[14] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008. 3
[15] A.-m. Farahmand and Cs. Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011. 4
[16] R. Munos. Error bounds for approximate policy iteration. In ICML, 2003. 4
[17] A.-m. Farahmand, R. Munos, and Cs. Szepesvári. Error propagation for approximate policy and value iteration. In NIPS 23, 2010. 4
[18] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994. 4
[19] P.-M. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000. 4
[20] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005. 5
[21] T. Hester, M. Quinlan, and P. Stone. RTMBA: A real-time model-based reinforcement learning architecture for robot control. In ICRA, 2012. 5
[22] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012. 5
[23] S. Ross and J. A. Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010. 8
[24] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In AAMAS, 2012. 8
[25] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In IJCAI, 2007. 8
[26] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008. 8
[27] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In NIPS 15, 2002. 8
[28] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203, 2011. 8
[29] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML, 2008. 8