Source: pdf
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ε, the difference between the value of the apprentice’s policy and the value of the expert’s policy is O(√ε). Further, we prove that this difference is only O(ε) when the expert’s policy is close to optimal. This latter result has an important practical consequence: not only does imitating a near-optimal expert yield a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good but demonstrations are expensive or difficult to obtain.
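The approach analyzed in the abstract is plain behavioral cloning: fit a classifier to the expert's state-action pairs and use that classifier as the apprentice's policy. Below is a minimal, self-contained sketch of that setup; the synthetic states, the noisy linear expert rule, and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic expert demonstrations (assumed for illustration): states are
# feature vectors, and a near-optimal expert picks action 1 whenever a
# noisy linear score is positive.
n_demos, d = 2000, 5
states = rng.normal(size=(n_demos, d))
true_w = rng.normal(size=d)
expert_actions = (states @ true_w + 0.1 * rng.normal(size=n_demos) > 0).astype(int)

# The apprentice's policy is simply the learned classifier applied to states.
clf = LogisticRegression().fit(states, expert_actions)

# Held-out classification error plays the role of epsilon in the paper's bounds:
# the apprentice's value is within O(sqrt(eps)) of the expert's value in general,
# and within O(eps) when the expert is near-optimal.
test_states = rng.normal(size=(500, d))
test_actions = (test_states @ true_w + 0.1 * rng.normal(size=500) > 0).astype(int)
eps = 1.0 - clf.score(test_states, test_actions)
print(f"estimated classification error eps = {eps:.3f}")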
[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[2] Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[3] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[4] Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.
[5] J. Zico Kolter, Pieter Abbeel, and Andrew Y. Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in Neural Information Processing Systems 20, 2008.
[6] Umar Syed and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[7] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[8] Stéphane Ross and J. Andrew Bagnell. Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[9] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining, 2003.
[10] J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, and Jeff Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems 15, 2003.
[11] John Langford and Bianca Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[12] Doron Blatt and Alfred Hero. From weighted classification to policy search. In Advances in Neural Information Processing Systems 18, pages 139–146, 2006.
[13] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.
[14] V. N. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264–280, 1971.