
3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries


Source: pdf

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection.
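To make the querying and active-selection loop concrete, the sketch below (Python/NumPy) shows one way a trajectory preference model and a query-selection step could be set up. It is an illustration, not the paper's model: the Bradley-Terry-style sigmoid likelihood over a linear trajectory-feature utility, the crude prior-sampling "posterior", and the disagreement-based selection rule are all simplifying assumptions chosen for brevity; the paper's actual Bayesian model, its MCMC inference, and its two query-selection methods are defined in the full text.

import numpy as np

rng = np.random.default_rng(0)


def pref_loglik(w, queries, labels):
    """Log-likelihood of the observed preferences under weight vector w.

    queries: list of (phi_a, phi_b) pairs of trajectory feature vectors.
    labels:  1 if trajectory a was preferred by the expert, else 0.
    Assumption: P(a preferred) = sigmoid(w . (phi_a - phi_b)).
    """
    ll = 0.0
    for (phi_a, phi_b), y in zip(queries, labels):
        p = 1.0 / (1.0 + np.exp(-(w @ (phi_a - phi_b))))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        ll += np.log(p) if y == 1 else np.log(1.0 - p)
    return ll


def posterior_samples(queries, labels, dim, n_keep=200, n_proposals=2000):
    """Crude posterior stand-in: draw proposals from a standard normal prior
    and keep those with the highest preference log-likelihood. A real
    implementation would use MCMC, e.g. hybrid Monte Carlo [8]."""
    proposals = rng.normal(size=(n_proposals, dim))
    scores = np.array([pref_loglik(w, queries, labels) for w in proposals])
    return proposals[np.argsort(scores)[-n_keep:]]


def select_query(candidate_pairs, samples):
    """Pick the candidate trajectory pair the posterior samples disagree on
    most, in the spirit of query by committee [9, 17]: the fraction of
    samples preferring trajectory a should be as close to 1/2 as possible."""
    def disagreement(pair):
        phi_a, phi_b = pair
        frac_a = ((samples @ (phi_a - phi_b)) > 0).mean()
        return -abs(frac_a - 0.5)
    return max(candidate_pairs, key=disagreement)


if __name__ == "__main__":
    dim = 4
    # Two already-answered queries with made-up features and labels.
    queries = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(2)]
    labels = [1, 0]
    samples = posterior_samples(queries, labels, dim)
    # Choose the next pair to show the expert from three random candidates.
    candidates = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(3)]
    phi_a, phi_b = select_query(candidates, samples)
    print("next query features:", phi_a, phi_b)

The selection rule here is only meant to illustrate the general idea the abstract describes: queries on which the current posterior is most uncertain about the expert's answer are the ones expected to be most informative.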


reference text

[1] R. Akrour, M. Schoenauer, and M. Sebag. Preference-based policy learning. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Proc. ECML/PKDD’11, Part I, volume 6911 of Lecture Notes in Computer Science, pages 12–27. Springer, 2011.

[2] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

[3] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robot. Auton. Syst., 57(5):469–483, May 2009.

[4] J. M. Bernardo. Expected information as expected utility. Annals of Statistics, 7(3):686–690, 1979.

[5] Weiwei Cheng, Johannes Fürnkranz, Eyke Hüllermeier, and Sang-Hyeun Park. Preference-based policy iteration: Leveraging preference learning for reinforcement learning. In Proceedings of the 22nd European Conference on Machine Learning (ECML 2011), pages 312–327. Springer, 2011.

[6] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, ICML ’05, pages 137–144, New York, NY, USA, 2005. ACM.

[7] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, New York, NY, USA, 1991.

[8] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[9] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

[10] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[11] D. V. Lindley. On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.

[12] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

[13] Bob Price and Craig Boutilier. Accelerating reinforcement learning through implicit imitation. J. Artif. Intell. Res. (JAIR), 19:569–629, 2003.

[14] Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, pages 463–471, 1998.

[15] Stefan Schaal. Learning from demonstration. In NIPS, pages 1040–1046, 1996.

[16] Andrew I. Schein and Lyle H. Ungar. Active learning for logistic regression: an evaluation. Mach. Learn., 68(3):235–265, October 2007.

[17] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92, pages 287–294, New York, NY, USA, 1992. ACM.