
250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning


Source: pdf

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long-term goal of Interactive Reinforcement Learning is to incorporate non-expert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and that it is robust to infrequent and inconsistent human feedback.
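
The following is a minimal Python sketch of the policy-shaping idea the abstract describes, assuming the Advise rule C^Δ / (C^Δ + (1−C)^Δ) for turning accumulated "right"/"wrong" labels on actions into a probability that each action is optimal, combined with the agent's own policy by multiplying and renormalizing. The function and variable names (advise_policy, shaped_policy, delta, consistency, base_policy) are illustrative, not taken from the authors' code.

```python
import numpy as np

def advise_policy(delta, consistency):
    """Probability each action is optimal given human feedback counts.

    delta[a] = (# times a human labeled action a "right")
             - (# times a human labeled action a "wrong") in this state.
    consistency = assumed probability C that a human label agrees with
    the optimal policy (C = 0.5 means feedback carries no information).
    """
    c = consistency
    num = c ** delta
    return num / (num + (1.0 - c) ** delta)

def shaped_policy(base_policy, delta, consistency):
    """Combine the agent's policy with the feedback policy by multiplying
    the two probability estimates and renormalizing over actions."""
    feedback = advise_policy(np.asarray(delta, dtype=float), consistency)
    combined = np.asarray(base_policy, dtype=float) * feedback
    return combined / combined.sum()

# Example: 3 actions; the agent slightly prefers action 0, but a human has
# said "right" twice for action 2 and "wrong" once for action 0.
print(shaped_policy(base_policy=[0.4, 0.3, 0.3],
                    delta=[-1, 0, 2],
                    consistency=0.8))
```

Setting consistency = 0.5 makes the feedback term uniform, so the shaped policy falls back to the agent's own policy; values near 1.0 let even sparse feedback dominate the action choice.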


reference text

[1] C. L. Isbell, C. Shelton, M. Kearns, S. Singh, and P. Stone, “A social reinforcement learning agent,” in Proc. of the 5th Intl. Conf. on Autonomous Agents, pp. 377–384, 2001.

[2] H. S. Chang, “Reinforcement learning with supervision by combining multiple learnings and expert advices,” in Proc. of the American Control Conference, 2006.

[3] W. B. Knox and P. Stone, “TAMER: Training an agent manually via evaluative reinforcement,” in Proc. of the 7th IEEE ICDL, pp. 292–297, 2008.

[4] A. Tenorio-Gonzalez, E. Morales, and L. Villaseñor-Pineda, “Dynamic reward shaping: training a robot by voice,” in Advances in Artificial Intelligence–IBERAMIA, pp. 483–492, 2010.

[5] P. M. Pilarski, M. R. Dawson, T. Degris, F. Fahimi, J. P. Carey, and R. S. Sutton, “Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning,” in Proc. of the IEEE ICORR, pp. 1–7, 2011.

[6] A. L. Thomaz and C. Breazeal, “Teachable robots: Understanding human teaching behavior to build more effective robot learners,” Artificial Intelligence, vol. 172, no. 6-7, pp. 716–737, 2008.

[7] W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Proc. of the 9th Intl. Conf. on AAMAS, pp. 5–12, 2010.

[8] R. Dearden, N. Friedman, and S. Russell, “Bayesian Q-learning,” in Proc. of the 15th AAAI, pp. 761–768, 1998.

[9] C. Watkins and P. Dayan, “Q-learning: Technical note,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[10] T. Matthews, S. D. Ramchurn, and G. Chalkiadakis, “Competing with humans at fantasy football: Team formation in large partially-observable domains,” in Proc. of the 26th AAAI, pp. 1394–1400, 2012.

[11] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proc. of the 16th ICML, pp. 341–348, 1999.

[12] C. L. Isbell, M. Kearns, S. Singh, C. R. Shelton, P. Stone, and D. Kormann, “Cobot in LambdaMOO: An adaptive social statistics agent,” JAAMAS, vol. 13, no. 3, pp. 327–354, 2006.

[13] W. B. Knox and P. Stone, “Reinforcement learning from simultaneous human and MDP reward,” in Proc. of the 11th Intl. Conf. on AAMAS, pp. 475–482, 2012.

[14] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. of the 17th ICML, 2000.

[15] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proc. of the 21st ICML, 2004.

[16] C. Atkeson and S. Schaal, “Learning tasks from a single demonstration,” in Proc. of the IEEE ICRA, pp. 1706–1712, 1997.

[17] M. Taylor, H. B. Suay, and S. Chernova, “Integrating reinforcement learning with human demonstrations of varying ability,” in Proc. of the Intl. Conf. on AAMAS, pp. 617–624, 2011.

[18] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” JAIR, vol. 4, pp. 237–285, 1996.

[19] W. D. Smart and L. P. Kaelbling, “Effective reinforcement learning for mobile robots,” in Proc. of the IEEE ICRA, 2002.

[20] R. Maclin and J. W. Shavlik, “Creating advice-taking reinforcement learners,” Machine Learning, vol. 22, no. 1-3, pp. 251–281, 1996.

[21] L. Torrey, J. Shavlik, T. Walker, and R. Maclin, “Transfer learning via advice taking,” in Advances in Machine Learning I, Studies in Computational Intelligence (J. Koronacki, S. Wierzchon, Z. Ras, and J. Kacprzyk, eds.), vol. 262, pp. 147–170, Springer Berlin Heidelberg, 2010.

[22] C. Bailer-Jones and K. Smith, “Combining probabilities,” Tech. Note GAIA-C8-TN-MPIA-CBJ-053, 2011.

[23] M. L. Littman, G. A. Keim, and N. Shazeer, “A probabilistic approach to solving crossword puzzles,” Artificial Intelligence, vol. 134, no. 1-2, pp. 23–55, 2002.

[24] G. Konidaris and A. Barto, “Autonomous shaping: Knowledge transfer in reinforcement learning,” in Proc. of the 23rd ICML, pp. 489–496, 2006.