
Inverse Reinforcement Learning in Partially Observable Environments (JMLR 2011)



Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to address more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior: the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. IRL in POMDPs poses a greater challenge than in MDPs, since it is not only ill-posed due to the nature of IRL but also computationally intractable due to the hardness of solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our approach is useful for partially observable settings.

Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming
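
For context on the linear-programming view of IRL mentioned in the abstract, the following is a minimal sketch of the classical LP-based IRL formulation for fully observable MDPs (Ng and Russell, 2000), which the paper extends to the POMDP setting. It is not the authors' POMDP algorithm; the 3-state, 2-action MDP, the margin objective, and all variable names are illustrative assumptions.

    # Minimal sketch: classical LP-based IRL for a fully observable MDP
    # (Ng & Russell, 2000), assuming the expert's action is optimal in every
    # state.  The toy MDP below is an illustrative assumption.
    import numpy as np
    from scipy.optimize import linprog

    gamma = 0.9
    # Transition matrices P[a][s, s']; action 0 plays the role of the expert's action a*.
    P = [
        np.array([[0.8, 0.2, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.2, 0.0, 0.8]]),   # expert action a*
        np.array([[0.2, 0.8, 0.0],
                  [0.0, 0.2, 0.8],
                  [0.8, 0.0, 0.2]]),   # alternative action
    ]
    n = P[0].shape[0]
    inv = np.linalg.inv(np.eye(n) - gamma * P[0])   # (I - gamma * P_{a*})^{-1}

    # Variables: state reward vector R plus a margin t.
    # Maximize t subject to (P_{a*} - P_a)(I - gamma P_{a*})^{-1} R >= t
    # componentwise for every non-expert action a, with |R| <= 1.
    A_ub, b_ub = [], []
    for a in range(1, len(P)):
        G = (P[0] - P[a]) @ inv                     # n x n matrix acting on R
        for s in range(n):
            A_ub.append(np.append(-G[s], 1.0))      # -G[s].R + t <= 0
            b_ub.append(0.0)

    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # linprog minimizes, so maximize t
    bounds = [(-1.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    print("recovered reward:", res.x[:n], "margin:", res.x[-1])

The paper's contribution is to replace these state-based optimality constraints with ones defined over beliefs or finite-state-controller policies, which is where the linear and quadratically constrained programs in the keywords arise.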


References

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1-8, Banff, Alberta, Canada, 2004.
Tilman Borgers and Rajiv Sarin. Naive reinforcement learning with endogenous aspirations. International Economic Review, 41(4):921-950, 2000.
Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI), pages 1023-1028, Seattle, WA, USA, 1994.
Jaedeug Choi and Kee-Eung Kim. Inverse reinforcement learning in partially observable environments. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 1028-1033, Pasadena, CA, USA, 2009.
Michael X Cohen and Charan Ranganath. Reinforcement learning signals predict future decisions. Journal of Neuroscience, 27(2):371-378, 2007.
Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 88(4):848-881, 1998.
Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79-103, 1999.
Hector Geffner and Blai Bonet. Solving large POMDPs using real time dynamic programming. In Proceedings of the AAAI Fall Symposium Series, pages 61-68, 1998.
Jacques Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13(1):49-52, 1902.
Eric A. Hansen. Finite-Memory Control of Partially Observable Systems. PhD thesis, University of Massachusetts Amherst, 1998.
Jesse Hoey, Axel Von Bertoldi, Pascal Poupart, and Alex Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In Proceedings of the 5th International Conference on Vision Systems, Bielefeld University, Germany, 2007.
Ed Hopkins. Adaptive learning models of consumer behavior. Journal of Economic Behavior and Organization, 64(3-4):348-368, 2007.
Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
Shihao Ji, Ronald Parr, Hui Li, Xuejun Liao, and Lawrence Carin. Point-based policy iteration. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI), pages 1243-1249, Vancouver, BC, USA, 2007.
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134, 1998.
Rudolf E. Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, 86:51-60, 1964.
Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems, Zurich, Switzerland, 2008.
Terran Lane and Carla E. Brodley. An empirical study of two approaches to sequence learning for anomaly detection. Machine Learning, 51(1):73-107, 2003.
Daeyeol Lee, Michelle L. Conroy, Benjamin P. McGreevy, and Dominic J. Barraclough. Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research, 22(1):45-58, 2004.
George E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1-16, 1982.
P. Read Montague and Gregory S. Berns. Neural economics and the biological substrates of valuation. Neuron, 36(2):265-284, 2002.
Gergely Neu and Csaba Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), pages 295-302, Vancouver, BC, Canada, 2007.
Gergely Neu and Csaba Szepesvari. Training parsers by inverse reinforcement learning. Machine Learning, pages 1-35, 2009.
Allen Newell. The knowledge level. Artificial Intelligence, 18(1):87-127, 1982.
Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663-670, Stanford University, Stanford, CA, USA, 2000.
Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), pages 278-287, Bled, Slovenia, 1999.
Yael Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3):139-154, 2009.
Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335-380, 2006.
Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2586-2591, Hyderabad, India, 2007.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 729-736, Pittsburgh, PA, USA, 2006.
Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), pages 101-103, Madison, WI, USA, 1998.
Trey Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, Carnegie Mellon University, The Robotics Institute, 2007.
Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 520-527, Banff, Canada, 2004.
Trey Smith and Reid Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 542-547, Edinburgh, Scotland, 2005.
Edward Jay Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.
Matthijs T. J. Spaan and Nikos Vlassis. A point-based POMDP algorithm for robot planning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2399-2404, New Orleans, LA, USA, 2004.
Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195-220, 2005.
Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Proceedings of Neural Information Processing Systems (NIPS), pages 1449-1456, Vancouver, BC, Canada, 2008.
Umar Syed, Michael Bowling, and Robert Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1032-1039, Helsinki, Finland, 2008.
Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 896-903, Bonn, Germany, 2005.
Jason D. Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393-422, 2007.
Qing Zhao, Lang Tong, Ananthram Swami, and Yunxia Chen. Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework. IEEE Journal on Selected Areas in Communications, 25(3):589-600, 2007.
Brian D. Ziebart, Andrew Maas, James Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), pages 1433-1438, Chicago, IL, USA, 2008.