
Bayes-Adaptive POMDPs (NIPS 2007)


Source: pdf

Author: Stéphane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date have focused on standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can trade off between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and the agent’s return improve over time, as the agent learns a better model.
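To make the belief-tracking idea in the abstract concrete, below is a minimal Python sketch of a Bayes-Adaptive POMDP belief update, where each hypothesis pairs a hidden state with Dirichlet counts over the unknown transition and observation probabilities. This is an assumption-based illustration, not the authors' implementation from [11]; the function name `belief_update` and the nested-tuple layout of the counts are hypothetical choices made for this sketch.

```python
from collections import defaultdict


def belief_update(belief, a, z):
    """One exact Bayes-Adaptive POMDP belief step after action `a` and observation `z`.

    `belief` maps hypotheses (s, phi, psi) to probabilities, where
    phi[a][s][s'] and psi[a][s'][z] are Dirichlet counts stored as nested tuples
    (tuples keep hypotheses hashable). Each hypothesis branches over every
    possible next state, weighted by the expected transition and observation
    probabilities implied by its counts, and the matching counts are incremented.
    """
    new_belief = defaultdict(float)
    for (s, phi, psi), p in belief.items():
        n_states = len(phi[a])
        for s_next in range(n_states):
            # Expected probabilities under the Dirichlet posteriors.
            p_trans = phi[a][s][s_next] / sum(phi[a][s])
            p_obs = psi[a][s_next][z] / sum(psi[a][s_next])
            w = p * p_trans * p_obs
            if w == 0.0:
                continue
            # Copy the counts for action `a` and increment the observed entries.
            phi_a = [list(row) for row in phi[a]]
            phi_a[s][s_next] += 1
            psi_a = [list(row) for row in psi[a]]
            psi_a[s_next][z] += 1
            phi_new = phi[:a] + (tuple(map(tuple, phi_a)),) + phi[a + 1:]
            psi_new = psi[:a] + (tuple(map(tuple, psi_a)),) + psi[a + 1:]
            new_belief[(s_next, phi_new, psi_new)] += w
    total = sum(new_belief.values())
    return {h: w / total for h, w in new_belief.items()}


if __name__ == "__main__":
    # Tiny example: 2 states, 1 action, 2 observations, uniform Dirichlet priors.
    phi0 = (((1, 1), (1, 1)),)      # phi[a][s][s'] transition counts
    psi0 = (((1, 1), (1, 1)),)      # psi[a][s'][z] observation counts
    b0 = {(0, phi0, psi0): 1.0}     # certain of the state, uncertain of the model
    b1 = belief_update(b0, a=0, z=1)
    for (s_next, _, _), p in sorted(b1.items(), key=lambda kv: kv[0][0]):
        print("state", s_next, "prob", round(p, 3))
```

The number of (state, count) hypotheses grows with every step of such an exact update, which is what motivates the finite approximation and the approximate belief tracking mentioned in the abstract (e.g., Monte Carlo methods in the spirit of [12]).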


reference text

[1] R. Dearden, N. Friedman, and D. Andre. Model-based Bayesian exploration. In UAI, 1999.

[2] M. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts, Amherst, USA, 2002.

[3] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In ICML, 2006.

[4] R. Jaulmes, J. Pineau, and D. Precup. Active learning in partially observable Markov decision processes. In ECML, 2005.

[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[6] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. In IJCAI, pages 1025–1032, Acapulco, Mexico, 2003.

[7] M. Spaan and N. Vlassis. Perseus: randomized point-based value iteration for POMDPs. JAIR, 24:195–220, 2005.

[8] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, Banff, Canada, 2004.

[9] S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environments. In AAMAS, 2005.

[10] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. JAIR, 15:319–350, 2001.

[11] S. Ross, B. Chaib-draa, and J. Pineau. Bayes-adaptive POMDPs. Technical Report SOCS-TR-2007.6, McGill University, 2007.

[12] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.