
242 nips-2009-The Infinite Partially Observable Markov Decision Process


Source: pdf

Author: Finale Doshi-Velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.
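
A minimal illustrative sketch in Python, not the authors' implementation: it uses a Chinese-restaurant-process-style rule, in the spirit of the hierarchical Dirichlet process prior cited in [9], to show how a set of explicitly modelled states can grow without bound as an agent gathers experience. The function name sample_next_state and the concentration parameter alpha are hypothetical choices made only for this example.

    # Illustrative only: each step either revisits an already-instantiated state
    # (with probability proportional to its visit count) or creates a brand-new
    # state (with probability proportional to alpha). Only states that have
    # actually been visited are ever represented explicitly.
    import random

    def sample_next_state(visit_counts, alpha=1.0):
        """Return an existing state index or a new one, CRP-style."""
        total = sum(visit_counts.values()) + alpha
        r = random.uniform(0.0, total)
        for state, count in visit_counts.items():
            r -= count
            if r <= 0.0:
                return state
        return max(visit_counts, default=-1) + 1  # instantiate a new state

    visit_counts = {}  # visited states only; this dictionary grows with experience
    for _ in range(50):
        s = sample_next_state(visit_counts)
        visit_counts[s] = visit_counts.get(s, 0) + 1
    print("states instantiated after 50 steps:", len(visit_counts))

Under such a prior the expected number of distinct states grows slowly (roughly logarithmically) with the number of steps, which is one way to read the abstract's claim that only visited states need to be modelled explicitly.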


reference text

[1] R. Dearden, N. Friedman, and D. Andre, “Model based Bayesian exploration,” in UAI, pp. 150–159, 1999.

[2] M. Strens, “A Bayesian framework for reinforcement learning,” in ICML, 2000.

[3] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, “An analytic solution to discrete Bayesian reinforcement learning,” in ICML, (New York, NY, USA), pp. 697–704, ACM Press, 2006.

[4] R. Jaulmes, J. Pineau, and D. Precup, “Learning in non-stationary partially observable Markov decision processes,” ECML Workshop, 2005.

[5] S. Ross, B. Chaib-draa, and J. Pineau, “Bayes-adaptive POMDPs,” in Neural Information Processing Systems (NIPS), 2008.

[6] S. Ross, B. Chaib-draa, and J. Pineau, “Bayesian reinforcement learning in continuous POMDPs with application to robot navigation,” in ICRA, 2008.

[7] F. Doshi, J. Pineau, and N. Roy, “Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs,” in International Conference on Machine Learning, vol. 25, 2008.

[8] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen, “The infinite hidden Markov model,” in Advances in Neural Information Processing Systems 14, MIT Press, 2002.

[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.

[10] J. van Gael and Z. Ghahramani, Inference and Learning in Dynamic Models, ch. Nonparametric Hidden Markov Models. Cambridge University Press, 2010.

[11] Y. W. Teh, “Dirichlet processes,” submitted to Encyclopedia of Machine Learning, 2007.

[12] J. Pineau, G. Gordon, and S. Thrun, “Point-based value iteration: An anytime algorithm for POMDPs,” in IJCAI, 2003.

[13] M. T. J. Spaan and N. Vlassis, “Perseus: Randomized point-based value iteration for POMDPs,” Journal of Artificial Intelligence Research, vol. 24, pp. 195–220, 2005.

[14] T. Smith and R. Simmons, “Heuristic search value iteration for POMDPs,” in UAI, (Banff, Alberta), 2004.

[15] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa, “Online planning algorithms for POMDPs,” Journal of Artificial Intelligence Research, vol. 32, pp. 663–704, July 2008.

[16] J. van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani, “Beam sampling for the infinite hidden Markov model,” in ICML, vol. 25, 2008.

[17] R. Neal, “Slice sampling,” Annals of Statistics, vol. 31, pp. 705–767, 2003.

[18] C. K. Carter and R. Kohn, “On Gibbs sampling for state space models,” Biometrika, vol. 81, pp. 541–553, September 1994.

[19] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, “Learning policies for partially observable environments: scaling up,” in ICML, 1995.

[20] D. McAllester and S. Singh, “Approximate planning for factored POMDPs using belief state simplification,” in UAI 15, 1999.

[21] L. Chrisman, “Reinforcement learning with perceptual aliasing: The perceptual distinctions approach,” in Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183–188, AAAI Press, 1992.

[22] F. Doshi and N. Roy, “Efficient model learning for dialog management,” in Proceedings of Human-Robot Interaction (HRI 2007), (Washington, DC), March 2007.

[23] P. Poupart and N. Vlassis, “Model-based Bayesian reinforcement learning in partially observable domains,” in ISAIM, 2008.

[24] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier, “SPUDD: Stochastic planning using decision diagrams,” in UAI, pp. 279–288, 1999.

[25] A. P. Wolfe, “POMDP homomorphisms,” in NIPS RL Workshop, 2006.

[26] M. O. Duff, Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002.