
74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs


Source: pdf

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems, such as robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic, ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the Perseus algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend Perseus to deal with continuous action and observation sets by designing effective sampling approaches.

Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration
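
The tractability claim in the abstract rests on a standard Gaussian identity: the integral of the product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means, with the covariances summed. The short Python sketch below is purely illustrative and not the authors' implementation; the names GaussianMixture, inner_product, and best_alpha are assumptions introduced here. It shows how the expected value of a Gaussian-mixture alpha-function under a Gaussian-mixture belief reduces to a double sum over mixture components, which is the kind of closed-form integral the continuous-state Perseus backup relies on.

    # Minimal 1-D sketch (illustrative assumption, not the paper's code) of the
    # closed-form inner product <alpha, b> between a Gaussian-mixture
    # alpha-function and a Gaussian-mixture belief.
    import math
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class GaussianMixture:
        """Weighted 1-D Gaussian mixture: sum_i w[i] * N(x; mu[i], var[i])."""
        weights: List[float]
        means: List[float]
        variances: List[float]

    def gauss_pdf(x: float, mean: float, var: float) -> float:
        """Density of N(mean, var) evaluated at x."""
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

    def inner_product(alpha: GaussianMixture, belief: GaussianMixture) -> float:
        """Closed-form integral of alpha(s) * b(s) over s.

        Uses the identity  int N(s; m, S) N(s; mu, Sigma) ds = N(m; mu, S + Sigma),
        so the value of a mixture alpha-function at a mixture belief is a double
        sum over components, with no numerical integration.
        """
        total = 0.0
        for va, ma, sa in zip(alpha.weights, alpha.means, alpha.variances):
            for wb, mb, sb in zip(belief.weights, belief.means, belief.variances):
                total += va * wb * gauss_pdf(ma, mb, sa + sb)
        return total

    def best_alpha(alphas: List[GaussianMixture], belief: GaussianMixture) -> int:
        """Index of the alpha-function maximizing the expected value at `belief`."""
        values = [inner_product(a, belief) for a in alphas]
        return max(range(len(values)), key=values.__getitem__)

    if __name__ == "__main__":
        belief = GaussianMixture([0.6, 0.4], [0.0, 2.0], [0.5, 0.5])
        alphas = [
            GaussianMixture([1.0], [0.0], [1.0]),  # favors states near 0
            GaussianMixture([1.0], [2.0], [1.0]),  # favors states near 2
        ]
        print("V(b) =", inner_product(alphas[best_alpha(alphas, belief)], belief))

With a particle-set belief, the same quantity would instead be approximated by a weighted sum of alpha-function evaluations at the particles, which is why both belief representations fit the same point-based backup.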


reference text

K. J. Åström. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. In Proceedings of the International Conference on Machine Learning, pages 3–10, Sydney, Australia, 2002.
B. Bakker. Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems 15 (NIPS-2002), pages 1475–1482. MIT Press, 2003.
J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 2001. 2nd Edition.
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
J. Boger, P. Poupart, J. Hoey, C. Boutilier, G. Fernie, and A. Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1293–1299, Edinburgh, Scotland, 2005.
C. Boutilier. A POMDP formulation of preference elicitation problems. In Proceedings of the National Conference on Artificial Intelligence, pages 239–246, Edmonton, AB, 2002.
C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the National Conference on Artificial Intelligence, pages 1168–1175, Portland, OR, 1996.
A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte. Planning in continuous state spaces with parametric POMDPs. In Reasoning with Uncertainty in Robotics, Workshop of the International Joint Conference on Artificial Intelligence, pages 40–47, 2006.
A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of the International Conference on Intelligent Robots and Systems, pages 963–972, 1996.
A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact algorithm for partially observable Markov decision processes. In Proceedings of Uncertainty in Artificial Intelligence, pages 54–61, 1997.
H. T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, 1988.
T. Darrell and A. P. Pentland. Active gesture recognition using partially observable Markov decision processes. In IEEE International Conference on Pattern Recognition, pages 984–988, Vienna, Austria, 1996.
F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte-Carlo localization for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1322–1328, 1999.
A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo in Practice. Springer-Verlag, New York, 2001.
M. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.
E. B. Dynkin. Controlled random sequences. Theory of Probability and its Applications, 10(1):1–14, 1965.
D. Fox. KLD-sampling: Adaptive particle filters. In Advances in Neural Information Processing Systems 14 (NIPS-2001), pages 713–720. MIT Press, 2002.
D. Fox. Adapting the sample size in particle filters through KLD-sampling. International Journal of Robotics Research, 22(10-11):985–1004, 2003.
J. Goldberger and S. Roweis. Hierarchical clustering of a mixture model. In Advances in Neural Information Processing Systems 17 (NIPS-2004), pages 505–512. MIT Press, 2005.
J. Hoey and P. Poupart. Solving POMDPs with continuous or large discrete observation spaces. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1332–1338, 2005.
M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
C. Lusena, J. Goldsmith, and M. Mundhenk. Nonapproximability results for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:83–103, 2001.
O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proceedings of the National Conference on Artificial Intelligence, pages 541–548, Orlando, FL, 1999.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of Uncertainty in Artificial Intelligence, pages 427–436, Stockholm, 1999.
G. E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the National Conference on Artificial Intelligence, pages 587–592, Edmonton, AB, 2002.
A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of Uncertainty in Artificial Intelligence, pages 406–415, Stanford, CA, 2000.
C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1025–1032, 2003a.
J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271–281, 2003b.
M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
J. M. Porta and B. J. A. Kröse. Appearance-based concurrent map building and localization using a multi-hypotheses tracker. In Proceedings of the International Conference on Intelligent Robots and Systems, pages 3424–3429, Sendai, Japan, 2004.
J. M. Porta, M. T. J. Spaan, and N. Vlassis. Robot planning in partially observable continuous domains. In Robotics: Science and Systems I, pages 217–224, MIT, Cambridge, MA, 2005.
P. Poupart and C. Boutilier. Value-directed compressions of POMDPs. In Advances in Neural Information Processing Systems 15 (NIPS-2002), pages 1547–1554. MIT Press, 2003.
P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16 (NIPS-2003), pages 823–830. MIT Press, 2004.
P. Poupart and C. Boutilier. VDCBPI: An approximate scalable algorithm for large POMDPs. In Advances in Neural Information Processing Systems 17 (NIPS-2004), pages 1081–1088. MIT Press, 2005.
P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 697–704, 2006.
M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, Inc., 1994.
N. Roy. Finding Approximate POMDP Solutions Through Belief Compression. PhD thesis, Carnegie Mellon University, 2003.
N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. In Advances in Neural Information Processing Systems 15 (NIPS-2002), pages 1635–1642. MIT Press, 2003.
N. Roy, G. Gordon, and S. Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005.
N. Roy, J. Pineau, and S. Thrun. Spoken dialog management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 93–100, Hong Kong, 2000.
M. Rudary, S. Singh, and D. Wingate. Predictive linear-Gaussian models of stochastic dynamical systems. In Proceedings of Uncertainty in Artificial Intelligence, pages 501–508, 2005.
R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1080–1087, 1995.
T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proceedings of Uncertainty in Artificial Intelligence, pages 520–527, Banff, Alberta, 2004.
E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.
M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
G. Theocharous and S. Mahadevan. Approximate planning with hierarchical partially observable Markov decision processes for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1347–1352, 2002.
S. Thrun. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems 12 (NIPS-1999), pages 1064–1070. MIT Press, 2000.
N. Vlassis, B. Terwijn, and B. J. A. Kröse. Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 7–12, 2002.
S. D. Whitehead and L.-J. Lin. Reinforcement learning of non-Markov decision processes. Artificial Intelligence, 73(1-2):271–306, 1995.
J. Williams, P. Poupart, and S. Young. Using factored Markov decision processes with continuous observations for dialogue management. Technical Report CUED/F-INFEG/TR.520, Cambridge University, Engineering Department, 2005.
B. Zhang, Q. Cai, J. Mao, and B. Guo. Planning and acting under uncertainty: A new model for spoken dialogue systems. In Proceedings of Uncertainty in Artificial Intelligence, pages 572–579, Seattle, WA, 2001.
N. L. Zhang and W. Zhang. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:29–51, 2001.