nips nips2007 nips2007-34 nips2007-34-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew Hoffman, Arnaud Doucet, Nando de Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with a single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.
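To make the central idea concrete, below is a minimal, hypothetical Python sketch of reward-proportional MCMC policy search. It is not the paper's reversible jump algorithm (which also jumps across trajectory dimensions); it fixes the horizon and runs plain Metropolis-Hastings on a one-dimensional policy parameter, with toy dynamics and a toy reward invented purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def simulate_reward(theta, horizon=20):
    """Roll out a toy stochastic system under policy parameter theta and
    return the accumulated (non-negative) reward."""
    x, total = 0.0, 0.0
    for _ in range(horizon):
        u = np.sin(theta) * x + np.cos(theta)     # toy parametric policy
        x = 0.9 * x + u + 0.1 * rng.normal()      # toy noisy transition model
        total += np.exp(-x ** 2)                  # toy reward, peaked near x = 0
    return total

def mh_policy_search(n_iters=5000, step=0.1):
    """Metropolis-Hastings targeting a density proportional to the reward."""
    theta = 0.0
    r = simulate_reward(theta)
    samples = []
    for _ in range(n_iters):
        theta_prop = theta + step * rng.normal()  # random-walk proposal
        r_prop = simulate_reward(theta_prop)
        # Accept with probability min(1, reward ratio): high-reward parameters
        # are visited more often, so the chain concentrates near the optimum.
        if rng.uniform() < min(1.0, r_prop / max(r, 1e-12)):
            theta, r = theta_prop, r_prop
        samples.append(theta)
    return np.array(samples)

chain = mh_policy_search()
print("estimate of theta after burn-in:", chain[1000:].mean())

Because high-reward parameter settings are accepted more often, the chain concentrates its samples where the reward is large, which is the sense in which the reward acts as an unnormalized target density.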
[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[2] H. Attias. Planning by probabilistic inference. In Uncertainty in Artificial Intelligence, 2003.
[3] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[5] P. Dayan and G. E. Hinton. Using EM for reinforcement learning. Neural Computation, 9:271–278, 1997.
Figure 4 (evolution of the policy parameter θ against transition-model samples): Convergence of PEGASUS and our Bayesian policy search algorithm when started from θ = 0 and converging to the optimum of θ∗ = π/4. The plots are averaged over 10 runs. For our algorithm we plot samples taken directly from the MCMC algorithm itself: plotting the empirical average would produce an estimate whose convergence is almost immediate, but we also wanted to show the “burn-in” period. For both algorithms, lines denoting one standard deviation are shown and performance is plotted against the number of samples taken from the transition model.
[6] A. Doucet and V. B. Tadic. On solving integral equations using Markov chain Monte Carlo methods. Technical Report CUED-F-INFENG 444, Cambridge University Engineering Department, 2004.
[7] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.
[8] P. J. Green. Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems, 2003.
[9] M. Klaas, M. Briers, N. de Freitas, A. Doucet, and S. Maskell. Fast particle smoothing: If I had a million particles. In International Conference on Machine Learning, 2006.
[10] G. Lawrence, N. Cowan, and S. Russell. Efficient gradient estimation for motor control learning. In Uncertainty in Artificial Intelligence, pages 354–36, 2003.
[11] P. Müller. Simulation based optimal design. Bayesian Statistics, 6, 1999.
[12] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by inhomogeneous Markov chain simulation. Journal of the American Statistical Association, 99:788–798, 2004.
[13] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous inverted helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.
[14] A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, 2000.
[15] J. Peters and S. Schaal. Policy gradient methods for robotics. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[16] M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.
[17] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59(4):731–792, 1997.
[18] S. Thrun. Monte Carlo POMDPs. In S. Solla, T. Leen, and K.-R. Müller, editors, Neural Information Processing Systems, pages 1064–1070. MIT Press, 2000.
[19] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[20] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In International Conference on Machine Learning, 2006.
[21] D. Verma and R. P. N. Rao. Planning and acting in uncertain environments using probabilistic inference. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.