jmlr2009-57: Multi-Task Reinforcement Learning in Partially Observable Stochastic Environments
Source: pdf
Author: Hui Li, Xuejun Liao, Lawrence Carin
Abstract: We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent’s behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent’s choice of actions is based directly on this conditional distribution, without an intervening model of the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected while following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over the RPRs across multiple environments. The intrinsic clustering property of the Dirichlet process induces sharing of episodes among similar environments, which effectively reduces the number of episodes required to learn a good policy in each environment when data sharing is appropriate. The number of distinct RPRs and the associated clusters (the sharing patterns) are discovered automatically by exploiting the episodic data as well as the nonparametric nature of the Dirichlet process. We demonstrate the effectiveness of the proposed RPR, as well as the RPR-based MTRL framework, on various problems, including grid-world navigation and multi-aspect target classification. The experimental results show that the RPR is a competitive reinforcement learning algorithm in partially observable domains, and that the MTRL framework consistently achieves better performance than single-task reinforcement learning. Keywords: reinforcement learning, partially observable Markov decision processes, multi-task learning, Dirichlet processes, regionalized policy representation
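To make the abstract's description more concrete, below is a minimal Python sketch of the two ideas it outlines: a history-dependent action distribution maintained through a small set of latent belief regions (in the spirit of the RPR), and a Chinese-restaurant-process draw illustrating how a Dirichlet process prior induces clustering of environments so that similar environments share one policy. The class and function names, the parameterization (mu, W, pi), and the belief-update rule are illustrative assumptions for this sketch, not the paper's exact model, parameter learning, or inference procedure.

import numpy as np

class RPRPolicySketch:
    # History-dependent stochastic policy over a finite set of latent belief regions.
    # Note: the parameterization and update rule are illustrative assumptions, not
    # the paper's exact RPR definition.

    def __init__(self, n_regions, n_actions, n_observations, seed=0):
        self.rng = np.random.default_rng(seed)
        # pi[z, a]: probability of choosing action a while in belief region z.
        self.pi = self._normalize(self.rng.random((n_regions, n_actions)))
        # W[z, a, o, z']: region-transition probabilities given the last action and observation.
        self.W = self._normalize(self.rng.random((n_regions, n_actions, n_observations, n_regions)))
        # mu[z]: initial distribution over belief regions.
        self.mu = self._normalize(self.rng.random(n_regions))
        self.belief = self.mu.copy()

    @staticmethod
    def _normalize(x):
        return x / x.sum(axis=-1, keepdims=True)

    def reset(self):
        self.belief = self.mu.copy()

    def act(self):
        # Marginalize over belief regions to get p(action | history), then sample an action;
        # no explicit model of the environment's hidden state is involved.
        p_a = self.belief @ self.pi
        return int(self.rng.choice(len(p_a), p=p_a))

    def update(self, action, observation):
        # Fold the latest (action, observation) pair into the distribution over regions,
        # so the next action is again conditioned on the full history.
        self.belief = self.belief @ self.W[:, action, observation, :]
        self.belief /= self.belief.sum()


def crp_cluster_assignments(n_environments, alpha, seed=0):
    # Chinese-restaurant-process draw: a standard way to sample the clustering implied
    # by a Dirichlet process prior. It only illustrates how environments could end up
    # grouped to share a distinct RPR; it is not the paper's inference algorithm.
    rng = np.random.default_rng(seed)
    counts, assignments = [], []
    for _ in range(n_environments):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)   # open a new cluster (a new distinct RPR)
        else:
            counts[k] += 1     # join an existing cluster (share that RPR)
        assignments.append(k)
    return assignments

As a usage sketch, one would construct a policy with, for example, RPRPolicySketch(n_regions=4, n_actions=3, n_observations=5), call act() to choose an action, and call update(action, observation) after each environment step. In the paper, the corresponding parameters are learned from episodic data by the off-policy batch algorithms under the Dirichlet process prior, rather than drawn at random as in this illustration.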