
NIPS 2009, Paper 134: Learning to Explore and Exploit in POMDPs


Source: pdf

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.
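For intuition only, the sketch below illustrates the kind of three-way decision the abstract describes: at each step the agent either exploits its current estimates, explores on its own, or pays a fixed cost to query an oracle for a good action. It is a toy bandit-style example with a count-based uncertainty heuristic, not the paper's dual-policy POMDP method; all names and the decision rule are illustrative assumptions.

```python
# Toy sketch (NOT the paper's dual-policy algorithm) of the exploit / explore /
# query-an-oracle trade-off described in the abstract. The bandit setting and
# the count-based "risk" heuristic are illustrative assumptions only.
import random


class OracleQueryAgent:
    def __init__(self, n_actions, oracle_cost=0.3, explore_scale=1.0):
        self.n_actions = n_actions
        self.oracle_cost = oracle_cost      # price charged per oracle query
        self.explore_scale = explore_scale  # controls how fast exploration decays
        self.counts = [0] * n_actions       # visits per action
        self.values = [0.0] * n_actions     # running mean reward per action

    def _risk(self, a):
        # Crude count-based uncertainty about action a's value estimate.
        return 1.0 / (1 + self.counts[a])

    def choose(self):
        # Exploitation target: best action under current value estimates.
        greedy = max(range(self.n_actions), key=self.values.__getitem__)
        risk = self._risk(greedy)
        if risk > self.oracle_cost:
            return "query", None            # active learning: ask the oracle
        if risk > random.random() * self.explore_scale:
            return "act", random.randrange(self.n_actions)  # explore
        return "act", greedy                # exploit

    def update(self, action, reward):
        # Incremental running-mean update of the chosen action's value.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


if __name__ == "__main__":
    random.seed(0)
    true_means = [0.2, 0.8, 0.5]                        # hidden reward means
    best = max(range(len(true_means)), key=true_means.__getitem__)
    agent = OracleQueryAgent(n_actions=len(true_means))
    total = 0.0
    for _ in range(200):
        mode, action = agent.choose()
        if mode == "query":
            action = best                               # oracle reveals a good action
            total -= agent.oracle_cost                  # ... at a price
        reward = random.gauss(true_means[action], 0.1)
        agent.update(action, reward)
        total += reward
    print(f"net return after oracle costs: {total:.1f}")
```

With these (assumed) parameters the agent queries the oracle while its greedy estimate is still very uncertain, then shifts to exploiting with occasional exploration that decays as its estimates sharpen, which is the qualitative behavior the abstract targets.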


reference text

[1] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[2] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[3] F. Doshi, J. Pineau, and N. Roy. Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 256–263. ACM, 2008.

[4] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 740–747, 1999.

[5] M. Kearns and S. P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In Proc. ICML, pages 260–268, 1998.

[6] H. Li, X. Liao, and L. Carin. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research, 10:1131–1186, 2009.

[7] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proc. ICML, 1995.

[8] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1025–1032, August 2003.

[9] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable domains. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.

[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.