nips nips2012 nips2012-88 nips2012-88-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dongho Kim, Kee-eung Kim, Pascal Poupart
Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in environments with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1
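The abstract formalizes cost-sensitive exploration as a constrained MDP: maximize the expected discounted total reward subject to a bound on the expected discounted total cost. As a minimal sketch of that CMDP formalization only (not the paper's BEETLE extension, which additionally learns the transition model), the following Python snippet solves a small CMDP by linear programming over discounted occupancy measures, in the spirit of Altman [5]. The transition model, rewards, costs, and the budget c_hat are illustrative placeholders, not values from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Minimal sketch: solve a small constrained MDP (CMDP) by linear programming
# over discounted occupancy measures (Altman [5]). All problem data below are
# illustrative placeholders.

n_s, n_a = 3, 2      # number of states and actions
gamma = 0.95         # discount factor
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
R = np.array([[1.0, 0.2],                          # action 0: high reward ...
              [0.8, 0.1],
              [0.6, 0.3]])
C = np.array([[1.0, 0.0],                          # ... but incurs exploration cost
              [1.0, 0.0],
              [1.0, 0.0]])
mu0 = np.full(n_s, 1.0 / n_s)                      # initial state distribution
c_hat = 8.0                                        # bound on expected discounted total cost

# Decision variable: discounted occupancy measure x[s, a], flattened to a vector.
# Bellman-flow constraints:
#   sum_a x[s', a] - gamma * sum_{s, a} P[s, a, s'] x[s, a] = mu0[s']   for all s'
A_eq = np.zeros((n_s, n_s * n_a))
for sp in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[sp, s * n_a + a] = float(s == sp) - gamma * P[s, a, sp]
b_eq = mu0

# Cost constraint: sum_{s, a} C[s, a] * x[s, a] <= c_hat
A_ub = C.reshape(1, -1)
b_ub = np.array([c_hat])

# Maximize expected discounted reward  <=>  minimize -R . x
res = linprog(-R.reshape(-1), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))

x = res.x.reshape(n_s, n_a)
policy = x / x.sum(axis=1, keepdims=True)          # possibly randomized optimal policy
print("expected discounted reward:", R.reshape(-1) @ res.x)
print("expected discounted cost:  ", C.reshape(-1) @ res.x)
print("policy (rows = states):\n", policy)
```

Note that the optimal CMDP policy recovered from the occupancy measure may be randomized, one way in which cost constraints qualitatively change the solution relative to an unconstrained MDP; the paper addresses the harder setting where such constraints must be respected while the model is still being learned.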
[1] R. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[2] M. Duff. Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts, Amherst, 2002.
[3] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann. A Bayesian approach for learning and planning in partially observable Markov decision processes. Journal of Machine Learning Research, 12, 2011.
[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proc. of ICML, 2006.
[5] E. Altman. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999.
[6] K. W. Ross and R. Varadarajan. Markov decision processes with sample path constraints: the communicating case. Operations Research, 37(5):780–790, 1989.
[7] K. W. Ross and R. Varadarajan. Multichain Markov decision processes with a sample path constraint: a decomposition approach. Mathematics of Operations Research, 16(1):195–207, 1991.
[8] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1), 2010.
[9] A. Hans, D. Schneegaß, A. M. Schäfer, and S. Udluft. Safe exploration for reinforcement learning. In Proc. of the 16th European Symposium on Artificial Neural Networks, 2008.
[10] T. M. Moldovan and P. Abbeel. Safe exploration in Markov decision processes. In Proc. of NIPS Workshop on Bayesian Optimization, Experimental Design and Bandits, 2011.
[11] J. D. Isom, S. P. Meyn, and R. D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proc. of AAAI, 2008.
[12] D. Kim, J. Lee, K.-E. Kim, and P. Poupart. Point-based value iteration for constrained POMDPs. In Proc. of IJCAI, 2011.
[13] A. B. Piunovskiy and X. Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119–126, 2000.
[14] M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24, 2005.
[15] R. Dearden, N. Friedman, and D. Andre. Bayesian Q-learning. In Proc. of AAAI, 1998.
[16] M. Strens. A Bayesian framework for reinforcement learning. In Proc. of ICML, 2000.