
Online Markov Decision Processes under Bandit Feedback



Authors: Gergely Neu, András Antos, András György, Csaba Szepesvári

Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after $T$ time steps the expected regret of the new algorithm is $O\big(T^{2/3}(\ln T)^{1/3}\big)$, giving the first rigorously proved regret bound for the problem.
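The interaction protocol described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only: the uniform-random agent is a placeholder for a learner, not the algorithm proposed in the paper, and all names (P, r_t, the problem sizes) are invented for the example. In each round an oblivious adversary fixes a reward function over all state-action pairs, but the agent observes only the reward of the transition it actually took (bandit feedback), while the transition kernel is known in advance.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (hypothetical, not from the paper).
n_states, n_actions, T = 3, 2, 1000

# Known transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

state = 0
total_reward = 0.0
for t in range(T):
    # An oblivious adversary fixes a reward function r_t : S x A -> [0, 1]
    # for this step, independently of the agent's past behavior.
    r_t = rng.random((n_states, n_actions))

    # The agent sees the current state and picks an action; a real learner
    # would use past bandit feedback here (uniform choice is a placeholder).
    action = rng.integers(n_actions)

    # Bandit feedback: only the reward of the realized (state, action)
    # pair is revealed; r_t at all other state-action pairs stays hidden.
    reward = r_t[state, action]
    total_reward += reward

    # Transition according to the known kernel.
    state = rng.choice(n_states, p=P[state, action])

print(f"total reward over T={T} steps: {total_reward:.1f}")

The regret of a learner in this setting is the gap between its expected total reward and that of the best stationary policy in hindsight; the paper's bound says this gap grows only as $O\big(T^{2/3}(\ln T)^{1/3}\big)$ for the proposed algorithm.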


reference text

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.

Even-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a Markov decision process. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 401–408.

Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736.

Neu, G., György, A., and Szepesvári, C. (2010). The online loop-free stochastic shortest-path problem. In COLT 2010.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Yu, J. Y. and Mannor, S. (2009a). Arbitrarily modulated Markov decision processes. In Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference. IEEE Press.

Yu, J. Y. and Mannor, S. (2009b). Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets'09: Proceedings of the First ICST International Conference on Game Theory for Networks, pages 314–322, Piscataway, NJ, USA. IEEE Press.

Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757.