nips2011-245 (NIPS 2011)
Author: Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Knowing neither which of the models is the correct one, nor what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon.
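To make the setting concrete, the following is a minimal Python sketch of what a candidate state-representation model looks like: a function mapping the past observation history to a finite set of states. It illustrates only the problem statement from the abstract, not the paper's algorithm; all names in it (Model, last_obs, last_two_obs, candidate_models) are hypothetical and introduced solely for this example.

```python
# Sketch of the problem setup only, assuming integer-valued observations.
from typing import Callable, Sequence

Observation = int
State = int

# A "model" (state-representation) maps the past observation history
# to a state drawn from some finite set.
Model = Callable[[Sequence[Observation]], State]

def last_obs(history: Sequence[Observation]) -> State:
    """Candidate model: the state is the most recent observation."""
    return history[-1] if history else 0

def last_two_obs(history: Sequence[Observation]) -> State:
    """Candidate model: the state encodes the last two observations
    (still a finite set if observations take finitely many values)."""
    a = history[-2] if len(history) >= 2 else 0
    b = history[-1] if history else 0
    return 10 * a + b  # assumes observations lie in {0, ..., 9}

# The learner is handed several such models; at least one of them induces
# Markovian state dynamics, but neither the correct model nor the
# transition/reward probabilities of the resulting MDP are known in advance.
candidate_models: list[Model] = [last_obs, last_two_obs]

if __name__ == "__main__":
    history = [3, 1, 4, 1, 5]
    print([m(history) for m in candidate_models])  # e.g. [5, 15]
```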
[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 322–331, October 1995.
[3] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 35–42, Arlington, Virginia, United States, 2009. AUAI Press.
[4] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, March 2003.
[5] Marcus Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.
[6] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[7] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209–232, November 2002.
[8] Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[9] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.
[10] Daniil Ryabko and Marcus Hutter. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405:274–284, October 2008.
[11] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 881–888, New York, NY, USA, 2006. ACM.
[12] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Proceedings of the Neural Information Processing Systems Conference (NIPS), 2007.