
145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning


Source: pdf

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the number of agents n. We prove a worst-case lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: they require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n.
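A schematic restatement of the abstract's two sample-complexity claims may help fix ideas. This is a sketch only: constants and the dependence on accuracy, horizon, and policy-class size are suppressed, and the abstract's "roughly linearly" is rendered here as a soft lower bound $\tilde{\Omega}(n)$ (an assumption for illustration, not the paper's exact statement).

\[
  m_{\mathrm{global}}(n) \;=\; \tilde{\Omega}(n)
  \qquad \text{vs.} \qquad
  m_{\mathrm{local}}(n) \;=\; O(\log n),
\]

where $m(n)$ denotes the number of real-world examples needed to guarantee good performance with $n$ agents, the first bound applies to learners that see only the global reward signal, and the second is achievable by exploiting local reward signals in the distributed MDP.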


reference text

[1] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 2000.

[2] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999.

[3] Y. Chang, T. Ho, and L. Kaelbling. All learning is local: Multi-agent learning in global reward games. In Advances in NIPS 14, 2004.

[4] C. Guestrin, D. Koller, and R. Parr. Multi-agent planning with factored MDPs. In NIPS-14, 2002.

[5] C. Guestrin, M. Lagoudakis, and R. Parr. Coordinated reinforcement learning. In ICML, 2002.

[6] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In IJCAI 16, 1999.

[7] M. Kearns, M. Littman, and S. Singh. Graphical models for game theory. In UAI, 2001.

[8] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. Extended version of the paper in NIPS 12, 1999.

[9] L. Peshkin, K.-E. Kim, N. Meuleau, and L. Kaelbling. Learning to cooperate via policy search. In UAI 16, 2000.

[10] J. Schneider, W. Wong, A. Moore, and M. Riedmiller. Distributed value functions. In ICML, 1999.

[11] R. Williams and L. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.