jmlr jmlr2006 jmlr2006-20 jmlr2006-20-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.
Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation
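To make the two core ideas in the abstract concrete, here is a minimal Python sketch of (1) max-plus payoff propagation for joint-action selection on a coordination graph and (2) an edge-based temporal-difference update in the spirit of Sparse Cooperative Q-learning. This is our illustration, not the authors' code: the function names, the synchronous update schedule, the message normalization, and the toy payoff tables are all assumptions made for the example.

```python
import numpy as np
from collections import defaultdict

def max_plus(n_agents, n_actions, edge_payoffs, n_iters=20):
    """Approximately maximize u(a) = sum_{(i,j)} f_ij(a_i, a_j) by
    payoff propagation over the coordination graph (exact on trees)."""
    neighbors = {i: set() for i in range(n_agents)}
    for (i, j) in edge_payoffs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # Directed message mu[(i, j)][a_j]: the payoff agent i can
    # contribute for each possible action of its neighbor j.
    mu = {(i, j): np.zeros(n_actions)
          for (p, q) in edge_payoffs for (i, j) in ((p, q), (q, p))}

    def f(i, j, ai, aj):  # payoff lookup, independent of edge orientation
        return (edge_payoffs[(i, j)][ai, aj] if (i, j) in edge_payoffs
                else edge_payoffs[(j, i)][aj, ai])

    for _ in range(n_iters):
        for (i, j) in mu:
            # mu_ij(a_j) = max_{a_i} [ f_ij(a_i, a_j)
            #                          + sum_{k in N(i)\{j}} mu_ki(a_i) ]
            incoming = sum((mu[(k, i)] for k in neighbors[i] if k != j),
                           np.zeros(n_actions))
            table = np.array([[f(i, j, ai, aj) + incoming[ai]
                               for aj in range(n_actions)]
                              for ai in range(n_actions)])
            msg = table.max(axis=0)
            mu[(i, j)] = msg - msg.mean()  # keep messages bounded on cyclic graphs

    # Each agent independently picks the action with the best message sum.
    return [int(np.argmax(sum((mu[(k, i)] for k in neighbors[i]),
                              np.zeros(n_actions))))
            for i in range(n_agents)]

def sparse_q_update(Q, edges, neighbors, s, a, rewards, s_next,
                    n_agents, n_actions, alpha=0.2, gamma=0.9):
    """One edge-based update in the spirit of Sparse Cooperative
    Q-learning: the global Q(s, a) is the sum of local tables
    Q_ij(s, a_i, a_j), one per coordination-graph edge."""
    # Greedy joint action in the next state, found by payoff propagation.
    a_star = max_plus(n_agents, n_actions,
                      {e: Q[(e, s_next)] for e in edges})
    for (i, j) in edges:
        # Each agent spreads its reward evenly over its incident edges.
        r_edge = rewards[i] / len(neighbors[i]) + rewards[j] / len(neighbors[j])
        td = (r_edge + gamma * Q[((i, j), s_next)][a_star[i], a_star[j]]
              - Q[((i, j), s)][a[i], a[j]])
        Q[((i, j), s)][a[i], a[j]] += alpha * td

# Toy usage: three agents on a chain, rewarded for coordinating on action 0.
f01 = np.array([[2.0, 0.0], [0.0, 1.0]])
print(max_plus(3, 2, {(0, 1): f01, (1, 2): f01}))  # -> [0, 0, 0]

Q = defaultdict(lambda: np.zeros((2, 2)))  # local tables created on first access
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
sparse_q_update(Q, edges=[(0, 1), (1, 2)], neighbors=neighbors,
                s='s0', a=[0, 0, 1], rewards=[1.0, 2.0, 0.0], s_next='s1',
                n_agents=3, n_actions=2)
```

On tree-structured graphs the message passing converges to an exact maximizer; on graphs with cycles it is only an approximation, which is why the paper combines it with an anytime scheme that inserts a computed joint action only if it improves on the best one found so far. Note also that the cost of each sweep grows with the number of edges and actions per agent, not with the exponentially large joint action space, which is the source of the linear scaling claimed in the abstract.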
References:
S. M. Ali, S. Koenig, and M. Tambe. Preprocessing techniques for accelerating the DCOP algorithm ADOPT. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1041–1048, Utrecht, The Netherlands, 2005.
T. Arai, E. Pagello, and L. E. Parker. Editorial: Advances in multi-robot systems. IEEE Transactions on Robotics and Automation, 18(5):655–661, 2002.
R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Transition-independent decentralized Markov decision processes. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Melbourne, Australia, 2003.
D. S. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of Markov decision processes. In Proceedings of Uncertainty in Artificial Intelligence (UAI), Stanford, CA, 2000.
U. Bertelé and F. Brioschi. Nonserial dynamic programming. Academic Press, 1972.
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the Conference on Theoretical Aspects of Rationality and Knowledge, 1996.
J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems (NIPS) 6, pages 671–678. Morgan Kaufmann Publishers, Inc., 1994.
W. Burgard, M. Moors, D. Fox, R. Simmons, and S. Thrun. Collaborative multi-robot exploration. In Proceedings of the IEEE International Conference on Robotics and Automation, 2000.
G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: A Bayesian approach. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 709–716, Melbourne, Australia, 2003. ACM Press.
C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, WI, 1998.
C. Crick and A. Pfeffer. Loopy belief propagation as a basis for communication in sensor networks. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2003.
R. Crites and A. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 8, pages 1017–1023. MIT Press, 1996.
R. Dechter. Constraint Processing. Morgan Kaufmann, 2003.
R. Dechter and I. Rish. A scheme for approximating probabilistic inference. In Proceedings of Uncertainty in Artificial Intelligence (UAI), pages 132–141, 1997.
E. H. Durfee. Scaling up agent coordination strategies. IEEE Computer, 34(7):39–46, July 2001.
P. S. Dutta, N. R. Jennings, and L. Moreau. Cooperative information sharing to improve distributed learning in multi-agent systems. Journal of Artificial Intelligence Research, 24:407–463, 2005.
C. V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22:143–174, November 2004.
C. V. Goldman and S. Zilberstein. Optimizing information exchange in cooperative multi-agent systems. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 137–144, New York, NY, USA, 2003. ACM Press.
C. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems (NIPS) 14. MIT Press, 2002a.
C. Guestrin. Planning under uncertainty in complex structured environments. PhD thesis, Computer Science Department, Stanford University, August 2003.
C. Guestrin, M. Lagoudakis, and R. Parr. Coordinated reinforcement learning. In International Conference on Machine Learning (ICML), Sydney, Australia, July 2002b.
C. Guestrin, S. Venkataraman, and D. Koller. Context-specific multiagent coordination and planning with factored MDPs. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Edmonton, Canada, July 2002c.
E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the National Conference on Artificial Intelligence (AAAI), San Jose, CA, 2004.
H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa. RoboCup: The robot world cup initiative. In Proceedings of the IJCAI-95 Workshop on Entertainment and AI/Alife, 1995.
J. R. Kok and N. Vlassis. Sparse cooperative Q-learning. In R. Greiner and D. Schuurmans, editors, Proceedings of the International Conference on Machine Learning, pages 481–488, Banff, Canada, July 2004. ACM.
J. R. Kok and N. Vlassis. Using the max-plus algorithm for multiagent decision making in coordination graphs. In RoboCup-2005: Robot Soccer World Cup IX, Osaka, Japan, July 2005.
J. R. Kok, M. T. J. Spaan, and N. Vlassis. Non-communicative multi-robot coordination in dynamic environments. Robotics and Autonomous Systems, 50(2-3):99–114, February 2005.
J. R. Kok. Coordination and Learning in Cooperative Multiagent Systems. PhD thesis, Faculty of Science, University of Amsterdam, 2006.
F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.
V. Lesser, C. Ortiz, and M. Tambe. Distributed sensor nets: A multiagent perspective. Kluwer Academic Publishers, 2003.
H.-A. Loeliger. An introduction to factor graphs. IEEE Signal Processing Magazine, pages 28–41, January 2004.
C. C. Moallemi and B. Van Roy. Distributed optimization in adaptive networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems (NIPS) 16. MIT Press, Cambridge, MA, 2004.
P. J. Modi, W.-M. Shen, M. Tambe, and M. Yokoo. ADOPT: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence, 161(1-2):149–180, 2005.
K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 1999.
A. Y. Ng, H. J. Kim, M. Jordan, and S. Sastry. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 16, 2004.
L. E. Parker. Distributed algorithms for multi-robot observation of multiple moving targets. Autonomous Robots, 12(3):231–255, 2002.
J. Pearl. Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Mateo, CA, 1988.
L. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling. Learning to cooperate via policy search. In Proceedings of Uncertainty in Artificial Intelligence (UAI), pages 489–496. Morgan Kaufmann Publishers, 2000.
M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. Wiley, New York, 1994.
D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.
J. Schneider, W.-K. Wong, A. Moore, and M. Riedmiller. Distributed value functions. In International Conference on Machine Learning (ICML), Bled, Slovenia, 1999.
S. Sen, M. Sekaran, and J. Hale. Learning to coordinate without sharing information. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Seattle, WA, 1994.
L. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39:1095–1100, 1953.
P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, MA, 1998.
K. Sycara. Multiagent systems. AI Magazine, 19(2):79–92, 1998.
M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (ICML), Amherst, MA, 1993.
G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), March 1995.
N. Vlassis. A concise introduction to multiagent systems and distributed AI. Informatics Institute, University of Amsterdam, September 2003.
N. Vlassis, R. Elhorst, and J. R. Kok. Anytime algorithms for multiagent decision making using coordination graphs. In Proceedings of the International Conference on Systems, Man, and Cybernetics (SMC), The Hague, The Netherlands, October 2004.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree consistency and bounds on the performance of the max-product algorithm and its generalizations. Statistics and Computing, 14:143–166, April 2004.
C. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8(3-4):279–292, 1992.
G. Weiss, editor. Multiagent systems: A modern approach to distributed artificial intelligence. MIT Press, 1999.
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, chapter 8, pages 239–269. Morgan Kaufmann Publishers Inc., January 2003.
M. Yokoo and E. H. Durfee. Distributed constraint optimization as a formal model of partially adversarial cooperation. Technical Report CSE-TR-101-91, University of Michigan, Ann Arbor, MI, 1991.
N. L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301–328, 1996.