
105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time


Source: pdf

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring setting and the imperfect monitoring setting (where the agents observe only their own payoffs and the joint actions).
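To make the setup concrete, here is a minimal Python sketch of the solution concept (not the paper's algorithm; the 3x3 game, payoff values, and function names are illustrative assumptions): it enumerates the pure strict Nash equilibria of a small coordination game of non-identical interest and keeps those that are Pareto-optimal.

    import itertools

    # Hypothetical 3x3 coordination game of non-identical interest.
    # payoffs[(i, j)] = (row player's payoff, column player's payoff).
    # (0, 0) and (1, 1) are strict NE the two agents rank differently;
    # (2, 2) is a strict NE that both agents like less.
    payoffs = {
        (0, 0): (4, 3), (1, 1): (3, 4), (2, 2): (1, 1),
        (0, 1): (0, 0), (0, 2): (0, 0), (1, 0): (0, 0),
        (1, 2): (0, 0), (2, 0): (0, 0), (2, 1): (0, 0),
    }
    actions = range(3)

    def is_strict_nash(i, j):
        # Strict NE: every unilateral deviation strictly lowers the deviator's payoff.
        row_ok = all(payoffs[(i, j)][0] > payoffs[(k, j)][0] for k in actions if k != i)
        col_ok = all(payoffs[(i, j)][1] > payoffs[(i, k)][1] for k in actions if k != j)
        return row_ok and col_ok

    equilibria = [a for a in itertools.product(actions, actions) if is_strict_nash(*a)]

    def pareto_dominated(a, candidates):
        # a is dominated if some other outcome is at least as good for both
        # agents and strictly better for at least one.
        ua = payoffs[a]
        return any(payoffs[b][0] >= ua[0] and payoffs[b][1] >= ua[1] and payoffs[b] != ua
                   for b in candidates if b != a)

    pareto_optimal = [a for a in equilibria if not pareto_dominated(a, equilibria)]
    print(equilibria)      # [(0, 0), (1, 1), (2, 2)]: three strict NE
    print(pareto_optimal)  # [(0, 0), (1, 1)]: (2, 2) is Pareto-dominated

The learning problem studied in the paper is harder than this enumeration: the agents do not see the payoff table up front and must estimate it from noisy samples before coordinating on one of the Pareto-optimal equilibria.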


reference text

[1] Bellare and Rogaway. Random oracles are practical: A paradigm for designing efficient protocols. In Proceedings of the 1st ACM Conference on Computer and Communications Security, 93.

[2] Boutilier. Planning, learning and coordination in multi-agent decision processes. In TARK, 96.

[3] Brafman and Tennenholtz. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. In IJCAI, 01.

[4] Claus and Boutilier. The dynamics of reinforcement learning in cooperative multi-agent systems. In AAAI, 98.

[5] Fiechter. Efficient reinforcement learning. In COLT, 94.

[6] Fudenberg and Levine. The theory of learning in games. MIT Press, 98.

[7] Greenwald and Hall. Correlated-Q learning. In AAAI Spring Symposium, 02.

[8] Hu and Wellman. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, 98.

[9] Kaelbling, Littman, and Moore. Reinforcement learning: A survey. JAIR, 96.

[10] Kearns and Singh. Near-optimal reinforcement learning in polynomial time. In ICML, 98.

[11] Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2:55–66, 00.

[12] Littman. Friend-or-Foe Q-learning in general-sum games. In ICML, 01.

[13] Lovász and Winkler. Exact mixing in an unknown Markov chain. Electronic Journal of Combinatorics, 95.

[14] Pivazyan and Shoham. Polynomial-time reinforcement learning of near-optimal policies. In AAAI, 02.

[15] Wang and Sandholm. Learning to play Pareto-optimal equilibria: Convergence and efficiency. www.cs.cmu.edu/~xiaofeng/LearnPOC.ps.

[16] Wang and Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In NIPS, 02.

[17] Young. The evolution of conventions. Econometrica, 61:57–84, 93.

[18] Young. An evolutionary model of bargaining. Journal of Economic Theory, 59, 93.


footnotes

4. Recall that agents have established the same numbering of actions. This allows them to encode their joint actions for inputting into γ in the same way.

5. The pattern of the for-loops is from the Lovász–Winkler algorithm [13].
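Footnote 4 amounts to a shared canonical encoding. A minimal sketch, assuming γ is some shared deterministic function of the encoded joint action (the hash used as a stand-in for γ and all names here are hypothetical):

    import hashlib

    NUM_ACTIONS = 3  # assumed common number of actions per agent

    def encode_joint_action(a1, a2, n=NUM_ACTIONS):
        # Because both agents use the same numbering of actions, they map
        # the joint action (a1, a2) to the same canonical integer.
        return a1 * n + a2

    def gamma(code):
        # Hypothetical stand-in for the paper's gamma: any shared
        # deterministic function of the encoded joint action works here.
        return hashlib.sha256(str(code).encode()).hexdigest()

    # Both agents compute identical values for the same joint action:
    assert gamma(encode_joint_action(1, 2)) == gamma(encode_joint_action(1, 2))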