nips nips2002 nips2002-78 knowledge-graph by maker-knowledge-mining

78 nips-2002-Efficient Learning Equilibrium


Source: pdf

Author: Ronen I. Brafman, Moshe Tennenholtz

Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. [sent-7, score-0.413]

2 In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. [sent-9, score-0.359]

3 We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. [sent-10, score-0.453]

4 We also show that an ELE does not always exist in the imperfect monitoring case. [sent-11, score-0.328]

5 Much of this work uses repeated games [3, 5] and stochastic games [10, 9, 7, 1] as models of such interactions. [sent-15, score-0.691]

6 The literature on learning in games in game theory [5] is mainly concerned with understanding learning procedures that, if adopted by the different agents, will eventually converge to an equilibrium of the corresponding game. [sent-16, score-1.269]

7 The game itself may be known; the idea is to show that simple dynamics lead to rational behavior, as prescribed by a Nash equilibrium. [sent-17, score-0.558]

8 The learning algorithms themselves are not required to satisfy any rationality requirement; it is what they converge to, if adopted by all agents, that should be in equilibrium. [sent-18, score-0.326]

9 When facing uncertainty about the game that is played, game-theorists appeal to a Bayesian approach, which is completely different from a learning approach; the typical assumption in that approach is that there exists a probability distribution on the possible games, which is common-knowledge. [sent-28, score-0.541]

10 The notion of equilibrium is extended to this context of games with incomplete information, and is treated as the appropriate solution concept. [sent-29, score-0.527]

11 In this context, agents are assumed to be rational agents adopting the corresponding (Bayes-) Nash equilibrium, and learning is not an issue. [sent-30, score-0.532]

12 Adopting the framework of repeated games, we consider a situation where the learning algorithm is a strategy for an agent in a repeated game. [sent-32, score-0.449]

13 This strategy takes an action at each stage based on its previous observations, and initially has no information about the identity of the game being played. [sent-33, score-0.62]

14 It should be irrational for each agent to deviate from its learning algorithm, as long as the other agents stick to their algorithms, regardless of what the actual game is. [sent-36, score-1.161]

15 Efficiency: (a) A deviation from the learning algorithm by a single agent (while the others stick to their algorithms) will become irrational (i. [sent-38, score-0.449]

16 will lead to a situation where the deviator's payoff is not improved) after polynomially many stages. [sent-40, score-0.202]

17 (b) If all agents stick to their prescribed learning algorithms then the expected payoff obtained by each agent within a polynomial number of steps will be (close to) the value it could have obtained in a Nash equilibrium, had the agents known the game from the outset. [sent-41, score-1.485]

18 A tuple of learning algorithms satisfying the above properties for a given class of games is said to be an Efficient Learning Equilibrium (ELE). [sent-42, score-0.346]

19 Notice that the learning algorithms should satisfy the desired properties for every game in a given class despite the fact that the actual game played is initially unknown. [sent-43, score-1.142]

20 What we borrow from the game theory literature is the criterion for rational behavior in multi-agent systems. [sent-45, score-0.545]

21 We also take the equilibrium of the actual (initially unknown) game to be our benchmark for success; we wish to obtain a corresponding value although we initially do not know which game is played. [sent-47, score-1.206]

22 We also prove the existence of an ELE (satisfying all of the above desired properties) for a general class of games (repeated games with perfect monitoring), and show that it does not exist for another. [sent-49, score-0.733]

23 Technically speaking, the results we prove rely on a novel combination of the so-called folk theorems in economics and a novel efficient algorithm for the punishment of deviators (in games which are initially unknown). [sent-53, score-0.511]

24 For ease of exposition, our discussion will center on two-player repeated games in which the agents have an identical set of actions A. [sent-55, score-0.692]

25 The generalization to n-player repeated games with different action sets is immediate, but requires a little more notation. [sent-56, score-0.458]

26 The extension to stochastic games is fully discussed in the full paper [2]. [sent-57, score-0.307]

27 A common description of a (two-player) game is as a matrix. [sent-63, score-0.48]

28 The rows of the matrix correspond to player 1's actions and the columns correspond to player 2's actions. [sent-65, score-0.603]

29 The entry in row i and column j in the game matrix contains the rewards obtained by the players if player 1 plays his ith action and player 2 plays his jth action. [sent-66, score-1.354]

30 In a repeated game (RG) the players play a given game G repeatedly. [sent-67, score-1.219]

31 We can view a repeated game, with respect to a game G, as consisting of an infinite number of iterations, at each of which the players have to select an action of the game G. [sent-68, score-1.312]

32 For ease of exposition we normalize both players' payoffs in the game G to be nonnegative reals between 0 and some positive constant Rmax. [sent-70, score-0.663]
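
The matrix representation just described (together with the normalization of payoffs to [0, Rmax]) is easy to make concrete. The NumPy sketch below is purely illustrative: the payoff values are made up, and the pure-equilibrium check is a standard best-response test, not code from the paper.

```python
import numpy as np

# A two-player game over a shared action set A = {0, ..., k-1} is a pair of
# k x k payoff matrices: R1[i, j] and R2[i, j] are the rewards to players 1
# and 2 when player 1 plays row i and player 2 plays column j.
R_MAX = 10.0
R1 = np.array([[6.0, 0.0],
               [5.0, 1.0]])   # player 1's payoffs (made-up example values)
R2 = np.array([[3.0, 1.0],
               [2.0, 9.0]])   # player 2's payoffs (made-up example values)
assert ((0 <= R1) & (R1 <= R_MAX)).all()   # payoffs normalized to [0, Rmax]
assert ((0 <= R2) & (R2 <= R_MAX)).all()

def is_pure_nash(R1, R2, i, j):
    """True if neither player gains by unilaterally deviating from (i, j)."""
    return R1[i, j] >= R1[:, j].max() and R2[i, j] >= R2[i, :].max()

print([(i, j) for i in range(2) for j in range(2) if is_pure_nash(R1, R2, i, j)])
```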

33 In a perfect monitoring setting, the set of possible histories of length t is (A^2 x P^2)^t, and the set of possible histories, H, is the union of the sets of possible histories for all t >= 0, where (A^2 x P^2)^0 is the empty history. [sent-72, score-0.318]

34 Namely, the history at time t consists of the history of actions that have been carried out so far, and the corresponding payoffs obtained by the players. [sent-73, score-0.2]

35 Hence, in a perfect monitoring setting, a player can observe the actions selected and the payoffs obtained in the past, but does not know the game matrix to start with. [sent-74, score-1.186]

36 In an imperfect monitoring setup, all that a player can observe following the performance of its action is the payoff it obtained and the action selected by the other player. [sent-75, score-0.923]

37 The definition of the possible histories for an agent naturally follows. [sent-77, score-0.315]

38 Finally, in a strict imperfect monitoring setting, the agent cannot observe the other agents' payoff or their actions. [sent-78, score-0.712]

39 Given an RG, a policy for a player is a mapping from H, the set of possible histories, to the set of possible probability distributions over A. [sent-79, score-0.411]

40 2) of a policy profile (π, ρ), where π is a policy for player 1 and ρ is a policy for player 2, using the expected average reward criterion as follows. [sent-83, score-0.913]

41 Given an RG M and a natural number T, we denote the expected T-step undiscounted average reward of player 1 (resp. [sent-84, score-0.328]

42 2) when the players follow the policy profile (π, ρ), by U1(M, π, ρ, T) (resp. [sent-85, score-0.311]
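
Read plainly, U1(M, π, ρ, T) is just the expected average of player 1's stage payoffs over T iterations. The sketch below estimates it by Monte Carlo simulation; representing policies as callables from histories to action distributions is an assumption made for illustration, not the paper's formalism.

```python
import numpy as np

def estimate_U1(R1, R2, pi, rho, T, episodes=1000, seed=0):
    """Monte Carlo estimate of U1(M, pi, rho, T), the expected T-step
    undiscounted average reward of player 1 when the stage game (R1, R2)
    is played repeatedly and the players follow the policy profile (pi, rho).
    pi and rho map a history (a list of (a1, a2, p1, p2) tuples) to a
    probability vector over the k actions."""
    rng = np.random.default_rng(seed)
    k = R1.shape[0]
    averages = []
    for _ in range(episodes):
        history, total = [], 0.0
        for _ in range(T):
            a1 = rng.choice(k, p=pi(history))
            a2 = rng.choice(k, p=rho(history))
            p1, p2 = R1[a1, a2], R2[a1, a2]
            total += p1
            history.append((a1, a2, p1, p2))  # perfect monitoring: both payoffs observed
        averages.append(total / T)
    return float(np.mean(averages))

# Example: history-independent uniform play by both players.
# uniform = lambda history: np.ones(2) / 2
# estimate_U1(R1, R2, uniform, uniform, T=50)
```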

43 A policy profile (π, ρ) is a learning equilibrium w. [sent-89, score-0.381]

44 In this paper we mainly treat the class M of all repeated games with some fixed action profile (i. [sent-93, score-0.592]

45 , in which the set of actions available to all agents is fixed). [sent-95, score-0.283]

46 We shall stick to the assumption that both agents have a fixed, identical set A of k actions. [sent-97, score-0.302]

47 The definition of this desired value may be a parameter, the most natural candidate - though not the only candidate - being the expected payoffs in a Nash equilibrium of the game. [sent-101, score-0.452]

48 Then, denote the expected payoff of agent i in N(G) by NVi(N(G)). [sent-104, score-0.39]

49 Notice that a deviation is considered irrational if it does not increase the expected payoff by more than ε. [sent-109, score-0.356]

50 This is in the spirit of ε-equilibrium in game theory. [sent-110, score-0.48]

51 This will require, however, that we restrict our attention to games where there exists a Nash equilibrium in which the agents' expected payoffs are higher than their probabilistic maximin values. [sent-113, score-0.72]
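
The probabilistic maximin value referred to here is a player's security level: the largest expected payoff it can guarantee with a mixed strategy against any behavior of the opponent. It can be computed with a standard linear program; the SciPy sketch below is a generic implementation of that textbook LP, not something taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def maximin(R):
    """Maximin (security) value of the row player with payoff matrix R.
    Solves: maximize v subject to sum_i x_i * R[i, j] >= v for every column j,
    x >= 0, sum_i x_i = 1, where x is the row player's mixed strategy."""
    k, n = R.shape
    c = np.zeros(k + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -v
    A_ub = np.hstack([-R.T, np.ones((n, 1))])      # v - x^T R[:, j] <= 0 for each j
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(k), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * k + [(None, None)]      # x_i >= 0, v unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:k], res.x[-1]                    # (maximin strategy, maximin value)
```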

52 We assume that initially the game is unknown, but the agents will have learning algorithms that will rapidly lead to the values the players would have obtained in a Nash equilibrium had they known the game. [sent-115, score-1.144]

53 Notice that each agent's behavior should be the best response against the other agents' behaviors, and deviations should be irrational, regardless of what the actual (one-shot) game is. [sent-117, score-0.52]

54 3 Efficient Learning Equilibrium: Existence Let M be a repeated game in which G is played at each iteration. [sent-118, score-0.618]

55 Next, each agent computes a Nash equilibrium of the game and follows it. [sent-127, score-0.88]

56 If more than one equilibrium exists, then the first one according to the natural lexicographic ordering is used. [sent-128, score-0.225]

57 If one of the agents does not collaborate in the initial exploration phase, the other agent "punishes" this agent. [sent-129, score-0.469]

58 Otherwise, the agents have chosen a Nash-equilibrium, and it is irrational for them to deviate from this equilibrium unilaterally. [sent-131, score-0.598]

59 This idea combines the so-called folk-theorems in economics [6], and a technique for learning in zero-sum games introduced in [1]. [sent-132, score-0.361]

60 Folk-theorems in economics deal with a technique for obtaining some desired behavior by making a threat of employing a punishing strategy against a deviator from that behavior. [sent-133, score-0.24]

61 When both agents are equipped with corresponding punishing strategies, the desired behavior will be obtained in equilibrium (and the threat will not materialize, as a deviation becomes irrational). [sent-134, score-0.578]

62 These are addressed by having an efficient punishment algorithm that guarantees that the other agent will not obtain more than its maximin value, after polynomial time, although the game is initially unknown to the punishing agent. [sent-136, score-1.059]

63 In parallel, player 2 performs the sequence of actions (a1, . [sent-143, score-0.334]

64 If both players behaved according to the above then a Nash equilibrium of the corresponding (revealed) game is computed, and the players behave according to the corresponding strategies from that point on. [sent-147, score-1.019]

65 In particular, the agents can choose the equilibrium selected by a fixed shared algorithm. [sent-149, score-0.459]
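
Taken together, the perfect-monitoring construction has three parts: a coordinated exploration phase that reveals the whole payoff matrix, play of an equilibrium of the revealed game chosen by a fixed shared rule (e.g. the lexicographically first one), and a switch to punishment if the other player breaks the exploration protocol. The skeleton below is only a schematic rendering of that outline; `act`, `observe`, `solve_nash` and `punish` are hypothetical helpers, and the paper itself gives no such code.

```python
import numpy as np

def ele_player(k, my_id, act, observe, solve_nash, punish, seed=0):
    """Schematic skeleton of the perfect-monitoring ELE strategy for one player.
    act(a) plays action a; observe() returns ((a1, a2), (p1, p2)) for the last
    stage; solve_nash(R1, R2) returns a mixed equilibrium chosen by a fixed
    shared rule; punish(R1, R2) runs the punishment routine forever.
    All four helpers are hypothetical placeholders."""
    rng = np.random.default_rng(seed)
    R1, R2 = np.zeros((k, k)), np.zeros((k, k))
    # Exploration phase: the players jointly enumerate all k*k joint actions,
    # so under perfect monitoring the full game matrix becomes known to both.
    for i in range(k):
        for j in range(k):
            act(i if my_id == 0 else j)
            (a1, a2), (p1, p2) = observe()
            if (a1, a2) != (i, j):        # the other player did not cooperate
                return punish(R1, R2)     # switch to punishment from now on
            R1[i, j], R2[i, j] = p1, p2
    # Both cooperated: follow the agreed equilibrium of the revealed game.
    # Unilateral deviation from it is irrational by definition of equilibrium.
    pi, rho = solve_nash(R1, R2)
    mine = pi if my_id == 0 else rho
    while True:
        act(rng.choice(k, p=mine))
        observe()
```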

66 If one of the players deviated from the above, we shall call this player the adversary and the other player the agent. [sent-150, score-0.849]

67 Let G be the Rmax-sum game in which the adversary's payoff is identical to his payoff in the original game, and where the agent's payoff is Rmax minus the adversary's payoff. [sent-151, score-1.147]

68 Thus, G is a constant-sum game where the agent's goal is to minimize the adversary's payoff. [sent-153, score-0.48]

69 Notice that some of these payoffs will be unknown (because the adversary did not cooperate in the exploration phase). [sent-154, score-0.332]

70 The agent now plays according to the following algorithm: Initialize: Construct the following model M' of the repeated game M, where the game G is replaced by a game G' in which all the entries in the game matrix are assigned the rewards (Rmax, 0). [sent-155, score-2.28]

71 Observe and update: Following each joint action do as follows: Let a be the action the agent performed and let a' be the adversary's action. [sent-159, score-0.368]

72 Recall that the agent takes its payoff to be complementary to the (observed) adversary's payoff. [sent-161, score-0.369]
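
A minimal sketch of this punishment routine, under the simplifying assumption that the repeated game replays a single matrix game: keep a model of the punishment game in which every entry not yet observed is optimistic for the agent (the adversary gets 0 there, so the agent gets Rmax), play a maximin strategy of the modelled constant-sum game at every stage, and fill in entries as they are observed. The `maximin` helper is the LP sketch given earlier; the rest is illustrative rather than the authors' pseudocode.

```python
import numpy as np

def rmax_punish(k, r_max, act, observe, maximin, rounds, seed=0):
    """Rmax-style punishment sketch: drive the adversary's average payoff
    toward its maximin value in an initially unknown matrix game.
    observe() returns ((a, a_adv), p_adv): the last joint action and the payoff
    obtained by the adversary; the agent's payoff in the punishment game is
    taken to be r_max - p_adv (the complementary, constant-sum payoff)."""
    rng = np.random.default_rng(seed)
    model = np.full((k, k), r_max)      # unknown entries: agent optimistically gets Rmax
    known = np.zeros((k, k), dtype=bool)
    for _ in range(rounds):
        strategy, _ = maximin(model)    # agent's maximin mixed strategy in the model
        strategy = np.clip(strategy, 0.0, None)
        strategy /= strategy.sum()      # guard against tiny LP numerical error
        act(rng.choice(k, p=strategy))
        (a, a_adv), p_adv = observe()
        if not known[a, a_adv]:
            model[a, a_adv] = r_max - p_adv
            known[a, a_adv] = True
```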

73 We can show that the policy profile in which both agents use the ELE algorithm is indeed an ELE. [sent-162, score-0.372]

74 It rests on the ability of the agent to "punish" the adversary quickly, making it irrational for the adversary to deviate from the ELE algorithm. [sent-169, score-0.684]

75 4 Imperfect monitoring In the previous section we discussed the existence of an ELE in the context of the perfect monitoring setup. [sent-170, score-0.396]

76 An interesting question is whether one can go beyond that and show the existence of an ELE in the imperfect monitoring case as well. [sent-172, score-0.321]

77 Theorem 2 There exist classes of games for which an ELE does not exist given imperfect monitoring. [sent-174, score-0.497]

78 The value 0 given to the adversary does not play an important role here. [sent-175, score-0.19]

79 Proof (sketch): We will consider the class of all 2 x 2 games and we will show that an ELE does not exist for this class under imperfect monitoring. [sent-176, score-0.505]

80 [The 2 x 2 payoff matrices of G1 and G2 are not legible in the extracted text and are omitted here.] Notice that the payoffs obtained for a joint action in G1 and G2 are identical for player 1 and are different for player 2. [sent-179, score-0.747]

81 The only equilibrium of G 1 is where both players play the second action, leading to (1,500). [sent-180, score-0.395]

82 The only equilibrium of G2 is where both players play the first action, leading to (6,9). [sent-181, score-0.395]

83 Notice that in order to have an ELE, we must visit the entry (6,9) most of the time if the game is G2 and visit the entry (1,500) most of the time if the game is G1; otherwise, player 1 (resp. [sent-184, score-1.313]

84 player 2) will not obtain a high enough value in G2 (resp. [sent-185, score-0.269]

85 Given the above, it is rational for player 2 to deviate, pretend that the game is always G1, and behave according to what the suggested equilibrium policy tells it to do in that case. [sent-188, score-1.141]

86 Since the game might actually be G1, and player 1 cannot tell the difference, player 2 will be able to lead both players to play the second action most of the time even when the game is G2, increasing its payoff from 9 to 10 and contradicting ELE. [sent-189, score-1.919]

87 A game is called a common-interest game if, for every joint action, all agents receive the same reward. [sent-192, score-1.178]

88 We can show: Theorem 3 Let M_c-i be the class of common-interest repeated games in which the number of actions each agent has is a. [sent-193, score-0.668]

89 There exists an ELE for M_c-i under strict imperfect monitoring. [sent-194, score-0.208]

90 Proof (sketch): The agents use the following algorithm: for m rounds, each agent randomly selects an action. [sent-195, score-0.416]

91 Following this, each agent plays the action that yielded the best reward. [sent-196, score-0.298]

92 Using the Chernoff bound we can choose m that is polynomial in the size of the game (which is a^k, where k is the number of agents) and in 1/δ. [sent-199, score-0.523]

93 Not only does it provably converge in polynomial time, it is also guaranteed, with probability 1 - δ, to converge to the optimal Nash equilibrium of the game rather than to an arbitrary (and possibly non-optimal) Nash equilibrium. [sent-201, score-0.561]
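
The common-interest algorithm is simple enough to write out directly: each agent plays uniformly at random for m rounds, remembers the best reward it has seen together with the action it played in that round, and commits to that action afterwards; because the reward is common, all agents lock onto the same optimal joint action once every joint action has been sampled. The choice of m below uses a coupon-collector-style bound as a stand-in for the paper's Chernoff-bound argument, and the reward function in the usage example is made up.

```python
import math
import numpy as np

def common_interest_ele(a, k, reward, delta, seed=0):
    """Sketch of the common-interest ELE under strict imperfect monitoring.
    a: actions per agent, k: number of agents, reward(joint): common reward.
    Each agent observes only its own action and the (common) reward obtained."""
    rng = np.random.default_rng(seed)
    n_joint = a ** k                    # size of the game
    # Enough uniform rounds so every joint action is sampled with probability
    # at least 1 - delta (union bound: n * (1 - 1/n)^m <= n * exp(-m/n) <= delta).
    m = math.ceil(n_joint * math.log(n_joint / delta)) + 1
    best_reward = [-math.inf] * k
    best_action = [0] * k
    for _ in range(m):
        joint = tuple(int(rng.integers(a)) for _ in range(k))  # independent random actions
        r = reward(joint)
        for i in range(k):              # each agent remembers its best round so far
            if r > best_reward[i]:
                best_reward[i], best_action[i] = r, joint[i]
    return tuple(best_action)           # the joint action each agent plays from now on

# Usage example: 2 agents, 3 actions, common reward of 1 only at the joint action (2, 1).
payoff = {(i, j): float((i, j) == (2, 1)) for i in range(3) for j in range(3)}
print(common_interest_ele(a=3, k=2, reward=lambda jt: payoff[jt], delta=0.05))
```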

94 5 Conclusion We defined the concept of an efficient learning equilibrium - a normative criterion for learning algorithms. [sent-202, score-0.22]

95 We showed that given perfect monitoring a learning algorithm satisfying ELE exists, while this is not the case under imperfect monitoring. [sent-203, score-0.368]

96 A Pareto ELE is similar to a (Nash) ELE, except that the requirement of attaining the expected payoffs of a Nash equilibrium is replaced by that of maximizing social surplus. [sent-205, score-0.416]

97 We show that there exists a Pareto ELE for any perfect monitoring setting, and that a Pareto ELE does not always exist in an imperfect monitoring setting. [sent-206, score-0.524]

98 In the full paper we also extend our discussion from repeated games to infinite horizon stochastic games under the average reward criterion. [sent-207, score-0.748]

99 Predicting how people play games: Reinforcement learning in games with unique strategy equilibrium. [sent-233, score-0.365]

100 Markov games as a framework for multi-agent reinforcement learning. [sent-264, score-0.327]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('game', 0.48), ('ele', 0.389), ('games', 0.282), ('player', 0.269), ('agents', 0.218), ('equilibrium', 0.202), ('agent', 0.198), ('payoff', 0.171), ('players', 0.157), ('adversary', 0.154), ('nash', 0.148), ('imperfect', 0.147), ('monitoring', 0.147), ('irrational', 0.139), ('payoffs', 0.135), ('repeated', 0.102), ('punishment', 0.093), ('efficient', 0.092), ('policy', 0.081), ('action', 0.074), ('profile', 0.073), ('actions', 0.065), ('pareto', 0.062), ('stick', 0.062), ('histories', 0.061), ('definition', 0.056), ('brafman', 0.054), ('economics', 0.054), ('perfect', 0.049), ('normative', 0.049), ('rational', 0.047), ('maximin', 0.046), ('punishing', 0.046), ('rationality', 0.046), ('rmax', 0.046), ('reinforcement', 0.045), ('initially', 0.044), ('polynomial', 0.043), ('deviate', 0.039), ('desired', 0.038), ('reward', 0.038), ('exists', 0.036), ('played', 0.036), ('play', 0.036), ('requirement', 0.036), ('gl', 0.034), ('rewards', 0.034), ('exist', 0.034), ('notice', 0.033), ('deviator', 0.031), ('fudenberg', 0.031), ('gland', 0.031), ('moshe', 0.031), ('threat', 0.031), ('prescribed', 0.031), ('equilibria', 0.029), ('rg', 0.029), ('collaborate', 0.027), ('existence', 0.027), ('plays', 0.026), ('exploration', 0.026), ('context', 0.026), ('ease', 0.025), ('strict', 0.025), ('learning', 0.025), ('deviation', 0.025), ('stochastic', 0.025), ('adopting', 0.024), ('observe', 0.024), ('behave', 0.023), ('exposition', 0.023), ('lexicographic', 0.023), ('visit', 0.023), ('strategy', 0.022), ('fixed', 0.022), ('ui', 0.022), ('deviations', 0.022), ('let', 0.022), ('social', 0.022), ('deviates', 0.022), ('expected', 0.021), ('class', 0.021), ('cooperative', 0.02), ('management', 0.02), ('stems', 0.02), ('converge', 0.019), ('entry', 0.019), ('infinite', 0.019), ('playing', 0.019), ('mainly', 0.018), ('behavior', 0.018), ('algorithms', 0.018), ('selected', 0.017), ('phase', 0.017), ('ai', 0.017), ('sketch', 0.017), ('unknown', 0.017), ('notion', 0.017), ('theorem', 0.017), ('author', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 78 nips-2002-Efficient Learning Equilibrium

Author: Ronen I. Brafman, Moshe Tennenholtz

Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1

2 0.52199703 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exaccerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforfcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm’s parameters are easy to set to meet the convergence conditions.

3 0.32231757 152 nips-2002-Nash Propagation for Loopy Graphical Games

Author: Luis E. Ortiz, Michael Kearns

Abstract: We introduce NashProp, an iterative and local message-passing algorithm for computing Nash equilibria in multi-player games represented by arbitrary undirected graphs. We provide a formal analysis and experimental evidence demonstrating that NashProp performs well on large graphical games with many loops, often converging in just a dozen iterations on graphs with hundreds of nodes. NashProp generalizes the tree algorithm of (Kearns et al. 2001), and can be viewed as similar in spirit to belief propagation in probabilistic inference, and thus complements the recent work of (Vickrey and Koller 2002), who explored a junction tree approach. Thus, as for probabilistic inference, we have at least two promising general-purpose approaches to equilibria computation in graphs.

4 0.19946125 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.

5 0.15085787 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

6 0.10324309 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

7 0.083356515 134 nips-2002-Learning to Take Concurrent Actions

8 0.071558051 185 nips-2002-Speeding up the Parti-Game Algorithm

9 0.05680744 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

10 0.049332112 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

11 0.047576558 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

12 0.04579699 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

13 0.041935917 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

14 0.039231818 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

15 0.036991853 20 nips-2002-Adaptive Caching by Refetching

16 0.035950214 205 nips-2002-Value-Directed Compression of POMDPs

17 0.035800334 199 nips-2002-Timing and Partial Observability in the Dopamine System

18 0.034185879 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

19 0.033883661 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

20 0.030849356 163 nips-2002-Prediction and Semantic Association


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.1), (1, -0.017), (2, -0.324), (3, -0.106), (4, -0.035), (5, -0.131), (6, 0.066), (7, -0.162), (8, -0.136), (9, -0.325), (10, 0.25), (11, 0.355), (12, 0.27), (13, -0.11), (14, -0.001), (15, 0.075), (16, -0.078), (17, 0.036), (18, -0.072), (19, 0.085), (20, 0.106), (21, -0.03), (22, 0.14), (23, 0.063), (24, 0.096), (25, 0.03), (26, -0.036), (27, -0.045), (28, -0.015), (29, 0.002), (30, 0.004), (31, 0.033), (32, -0.05), (33, 0.015), (34, -0.048), (35, -0.021), (36, 0.021), (37, 0.002), (38, -0.013), (39, -0.007), (40, -0.0), (41, -0.016), (42, 0.01), (43, -0.0), (44, -0.002), (45, -0.033), (46, -0.051), (47, 0.019), (48, 0.003), (49, 0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98274821 78 nips-2002-Efficient Learning Equilibrium

Author: Ronen I. Brafman, Moshe Tennenholtz

Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1

2 0.95033687 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exaccerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforfcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm’s parameters are easy to set to meet the convergence conditions.

3 0.81307137 152 nips-2002-Nash Propagation for Loopy Graphical Games

Author: Luis E. Ortiz, Michael Kearns

Abstract: We introduce NashProp, an iterative and local message-passing algorithm for computing Nash equilibria in multi-player games represented by arbitrary undirected graphs. We provide a formal analysis and experimental evidence demonstrating that NashProp performs well on large graphical games with many loops, often converging in just a dozen iterations on graphs with hundreds of nodes. NashProp generalizes the tree algorithm of (Kearns et al. 2001), and can be viewed as similar in spirit to belief propagation in probabilistic inference, and thus complements the recent work of (Vickrey and Koller 2002), who explored a junction tree approach. Thus, as for probabilistic inference, we have at least two promising general-purpose approaches to equilibria computation in graphs.

4 0.5480693 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.

5 0.30193067 185 nips-2002-Speeding up the Parti-Game Algorithm

Author: Maxim Likhachev, Sven Koenig

Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.

6 0.293138 134 nips-2002-Learning to Take Concurrent Actions

7 0.28761491 3 nips-2002-A Convergent Form of Approximate Policy Iteration

8 0.28720677 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

9 0.13635872 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

10 0.12979329 205 nips-2002-Value-Directed Compression of POMDPs

11 0.1288781 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution

12 0.12685885 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

13 0.11694548 163 nips-2002-Prediction and Semantic Association

14 0.115774 167 nips-2002-Rational Kernels

15 0.11551568 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

16 0.10555604 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

17 0.10550541 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

18 0.10373394 199 nips-2002-Timing and Partial Observability in the Dopamine System

19 0.1014838 107 nips-2002-Identity Uncertainty and Citation Matching

20 0.10114864 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.011), (14, 0.01), (23, 0.021), (42, 0.04), (54, 0.061), (55, 0.018), (57, 0.01), (68, 0.01), (74, 0.608), (92, 0.025), (98, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98149735 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning

Author: Stella X. Yu, Ralph Gross, Jianbo Shi

Abstract: Segmentation and recognition have long been treated as two separate processes. We propose a mechanism based on spectral graph partitioning that readily combine the two processes into one. A part-based recognition system detects object patches, supplies their partial segmentations as well as knowledge about the spatial configurations of the object. The goal of patch grouping is to find a set of patches that conform best to the object configuration, while the goal of pixel grouping is to find a set of pixels that have the best low-level feature similarity. Through pixel-patch interactions and between-patch competition encoded in the solution space, these two processes are realized in one joint optimization problem. The globally optimal partition is obtained by solving a constrained eigenvalue problem. We demonstrate that the resulting object segmentation eliminates false positives for the part detection, while overcoming occlusion and weak contours for the low-level edge detection.

2 0.95852977 114 nips-2002-Information Regularization with Partially Labeled Data

Author: Martin Szummer, Tommi S. Jaakkola

Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.

same-paper 3 0.95429039 78 nips-2002-Efficient Learning Equilibrium

Author: Ronen I. Brafman, Moshe Tennenholtz

Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1

4 0.95226103 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity

Author: Terry Elliott, Jörg Kramer

Abstract: A neurotrophic model for the co-development of topography and ocular dominance columns in the primary visual cortex has recently been proposed. In the present work, we test this model by driving it with the output of a pair of neuronal vision sensors stimulated by disparate moving patterns. We show that the temporal correlations in the spike trains generated by the two sensors elicit the development of refined topography and ocular dominance columns, even in the presence of significant amounts of spontaneous activity and fixed-pattern noise in the sensors.

5 0.94470656 90 nips-2002-Feature Selection in Mixture-Based Clustering

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

6 0.88835251 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text

7 0.73692781 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

8 0.69945061 39 nips-2002-Bayesian Image Super-Resolution

9 0.66879058 152 nips-2002-Nash Propagation for Loopy Graphical Games

10 0.66783392 89 nips-2002-Feature Selection by Maximum Marginal Diversity

11 0.6665343 182 nips-2002-Shape Recipes: Scene Representations that Refer to the Image

12 0.66449684 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

13 0.66162413 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

14 0.65530926 74 nips-2002-Dynamic Structure Super-Resolution

15 0.64468753 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

16 0.62936169 173 nips-2002-Recovering Intrinsic Images from a Single Image

17 0.61921763 135 nips-2002-Learning with Multiple Labels

18 0.61040044 2 nips-2002-A Bilinear Model for Sparse Coding

19 0.60956573 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

20 0.60521621 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm