nips nips2001 nips2001-146 knowledge-graph by maker-knowledge-mining

146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning


Source: pdf

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. [sent-5, score-0.37]

2 We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents. [sent-7, score-0.353]

3 Game theorists have begun to examine learning models in their study of repeated games, and reinforcement learning researchers have begun to extend their single-agent learning models to the multiple-agent case. [sent-9, score-0.271]

4 From the game theory perspective, the repeated game is a generalization of the traditional one-shot game, or matrix game. [sent-13, score-0.54]

5 The matrix game is defined as a reward matrix Ri for each player, Ri : A1 × A2 → R, where Ai is the set of actions available to player i. [sent-14, score-0.568]
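
To make this setup concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of a two-player matrix game: reward matrices Ri : A1 × A2 → R, mixed policies as distributions over actions, and the zero-sum condition R1 = −R2. Matching pennies is used, with the convention from later in the text that player 1 gains +1 on a mismatch.

```python
import random

ACTIONS = [0, 1]  # 0 = Heads, 1 = Tails

# Reward matrices indexed as R[a1][a2]; zero-sum means R2 = -R1.
R1 = [[-1, +1],
      [+1, -1]]                        # player 1 wins on a mismatch
R2 = [[-v for v in row] for row in R1]

def expected_reward(R, mu1, mu2):
    """Expected reward under mixed policies mu1, mu2 (distributions over ACTIONS)."""
    return sum(mu1[a1] * mu2[a2] * R[a1][a2] for a1 in ACTIONS for a2 in ACTIONS)

def sample_action(mu):
    """Draw one action from a mixed policy."""
    return random.choices(ACTIONS, weights=mu, k=1)[0]

if __name__ == "__main__":
    uniform = [0.5, 0.5]                                  # mixed equilibrium of matching pennies
    print(expected_reward(R1, uniform, uniform))          # 0.0: equilibrium payoff
    print(expected_reward(R1, [1.0, 0.0], [1.0, 0.0]))    # -1.0: both play Heads (a match)
```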

6 Purely competitive games are called zero-sum games and must satisfy R1 = −R2 . [sent-15, score-0.364]

7 Each player simultaneously chooses to play a particular action ai ∈ Ai , or a mixed policy µi ∈ PD(Ai), which is a probability distribution over the possible actions, and receives reward based on the joint action taken. [sent-16, score-0.647]

8 Some common examples of single-shot matrix games are shown in Figure 1. [sent-17, score-0.203]

9 The traditional assumption is that each player has no prior knowledge about the other player. [sent-18, score-0.205]

10 As is standard in the game theory literature, it is thus reasonable to assume that the opponent is fully rational and chooses actions that are in its best interest. [sent-19, score-0.974]

11 If each player's strategy is a best response to the other players' strategies, then the game is said to be in Nash equilibrium. [sent-23, score-0.23]

12 Once all players are playing according to a Nash equilibrium, no single player has an incentive to unilaterally deviate from his equilibrium policy. [sent-24, score-0.663]

13 Any game can be solved for its Nash equilibria using quadratic programming, and a player can choose an optimal strategy in this fashion, given prior knowledge of the game structure. [sent-25, score-0.786]
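
As a small, hedged illustration of this solution step: for the zero-sum case a linear program is enough to recover an equilibrium strategy and the game value (the general-sum case needs the quadratic-programming formulation the text refers to). The snippet assumes NumPy and SciPy are available; it is not the paper's own solver.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_equilibrium(R1):
    """Equilibrium mixed policy and game value for player 1 in a zero-sum matrix game."""
    R1 = np.asarray(R1, dtype=float)
    n, m = R1.shape
    # Variables z = (x_1, ..., x_n, v); maximize v, i.e. minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every opponent action j: v - sum_i x_i * R1[i, j] <= 0.
    A_ub = np.hstack([-R1.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # The policy x must sum to one.
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

if __name__ == "__main__":
    policy, value = zero_sum_equilibrium([[-1, +1], [+1, -1]])  # matching pennies
    print(policy, value)   # approximately [0.5, 0.5] and 0.0
```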

14 If the players do not manage to coordinate on one equilibrium joint policy, then they may all end up worse off. [sent-27, score-0.362]

15 The Hawk-Dove game shown in Figure 1(c) is a good example of this problem. [sent-28, score-0.23]

16 The two Nash equilibria occur at (1,2) and (2,1), but if the players do not coordinate, they may end up playing a joint action (1,1) and receive 0 reward. [sent-29, score-0.377]

17 Despite these problems, there is general agreement that Nash equilibrium is an appropriate solution concept for one-shot games. [sent-31, score-0.165]

18 In contrast, for repeated games there is a range of different perspectives. [sent-32, score-0.241]

19 Repeated games generalize one-shot games by assuming that the players repeat the matrix game over many time periods. [sent-33, score-0.833]

20 Researchers in reinforcement learning view repeated games as a special case of stochastic, or Markov, games. [sent-34, score-0.309]

21 Researchers in game theory, on the other hand, view repeated games as an extension of their theory of one-shot matrix games. [sent-35, score-0.492]

22 The resulting frameworks are similar, but with a key difference in their treatment of game history. [sent-36, score-0.23]

23 Reinforcement learning researchers focus their attention on choosing a single stationary policy µ that will maximize the learner's expected discounted reward over all future time periods given that we are at time t: max_µ E_µ[ Σ_{τ=t}^{T} γ^{τ−t} R^τ(µ) ], where T may be finite or infinite and µ ∈ PD(A). [sent-37, score-0.39]

24 Littman [1] analyzes this framework for zero-sum games, proving convergence to the Nash equilibrium for his minimax-Q algorithm playing against another minimax-Q agent. [sent-39, score-0.261]

25 Claus and Boutilier [2] examine cooperative games where R1 = R2 , and Hu and Wellman [3] focus on general-sum games. [sent-40, score-0.203]

26 Littman [4] and Hall and Greenwald [5] further extend this approach to consider variants of Nash equilibrium for which convergence can be guaranteed. [sent-42, score-0.165]

27 [7] propose to relax the mutual optimality requirement of Nash equilibrium by considering rational agents, which always learn to play a stationary best-response to their opponent’s strategy, even if the opponent is not playing an equilibrium strategy. [sent-44, score-1.339]

28 The motivation is that it allows our agents to act rationally even if the opponent is not acting rationally because of physical or computational limitations. [sent-45, score-0.765]

29 Fictitious play [8] is a similar algorithm from game theory. [sent-46, score-0.334]

30 As alluded to in the previous section, game theorists often take a more general view of optimality in repeated games. [sent-48, score-0.343]

31 Table 1 (classification of existing algorithms by strategy league H and belief league B): H0 × B0: minimax-Q, Nash-Q; H∞ × B0: Q-learning (Q0), (WoLF-)PHC, fictitious play; H∞ × B1: Q1; H0 × B∞: Bully; H1 × B∞: Godfather; H∞ × B∞: multiplicative-weight* (* assumes public knowledge of the opponent's policy at each period). In the stochastic game model, we took µi ∈ PD(Ai). [sent-51, score-0.518]

32 Moreover, the agent ought to be able to form beliefs about the opponent’s strategy, and these beliefs ought to converge to the opponent’s actual strategy given sufficient learning time. [sent-56, score-0.444]

33 Let βi : H → P D(A−i ) be player i’s belief about the opponent’s strategy. [sent-57, score-0.237]

34 Now we can define the Nash equilibrium of a repeated game in terms of our personal strategy and our beliefs about the opponent. [sent-59, score-0.626]

35 If this holds for all players in the game, then we are guaranteed to be in Nash equilibrium. [sent-61, score-0.197]

36 The sequence of play converges to a Nash equilibrium iff the following two conditions hold: • Optimization: ∀t, µi (ht−1 ) ∈ BRi (βi (ht−1 )). [sent-67, score-0.165]

37 We can never design an agent that will learn to both predict the opponent’s future strategy and optimize over those beliefs at the same time. [sent-72, score-0.28]

38 Despite this fact, if we assume some extra knowledge about the opponent, we can design an algorithm that approximates the best-response stationary policy over time against any opponent. [sent-73, score-0.214]

39 In the game theory literature, this concept is often called universal consistency. [sent-74, score-0.23]

40 Fudenberg and Levine [8] and Freund and Schapire [10] independently show that a multiplicative-weight algorithm exhibits universal consistency from the game theory and machine learning perspectives. [sent-75, score-0.3]

41 2 A new classification and a new algorithm We propose a general classification that categorizes algorithms by the cross-product of their possible strategies and their possible beliefs about the opponent’s strategy, H × B. [sent-78, score-0.174]

42 Given more memory, the agent can formulate more complex policies, since policies are maps from histories to action distributions. [sent-80, score-0.229]

43 H0 agents are memoryless and can only play stationary policies. [sent-81, score-0.271]

44 At the other extreme, H∞ agents have unbounded memory and can formulate ever more complex strategies as the game is played over time. [sent-83, score-0.407]

45 Agents that believe their opponent is memoryless are classified as B0 players; Bt players believe that the opponent bases its strategy on the previous t periods of play, and so forth. [sent-85, score-1.661]

46 Although not explicitly stated, most existing algorithms make assumptions and thus hold beliefs about the types of possible opponents in the world. [sent-86, score-0.235]

47 We can think of each Hs × Bt as a different league of players, with players in each league roughly equal to one another in terms of their capabilities. [sent-87, score-0.387]

48 Clearly some leagues contain less capable players than others. [sent-88, score-0.224]

49 We can thus define a fair opponent as an opponent from an equal or lesser league. [sent-89, score-1.38]

50 Within each league, we assume that players are fully rational in the sense that they can fully use their available histories to construct their future policy. [sent-92, score-0.311]

51 If we believe that our opponent is a memoryless player, then even if we are an H∞ player, our fully rational strategy is to simply model the opponent’s stationary strategy and play our stationary best response. [sent-94, score-1.165]

52 As discussed in the previous section, the problem with these players is that even though they have full access to the history, their fully rational strategy is stationary due to their limited belief set. [sent-100, score-0.433]

53 A general example of an H∞ × B0 player is the policy hill climber (PHC). [sent-101, score-0.352]

54 It maintains a policy and updates the policy based upon its history in an attempt to maximize its rewards. [sent-102, score-0.307]

55 From state s, select action a according to the mixed policy µi (s) with some exploration. [sent-110, score-0.192]
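
The following is a minimal sketch of the PHC update described here for a single-state repeated game, under my own choices of exploration scheme and learning-rate values (alpha, delta, gamma, epsilon are placeholders, not the paper's settings): keep Q-values, then move a small step of size delta toward the action that currently looks best while keeping the policy a valid distribution.

```python
import random

class PHC:
    """Policy hill-climbing sketch for a single-state repeated matrix game."""

    def __init__(self, n_actions, alpha=0.1, delta=0.01, gamma=0.9, epsilon=0.05):
        self.n = n_actions
        self.alpha, self.delta, self.gamma, self.eps = alpha, delta, gamma, epsilon
        self.Q = [0.0] * n_actions
        self.pi = [1.0 / n_actions] * n_actions   # mixed policy mu_i(s)

    def act(self):
        # Select an action according to the mixed policy, with some exploration.
        if random.random() < self.eps:
            return random.randrange(self.n)
        return random.choices(range(self.n), weights=self.pi, k=1)[0]

    def update(self, a, reward):
        # Q-learning update (single state, so the bootstrap term is max_a Q).
        self.Q[a] += self.alpha * (reward + self.gamma * max(self.Q) - self.Q[a])
        # Hill-climb: shift probability mass toward the greedy action by delta.
        greedy = max(range(self.n), key=lambda b: self.Q[b])
        for b in range(self.n):
            step = self.delta if b == greedy else -self.delta / (self.n - 1)
            self.pi[b] = min(1.0, max(0.0, self.pi[b] + step))
        total = sum(self.pi)                      # renormalize to a distribution
        self.pi = [p / total for p in self.pi]
```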

56 True to their league, PHC players play well against stationary opponents. [sent-117, score-0.367]

57 At the opposite end of the spectrum, Littman and Stone [11] propose algorithms in H0 × B∞ and H1 × B∞ that are leader strategies in the sense that they choose a fixed strategy and hope that their opponent will “follow” by learning a best response to that fixed strategy. [sent-118, score-0.83]

58 Their “Bully” algorithm chooses a fixed memoryless stationary policy, while “Godfather” has memory of the last time period. [sent-119, score-0.152]

59 Opponents included normal Q-learning and Q1 players, which are similar to Q-learners except that they explicitly learn using one period of memory because they believe that their opponent is also using memory to learn. [sent-120, score-0.759]

60 The interesting result is that “Godfather” is able to achieve non-stationary equilibria against Q1 in the repeated prisoner’s dilemma game, with rewards for both players that are higher than the stationary Nash equilibrium rewards. [sent-121, score-0.611]

61 Finally, Hu and Wellman’s Nash-Q and Littman’s minimax-Q are classified as H0 × B0 players, because even though they attempt to learn the Nash equilibrium through experience, their play is fixed once this equilibrium has been learned. [sent-125, score-0.434]

62 Furthermore, they assume that the opponent also plays a fixed stationary Nash equilibrium, which they hope is the other half of their own equilibrium strategy. [sent-126, score-0.881]

63 As discussed above, most existing algorithms do not form beliefs about the opponent beyond B0 . [sent-129, score-0.804]

64 We wish to open the door to such possibilities by designing learners that can model the opponent and use that information to achieve better rewards. [sent-131, score-0.67]

65 Ideally we would like to design an algorithm in H∞ × B∞ that is able to win or come to an equilibrium against any fair opponent. [sent-132, score-0.245]

66 Since many of the current algorithms are best-response players, we choose an opponent class such as PHC, which is a good example of a best-response player in H∞ × B0 . [sent-134, score-0.879]

67 We will demonstrate that our algorithm indeed beats its PHC opponents and in fact does well against most of the existing fair opponents. [sent-135, score-0.197]

68 In particular, we would like to model players from H∞ × B0 . [sent-138, score-0.197]

69 Since we are in H∞ × B∞ , it is rational for us to construct such models because we believe that the opponent is learning and adapting to us over time using its history. [sent-139, score-0.776]

70 The idea is that we will “fool” our opponent into thinking that we are stupid by playing a decoy policy for a number of time periods and then switch to a different policy that takes advantage of their best response to our decoy policy. [sent-140, score-1.171]

71 From a learning perspective, the idea is that we adapt much faster than the opponent; in fact, when we switch away from our decoy policy, our adjustment to the new policy is immediate. [sent-141, score-0.19]

72 In contrast, the H∞ × B0 opponent adjusts its policy by small increments and is furthermore unable to model our changing behavior. [sent-142, score-0.777]

73 The opponent never catches on to us because it believes that we only play stationary policies. [sent-144, score-0.82]
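
A rough sketch of the exploit-or-retreat logic implied here, based on my reading of the text rather than the paper's exact pseudocode: if the best response to the windowed estimate of the opponent's policy beats the equilibrium payoff, play it deterministically; otherwise fall back to (an estimate of) the equilibrium strategy while the slowly adapting opponent climbs back.

```python
import random

def best_response(R1, mu2_hat):
    """Pure action maximizing expected reward against the estimated opponent policy."""
    values = [sum(mu2_hat[a2] * R1[a1][a2] for a2 in range(len(mu2_hat)))
              for a1 in range(len(R1))]
    best = max(range(len(R1)), key=lambda a1: values[a1])
    return best, values[best]

def exploiter_action(R1, mu2_hat, equilibrium_policy, equilibrium_value):
    """One decision of a PHC-Exploiter-style agent (sketch, not the paper's algorithm)."""
    action, value = best_response(R1, mu2_hat)
    if value > equilibrium_value:
        return action                             # winning phase: exploit the estimate
    return random.choices(range(len(equilibrium_policy)),
                          weights=equilibrium_policy, k=1)[0]   # losing phase: retreat
```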

74 Bowling and Veloso showed that in self-play, a restricted version of WoLF-PHC always reaches a stationary Nash equilibrium in two-player two-action games, and that the general WoLF-PHC seems to do the same in experimental trials. [sent-146, score-0.231]

75 Thus, in the long run, a WoLF-PHC player achieves its stationary Nash equilibrium payoff against any other PHC player. [sent-147, score-0.479]

76 Observing the action a_t at time t, update our history h and calculate an estimate of the opponent's policy: µ̂−i(s, a) = ( Σ_{τ=t−w}^{t} #(h[τ] = a) ) / w for all a, where w is the window of estimation and #(h[τ] = a) = 1 if the opponent’s action at time τ is equal to a, and 0 otherwise. [sent-150, score-0.177]
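
A direct transcription of this windowed estimate (the data structure is my own choice): count how often each opponent action appears among the last w observed actions and divide by w.

```python
from collections import Counter

def estimate_opponent_policy(history, w, n_actions):
    """history: observed opponent actions, most recent last; returns a frequency estimate."""
    window = list(history)[-w:]
    counts = Counter(window)
    return [counts.get(a, 0) / w for a in range(n_actions)]

# Example with w = 4 and observed actions [0, 1, 1, 0, 1]:
# the last four actions are [1, 1, 0, 1], so the estimate is [0.25, 0.75].
```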

77 Note that we derive both the opponent’s learning rate δ and the opponent’s policy µ̂−i(s) from estimates using the observable history of actions. [sent-159, score-0.202]
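
One plausible way, assumed here rather than taken from the paper, to derive the opponent's hill-climbing step size δ from the same observable history: compare two consecutive windowed policy estimates and average the per-period change.

```python
def estimate_learning_rate(mu_prev, mu_curr, periods_between):
    """Rough estimate of the opponent's policy step size from two windowed estimates."""
    total_change = sum(abs(c - p) for p, c in zip(mu_prev, mu_curr))
    return total_change / (len(mu_curr) * periods_between)

# Example: if the Heads probability drifts from 0.40 to 0.45 over 100 periods,
# estimate_learning_rate([0.40, 0.60], [0.45, 0.55], 100) == 0.0005.
```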

78 If we assume the game matrix is public information, then we can solve for the equilibrium strategy µ*i(s); otherwise we can run WoLF-PHC for some finite number of time periods to obtain an estimate µ̂*i(s) of this equilibrium strategy. [sent-160, score-0.769]

79 The PHC-Exploiter algorithm is based upon PHC and thus exhibits the same behavior as PHC in games with a single pure Nash equilibrium. [sent-163, score-0.203]

80 Both agents generally converge to the single pure equilibrium point. [sent-164, score-0.226]

81 The interesting case arises in competitive games where the only equilibria require mixed strategies, as discussed by Singh et al [12] and Bowling and Veloso [6]. [sent-165, score-0.249]

82 In the full knowledge case where we know our opponent’s policy µ2 and learning rate δ2 at every time period, we can prove that a PHC-Exploiter learning algorithm will guarantee us unbounded reward in the long run playing games such as matching pennies. [sent-168, score-0.596]

83 In the zero-sum game of matching pennies, where the only Nash equilibrium requires the use of mixed strategies, PHC-Exploiter is able to achieve unbounded rewards as t → ∞ against any PHC opponent given that play follows the cycle C defined by the arrowed segments shown in Figure 2. [sent-171, score-1.306]

84 Given a point (x, y) = (µ1(H), µ2(H)) on the graph in Figure 2, where µi(H) is the probability by which player i plays Heads, we know that our expected reward is R1(x, y) = −1 × [xy + (1 − x)(1 − y)] + 1 × [(1 − x)y + x(1 − y)]. [sent-175, score-0.262]
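
A direct transcription of this expected-reward expression, useful as a sanity check (my own snippet, not from the paper): it is zero at the mixed equilibrium and positive whenever player 1 mismatches a biased opponent.

```python
def R1(x, y):
    """Expected reward for player 1 when players 1 and 2 play Heads with probabilities x and y."""
    return -1 * (x * y + (1 - x) * (1 - y)) + 1 * ((1 - x) * y + x * (1 - y))

print(R1(0.5, 0.5))  # 0.0: the equilibrium payoff
print(R1(1.0, 1.0))  # -1.0: both play Heads, a match, player 1 loses
print(R1(0.0, 1.0))  # 1.0: a mismatch, player 1 wins
```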

85 [Figure 2: Action distribution of the two-agent system.] [sent-176, score-0.216]

86 The cyclic play is evident in our empirical results, where we play a PHC-Exploiter player 1 against a PHC player 2. [sent-189, score-0.653]

87 [Figure 3: Total rewards for agent 1 increase as we gain reward through each cycle (total reward vs. time period).] [sent-190, score-0.397]

88 Thus, ∫_C R1(x, y) dt = 1/3 > 0, and we have shown that we receive a payoff greater than the Nash equilibrium payoff of zero over every cycle. [sent-196, score-0.251]

89 We used the PHC-Exploiter algorithm described above to play against several PHC variants in different iterated matrix games, including matching pennies, prisoner’s dilemma, and rock-paper-scissors. [sent-200, score-0.159]

90 Here we give the results for the matching pennies game analyzed above, playing against WoLF-PHC. [sent-201, score-0.428]

91 We used a window of w = 5000 time periods to estimate the opponent’s current policy µ2 and the opponent’s learning rate δ2. [sent-202, score-0.238]

92 As shown in Figure 2, the play exhibits the cyclic nature that we predicted. [sent-203, score-0.16]

93 The two solid vertical lines indicate periods in which our PHC-Exploiter player is winning, and the dashed, roughly diagonal, lines indicate periods in which it is losing. [sent-204, score-0.341]

94 Figure 3 plots the total reward for our PHC-Exploiter agent over time. [sent-209, score-0.165]

95 The periods of winning and losing are very clear from this graph. [sent-210, score-0.177]

96 Against all the existing fair opponents shown in Table 1, it achieved at least its average equilibrium payoff in the long-run. [sent-212, score-0.405]

97 In this paper, we have presented a new classification for multi-agent learning algorithms and suggested an algorithm that seems to dominate existing algorithms from the fair opponent leagues when playing certain games. [sent-215, score-0.959]

98 Ideally, we would like to create an algorithm in the league H∞ × B∞ that provably dominates larger classes of fair opponents in any game. [sent-216, score-0.256]

99 We would like to extend our framework to more general stochastic games with multiple states and multiple players. [sent-218, score-0.182]

100 Markov games as a framework for multi-agent reinforcement learning. [sent-224, score-0.228]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('opponent', 0.65), ('nash', 0.282), ('phc', 0.244), ('game', 0.23), ('player', 0.205), ('players', 0.197), ('games', 0.182), ('equilibrium', 0.165), ('policy', 0.127), ('agent', 0.108), ('play', 0.104), ('playing', 0.096), ('league', 0.095), ('beliefs', 0.094), ('ht', 0.085), ('opponents', 0.081), ('fair', 0.08), ('strategy', 0.078), ('winning', 0.071), ('periods', 0.068), ('bowling', 0.068), ('pennies', 0.068), ('stationary', 0.066), ('agents', 0.061), ('rational', 0.06), ('repeated', 0.059), ('reward', 0.057), ('strategies', 0.056), ('godfather', 0.054), ('veloso', 0.054), ('histories', 0.054), ('history', 0.053), ('heads', 0.05), ('ai', 0.048), ('reinforcement', 0.046), ('economic', 0.043), ('equilibria', 0.043), ('payoff', 0.043), ('classi', 0.041), ('bri', 0.041), ('decoy', 0.041), ('dilemna', 0.041), ('prisoner', 0.041), ('action', 0.041), ('memoryless', 0.04), ('rewards', 0.04), ('littman', 0.038), ('losing', 0.038), ('ri', 0.037), ('existing', 0.036), ('dt', 0.036), ('period', 0.036), ('cyclic', 0.035), ('unbounded', 0.035), ('matching', 0.034), ('actions', 0.034), ('cw', 0.033), ('optimality', 0.033), ('belief', 0.032), ('cl', 0.031), ('argmaxa', 0.03), ('hu', 0.03), ('massachusetts', 0.027), ('begun', 0.027), ('bully', 0.027), ('claus', 0.027), ('ctitious', 0.027), ('fudenburg', 0.027), ('leagues', 0.027), ('multiplicativeweight', 0.027), ('nachbar', 0.027), ('nagayuki', 0.027), ('rationally', 0.027), ('wellman', 0.027), ('policies', 0.026), ('researchers', 0.025), ('memory', 0.025), ('mixed', 0.024), ('algorithms', 0.024), ('cycle', 0.024), ('payoffs', 0.024), ('ought', 0.024), ('believe', 0.023), ('mi', 0.023), ('proceeds', 0.022), ('learning', 0.022), ('theorists', 0.021), ('public', 0.021), ('multiagent', 0.021), ('cooperative', 0.021), ('michael', 0.021), ('time', 0.021), ('exhibits', 0.021), ('ideally', 0.021), ('matrix', 0.021), ('cial', 0.02), ('arti', 0.02), ('learners', 0.02), ('hill', 0.02), ('execute', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.

2 0.27222869 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh

Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1

3 0.22672704 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

4 0.12424953 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.1228475 128 nips-2001-Multiagent Planning with Factored MDPs

Author: Carlos Guestrin, Daphne Koller, Ronald Parr

Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.

6 0.1113672 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.10342961 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

8 0.097037748 121 nips-2001-Model-Free Least-Squares Policy Iteration

9 0.077859454 40 nips-2001-Batch Value Function Approximation via Support Vectors

10 0.072648786 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

11 0.071700715 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

12 0.069963358 59 nips-2001-Direct value-approximation for factored MDPs

13 0.054642349 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

14 0.050215065 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

15 0.043901317 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

16 0.039973684 112 nips-2001-Learning Spike-Based Correlations and Conditional Probabilities in Silicon

17 0.039914906 148 nips-2001-Predictive Representations of State

18 0.039535135 126 nips-2001-Motivated Reinforcement Learning

19 0.03916496 36 nips-2001-Approximate Dynamic Programming via Linear Programming

20 0.037528187 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.129), (1, -0.081), (2, 0.23), (3, 0.04), (4, 0.01), (5, 0.069), (6, -0.052), (7, -0.083), (8, 0.037), (9, 0.003), (10, 0.004), (11, -0.116), (12, -0.035), (13, -0.072), (14, -0.079), (15, -0.012), (16, -0.011), (17, 0.048), (18, -0.115), (19, 0.041), (20, -0.262), (21, 0.106), (22, 0.018), (23, 0.378), (24, 0.075), (25, 0.024), (26, -0.255), (27, -0.022), (28, -0.085), (29, -0.161), (30, -0.061), (31, -0.139), (32, 0.006), (33, 0.02), (34, 0.088), (35, -0.05), (36, -0.003), (37, -0.033), (38, -0.036), (39, 0.004), (40, 0.094), (41, -0.041), (42, 0.06), (43, -0.063), (44, -0.06), (45, -0.032), (46, 0.014), (47, -0.065), (48, -0.033), (49, -0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95896977 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.

2 0.92037386 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh

Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1

3 0.67555493 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

4 0.4584676 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

Author: Jeff Bilmes, Gang Ji, Marina Meila

Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.

5 0.30524194 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

6 0.27376685 13 nips-2001-A Natural Policy Gradient

7 0.25545815 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

8 0.23079905 93 nips-2001-Incremental A*

9 0.22807531 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

10 0.21696475 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.20914152 177 nips-2001-Switch Packet Arbitration via Queue-Learning

12 0.20680846 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

13 0.1877872 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

14 0.18579033 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

15 0.17887475 59 nips-2001-Direct value-approximation for factored MDPs

16 0.17089581 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

17 0.1606212 140 nips-2001-Optimising Synchronisation Times for Mobile Devices

18 0.16028719 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

19 0.15813175 126 nips-2001-Motivated Reinforcement Learning

20 0.15507148 161 nips-2001-Reinforcement Learning with Long Short-Term Memory


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.011), (14, 0.016), (17, 0.019), (19, 0.027), (27, 0.088), (30, 0.057), (38, 0.024), (59, 0.022), (67, 0.327), (72, 0.071), (79, 0.045), (83, 0.093), (91, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8044039 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.

2 0.75230873 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding

Author: Alberto Paccanaro, Geoffrey E. Hinton

Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts ”Alberto has-father Pietro” and ”Pietro has-brother Giovanni”, we would like to be able to infer ”Alberto has-uncle Giovanni”. Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. , and the problem In this case our data consists of triplets we are trying to solve is to infer missing triplets when we are given only few of them. Inferring a triplet is equivalent to being able to complete it, that is to come up with one of its elements, given the other two. Here we shall always try to complete the third element of the triplets 1 . LRE will then represent each concept in the data as a learned vector in a 2 0    £ § ¥ £  § ¥ % 

3 0.64299047 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

Author: B. D. Wright, Kamal Sen, William Bialek, A. J. Doupe

Abstract: In nature, animals encounter high dimensional sensory stimuli that have complex statistical and dynamical structure. Attempts to study the neural coding of these natural signals face challenges both in the selection of the signal ensemble and in the analysis of the resulting neural responses. For zebra finches, naturalistic stimuli can be defined as sounds that they encounter in a colony of conspecific birds. We assembled an ensemble of these sounds by recording groups of 10-40 zebra finches, and then analyzed the response of single neurons in the songbird central auditory area (field L) to continuous playback of long segments from this ensemble. Following methods developed in the fly visual system, we measured the information that spike trains provide about the acoustic stimulus without any assumptions about which features of the stimulus are relevant. Preliminary results indicate that large amounts of information are carried by spike timing, with roughly half of the information accessible only at time resolutions better than 10 ms; additional information is still being revealed as time resolution is improved to 2 ms. Information can be decomposed into that carried by the locking of individual spikes to the stimulus (or modulations of spike rate) vs. that carried by timing in spike patterns. Initial results show that in field L, temporal patterns give at least % extra information. Thus, single central auditory neurons can provide an informative representation of naturalistic sounds, in which spike timing may play a significant role.   

4 0.5073868 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

5 0.50621092 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty. The classifier is implemented as hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

6 0.49646714 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

7 0.49278679 105 nips-2001-Kernel Machines and Boolean Functions

8 0.47633299 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

9 0.47233981 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

10 0.46806625 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

11 0.46731526 13 nips-2001-A Natural Policy Gradient

12 0.46675432 56 nips-2001-Convolution Kernels for Natural Language

13 0.46608278 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

14 0.46309227 121 nips-2001-Model-Free Least-Squares Policy Iteration

15 0.46280515 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

16 0.45849067 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

17 0.45783022 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

18 0.45745718 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

19 0.45718667 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

20 0.45678759 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs