nips nips2003 nips2003-65 knowledge-graph by maker-knowledge-mining

65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems


Source: pdf

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. [sent-4, score-0.203]

2 This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. [sent-5, score-0.967]

3 1 Introduction The question of how agents may adapt their strategic behavior while interacting with other arbitrarily adapting agents is a major challenge in both machine learning and multi-agent systems research. [sent-10, score-0.448]

4 While game theory provides a principled calculation of Nash equilibrium strategies, its practical use is limited by hidden or imperfect state information and by computational intractability. [sent-11, score-0.22]

5 Trial-and-error learning could develop good strategies by trying many actions in a number of environmental states, and observing which actions, in combination with actions of other agents, lead to high cumulative reward. [sent-12, score-0.478]

6 This is highly effective for a single learner in a stationary environment, where algorithms such as Q-Learning [13] are able to learn optimal policies on-line without a model of the environment. [sent-13, score-0.211]

7 Straight off-the-shelf use of RL algorithms such as Q-learning is problematic, however, because: (a) they learn deterministic policies, whereas mixed strategies are generally needed; (b) the environment is generally non-stationary due to adaptation of other agents. [sent-14, score-0.482]

8 These all extend the normal Q-function of state-action pairs Q(s, a) to a function of states and joint actions of all agents, Q(s, a1, ..., an). [sent-19, score-0.183]

9 These include: (1) other agents’ payoffs are fully observable; (2) all agents use the same learning algorithm; (3) during learning, other agents’ strategies are derivable via game-theoretic analysis of the current Q-functions. [sent-21, score-0.564]

10 The multi-agent environment is modeled as a repeated stochastic game in which other agents’ actions are observable, but not their payoffs. [sent-24, score-0.312]

11 Other agents are assumed to learn, but the forms of their learning algorithms are unknown, and their strategies may be asymptotically non-stationary. [sent-25, score-0.435]

12 During learning, it is proposed to estimate other agents’ current strategies from observation instead of game-theoretic analysis. [sent-26, score-0.211]

13 Its key idea is to learn the value of joint mixed strategies, rather than joint base actions. [sent-28, score-0.309]

14 Section 3 discusses the effects of function approximation, exploration, and other agents’ strategy dynamics on Hyper-Q’s convergence. [sent-29, score-0.259]

15 Test results are presented against two recent algorithms for learning mixed strategies: Infinitesimal Gradient Ascent (IGA) [10], and Policy Hill Climbing (PHC) [2]. [sent-32, score-0.245]

16 2 General Hyper-Q formulation. An agent using normal Q-learning in a finite MDP repeatedly observes a state s, chooses a legal action a, and then observes an immediate reward r and a transition to a new state s′. [sent-36, score-0.687]

17 Given a suitable method of exploring state-action pairs, Q-learning is guaranteed to converge to the optimal value function Q∗ , and its associated greedy policy is thus an optimal policy π∗ . [sent-38, score-0.279]

18 The multi-agent generalization of an MDP is called a stochastic game, in which each agent i chooses an action a_i in state s. [sent-39, score-0.304]

19 Payoffs r_i to agent i and state transitions are now functions of joint actions of all agents. [sent-40, score-0.363]

20 An important special class of stochastic games are matrix games, in which |S| = 1 and payoffs are functions only of joint actions. [sent-41, score-0.233]

21 Rather than choosing the best action in a given state, an agent's task in a stochastic game is to choose the best mixed strategy x_i = x_i(s) given the expected mixed strategy x_{-i}(s) of all other agents. [sent-42, score-1.133]

22 Here x_i denotes a set of probabilities summing to 1 for selecting each of the N_i = N_i(s) legal actions in state s. [sent-43, score-0.236]

23 The space of possible mixed strategies is a continuous (N_i − 1)-dimensional unit simplex, and choosing the best mixed strategy is clearly more complex than choosing the best base action. [sent-44, score-1.008]

24 We now consider extensions of Q-learning to stochastic games. [sent-45, score-0.071]

25 For notational simplicity, let x denote the Hyper-Q learner's current mixed strategy, and let y denote an estimated joint mixed strategy of all other agents (hereafter referred to as “opponents”). [sent-48, score-0.923]

26 At time t, the agent generates a base action according to x, and then observes a payoff r, a new state s′, and a new estimated opponent strategy y′. [sent-49, score-0.961]

27 The Hyper-Q function Q(s, y, x) is then adjusted according to: ΔQ(s, y, x) = α(t) [ r + γ max_{x′} Q(s′, y′, x′) − Q(s, y, x) ]   (1). The greedy policy x̂ associated with any Hyper-Q function is then defined by: x̂(s, y) = arg max_x Q(s, y, x)   (2). [sent-50, score-0.127]
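
A minimal tabular sketch of Eqs. (1) and (2) in Python is given below; the dictionary layout, the grid indexing, and the fixed learning-rate and discount values are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def hyper_q_backup(Q, s, y, x, r, s2, y2, n_grid, alpha=0.01, gamma=0.9):
    """One backup of Eq. (1). Q maps (state, opponent-strategy index,
    own-strategy index) to a value; y, x, y2, and the x2 candidates are
    indices into a discretized strategy simplex."""
    best_next = max(Q[(s2, y2, x2)] for x2 in range(n_grid))
    Q[(s, y, x)] += alpha * (r + gamma * best_next - Q[(s, y, x)])

def greedy_strategy_index(Q, s, y, n_grid):
    """Greedy policy of Eq. (2): index of the grid strategy maximizing Q(s, y, .)."""
    return max(range(n_grid), key=lambda x: Q[(s, y, x)])

Q = defaultdict(float)   # unvisited entries default to zero
```

In a matrix game such as Rock-Paper-Scissors there is a single game state, so s collapses to a constant and the table size is governed by the strategy grid alone.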

28 3 Convergence of Hyper-Q Learning; 3.1 Function approximation. Since Hyper-Q is a function of continuous mixed strategies, one would expect it to require some sort of function approximation scheme. [sent-51, score-0.279]

29 Establishing convergence of Q-learning with function approximation is substantially more difficult than for a normal Q-table for a finite MDP, and there are a number of well-known counterexamples. [sent-52, score-0.214]

30 In particular, finite discretization may cause a loss of an MDP’s Markov property [9]. [sent-53, score-0.151]

31 There is at least one discretization scheme, Finite Difference Reinforcement Learning [9], that provably converges to the optimal value function of the underlying continuous MDP. [sent-55, score-0.115]

32 This paper employs a simple uniform grid discretization of the mixed strategies of the Hyper-Q agent and its opponents. [sent-56, score-0.736]

33 However, for certain types of opponent dynamics described below, a plausible conjecture is that a Finite-Difference-RL implementation of Hyper-Q will be provably convergent. [sent-58, score-0.371]

34 3.2 Exploration. Convergence of normal Q-learning requires visiting every state-action pair infinitely often. [sent-60, score-0.117]

35 This will ensure that, for all visited states, every action will be tried infinitely often, but does not guarantee that all states will be visited infinitely often (unless the MDP has an ergodicity property). [sent-65, score-0.328]

36 As a result one would not expect the trained Q function to exactly match the ideal optimal Q∗ for the MDP, although the difference in expected payoffs of the respective policies should be vanishingly small. [sent-66, score-0.23]

37 The use of exploring starts over states and over agent and opponent mixed strategies should guarantee sufficient exploration of the state-action space. [sent-68, score-1.055]

38 Without exploring starts, the agent can use ε-greedy exploration to at least obtain sufficient exploration of its own mixed strategy space. [sent-69, score-0.814]

39 If the opponents also do similar exploration, the situation should be equivalent to normal Qlearning, where some stochastic game states might not be visited infinitely often, but the cost in expected payoff should be vanishingly small. [sent-70, score-0.556]

40 If the opponents do not explore, the effect could be a further reduction in effective state space explored by the Hyper-Q agent (where “effective state” = stochastic game state plus opponent strategy state). [sent-71, score-1.142]

41 Again this should have a negligible effect on the agent’s long-run expected payoff relative to the policy that would have been learned with opponent exploration. [sent-72, score-0.45]

42 3.3 Opponent strategy dynamics. Since opponent strategies can be governed by arbitrarily complicated dynamical rules, it seems unlikely that Hyper-Q learning will converge for arbitrary opponents. [sent-74, score-0.859]

43 Nevertheless, some broad categories can be identified under which convergence should be achievable. [sent-75, score-0.083]

44 One simple example is that of a stationary opponent strategy, i.e., one whose mixed strategy never changes. [sent-76, score-0.378]

45 In this case, the stochastic game obviously reduces to an equivalent MDP with stationary state transitions and stationary payoffs, and with the appropriate conditions on exploration and learning rates, Hyper-Q will clearly converge to the optimal value function. [sent-79, score-0.521]

46 Another important broad class of dynamics consists of opponent strategies that evolve according to a fixed, history-independent rule depending only on themselves and not on actions of the Hyper-Q player. [sent-80, score-0.728]

47 This is a reasonable approximation for many-player games in which any individual has negligible “market impact,” or in which a player’s influence on another player occurs only through a global summarization function [6]. [sent-83, score-0.302]

48 In such cases the relevant population strategy representation need only express global summarizations of activity. [sent-84, score-0.248]

49 An example is the “Replicator Dynamics” model from evolutionary game theory [14], in which a strategy grows or decays in a population according to its fitness relative to the population average fitness. [sent-87, score-0.479]

50 This leads to a history-independent first-order differential equation ẏ = f(y) for the population average strategy. [sent-88, score-0.133]
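
As a rough illustration, one Euler step of the textbook replicator dynamics can be simulated as follows; the payoff matrix A, the step size, and the renormalization guard are assumptions, and the paper does not commit to this exact functional form.

```python
import numpy as np

def replicator_step(y, A, dt=0.01):
    """One Euler step of replicator dynamics: dy_i/dt = y_i * ((A y)_i - y.(A y)).
    A strategy's population share grows when its fitness exceeds the average."""
    fitness = A @ y                    # fitness of each pure strategy
    avg_fitness = y @ fitness          # population average fitness
    y = y + dt * y * (fitness - avg_fitness)
    return y / y.sum()                 # numerical guard to stay on the simplex
```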

51 In such models, the Hyper-Q learner again faces an effective MDP in which the effective state (s, y) undergoes stationary history-independent transitions, so that Hyper-Q should be able to converge. [sent-89, score-0.28]

52 A final interesting class of dynamics occurs when the opponent can accurately estimate the Hyper-Q strategy x, and then adapts its strategy using a fixed history-independent rule: y_{t+1} = f(s, y_t, x_t). [sent-90, score-0.831]

53 This can occur if players are required to announce their mixed strategies, or if the Hyper-Q player voluntarily announces its strategy. [sent-91, score-0.524]

54 An example is the Infinitesimal Gradient Ascent (IGA) model [10], in which the agent uses knowledge of the current strategy pair (x, y) to make a small change in its strategy in the direction of the gradient of immediate payoff P (x, y). [sent-92, score-0.62]
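
A rough sketch of such a gradient step in Rock-Paper-Scissors follows; the expected-payoff convention (y @ R @ x), the Euclidean simplex projection, and the step size are assumptions, since the original IGA analysis treats two-action games.

```python
import numpy as np

R_RPS = np.array([[ 0, -1,  1],    # payoff to the row player: Rock vs (R, P, S)
                  [ 1,  0, -1],    # Paper
                  [-1,  1,  0]])   # Scissors

def iga_step(y, x, R=R_RPS, eta=0.01):
    """Gradient-ascent step for strategy y against opponent strategy x.
    Expected payoff is y @ R @ x, so its gradient with respect to y is R @ x."""
    return project_to_simplex(y + eta * (R @ x))

def project_to_simplex(v):
    """Euclidean projection onto the probability simplex (standard algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)
```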

55 Once again, this type of model reduces to an MDP with stationary history-independent transitions of effective state depending only on (s, y, x). [sent-93, score-0.22]

56 Note that the above claims of reduction to an MDP depend on the Hyper-Q learner being able to accurately estimate the opponent mixed strategy y. [sent-94, score-0.831]

57 Otherwise, the Hyper-Q learner would face a POMDP situation, and standard convergence proofs would not apply. [sent-95, score-0.111]

58 4 Opponent strategy estimation. We now consider estimation of opponent strategies from the history of base actions. [sent-96, score-0.945]

59 One obvious approach is model-based: to consider a class of explicit dynamical models of opponent strategy, and choose the model that best fits the observed data. [sent-99, score-0.356]

60 There are two difficult aspects to this approach: (1) the class of possible dynamical models may need to be extraordinarily large; (2) there is a well-known danger of “infinite regress” of opponent models if A’s model of B attempts to take into account B’s model of A. [sent-100, score-0.387]

61 An alternative approach studied here is model-free strategy estimation. [sent-101, score-0.209]

62 This is in keeping with the spirit of Q-learning, which learns state valuations without explicitly modeling the dynamics of the underlying state transitions. [sent-102, score-0.228]

63 One such model-free estimator, the exponential moving average (EMA), maintains a moving average ȳ of the opponent strategy by updating after each observed action using: ȳ(t + 1) = (1 − µ) ȳ(t) + µ u_a(t)   (3), where u_a(t) is a unit vector representation of the base action a. [sent-104, score-0.775]
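
A minimal sketch of the EMA update of Eq. (3), assuming the estimate is stored as a probability vector and µ is a small constant chosen only for illustration:

```python
import numpy as np

def ema_update(y_bar, action, n_actions, mu=0.005):
    """Eq. (3): exponential moving-average estimate of the opponent's mixed
    strategy; 'action' is the index of the observed base action."""
    u = np.zeros(n_actions)
    u[action] = 1.0                      # unit-vector representation u_a(t)
    return (1.0 - mu) * y_bar + mu * u   # remains a valid probability vector
```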

64 EMA assumes only that recent observations are more informative than older observations, and should give accurate estimates when significant strategy changes take place on time scales > O(1/µ). [sent-105, score-0.209]

65 4.1 Bayesian strategy estimation. A more principled model-free alternative to EMA is now presented. [sent-107, score-0.247]

66 A probability for each y given the history of observed actions H, P(y|H), can then be computed using Bayes' rule as follows: P(y|H) = P(H|y) P(y) / Σ_{y′} P(H|y′) P(y′)   (4), where P(y) is the prior probability of state y, and the sum over y′ extends over all strategy grid points. [sent-111, score-0.531]

67 The conditional probability of the history given the strategy, P(H|y), can now be decomposed into a product of individual action probabilities ∏_{k=0}^{t} P(a(k)|y(t)), assuming conditional independence of the individual actions. [sent-112, score-0.121]

68 If all actions in the history are equally informative regardless of age, we may write P(a(k)|y(t)) = y_{a(k)}(t) for all k. [sent-113, score-0.182]

69 However, it is again reasonable to assume that more recent actions are more informative. [sent-115, score-0.118]

70 Within a normalization factor, we then write: P(H|y) = ∏_{k=0}^{t} w_k y_{a(k)}   (5). A linear schedule w_k = 1 − µ(t − k) for the weights is intuitively obvious; truncation of the history at the most recent 1/µ observations ensures that all weights are positive. [sent-117, score-0.14]
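
A sketch of Eqs. (4) and (5) over a grid of candidate opponent strategies is shown below; the log-space computation, the floor on zero probabilities, and the uniform prior are implementation assumptions.

```python
import numpy as np

def bayes_posterior(grid, actions, mu=0.005, prior=None):
    """P(y|H) over candidate strategies. grid: (G, n_actions) array of mixed
    strategies; actions: observed action indices a(k), oldest first."""
    t = len(actions) - 1
    G = len(grid)
    prior = np.full(G, 1.0 / G) if prior is None else prior
    log_lik = np.zeros(G)
    for k, a in enumerate(actions):
        w = 1.0 - mu * (t - k)           # linear weight schedule w_k of Eq. (5)
        if w <= 0:
            continue                     # truncate to the most recent 1/mu actions
        log_lik += np.log(np.maximum(w * grid[:, a], 1e-12))
    post = prior * np.exp(log_lik - log_lik.max())
    return post / post.sum()             # normalization of Eq. (4)
```

Here `grid` can be the same discretized simplex used for the Hyper-Q table; a generic enumeration is sketched after the next sentence.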

71 A uniform grid discretization of size N = 25 is used to represent mixed-strategy component probabilities, giving a simplex grid of size N(N + 1)/2 = 325 for either player's mixed strategy, and thus the entire Hyper-Q table is of size 325^2 = 105,625. [sent-119, score-0.455]
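
For completeness, a generic enumeration of such a uniform simplex grid is sketched below. Note that this standard enumeration yields (N + 1)(N + 2)/2 points for three actions, so the exact indexing convention behind the 325-point count quoted above evidently differs slightly and is not reproduced here.

```python
import numpy as np
from itertools import product

def simplex_grid(n_actions, N):
    """Mixed strategies whose components are multiples of 1/N and sum to 1."""
    points = []
    for counts in product(range(N + 1), repeat=n_actions - 1):
        if sum(counts) <= N:
            points.append([c / N for c in counts] + [(N - sum(counts)) / N])
    return np.array(points)

grid = simplex_grid(3, 25)   # candidate Rock-Paper-Scissors mixed strategies
```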

72 5.1 Hyper-Q/Bayes formulation. Three different opponent estimation schemes were used with Hyper-Q learning: “Omniscient” (exact knowledge of the opponent's current mixed strategy), EMA (Eq. 3), and Bayesian (Eqs. 4 and 5). [sent-124, score-0.359]

73 Equations 1 and 2 were modified in the Bayesian case to allow for a distribution of opponent states y, with probabilities P (y|H). [sent-129, score-0.354]

74 5.2 Rock-Paper-Scissors results. We first examine Hyper-Q training online against an IGA player. [sent-132, score-0.096]

75 Apart from possible state observability and discretization issues, Hyper-Q should in principle be able to converge against this type of opponent. [sent-133, score-0.203]

76 In order to conform to the original implicit assumptions underlying IGA, the IGA player is allowed to have omniscient knowledge of the Hyper-Q player’s mixed strategy at each time step. [sent-134, score-0.901]

77 Figure 1 shows a smoothed plot of the online Bellman error, and the Hyper-Q player’s average reward per time step, as a function of training time. [sent-136, score-0.328]

78 (Figure 1 caption, continued) Hyper-Q against an IGA player in Rock-Paper-Scissors, using three different opponent state estimation methods: “Omniscient,” “EMA,” and “Bayes,” as indicated. [sent-164, score-0.683]

79 Right plot shows average Hyper-Q reward per time step. [sent-167, score-0.212]

80 Figure 2: Trajectory of the IGA mixed strategy against the Hyper-Q strategy, starting from a single exploring start (x-axis: time steps, 0 to 40,000). [sent-179, score-0.709]

81 The training curves show progress toward convergence, as suggested by substantially reduced Bellman error and substantial positive average reward per time step. [sent-181, score-0.197]

82 Bayes also has by far the worst average reward at the start of learning, but asymptotically it clearly outperforms EMA, and comes close to matching the performance obtained with omniscient knowledge of opponent state. [sent-184, score-0.675]

83 Afterwards, the Hyper-Q policy causes IGA to cycle erratically between two different probabilities for Rock and two different probabilities for Paper, thus preventing IGA from reaching the Nash mixed strategy. [sent-188, score-0.347]

84 Right plot shows average Hyper-Q reward per time step. [sent-214, score-0.212]

85 PHC is a simple adaptive strategy based only on its own actions and rewards. [sent-217, score-0.352]

86 It maintains a Q-table of values for each of its base actions, and at every time step, it adjusts its mixed strategy by a small step towards the greedy policy of its current Q-function. [sent-218, score-0.645]
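
A stateless sketch of that update for a matrix game such as Rock-Paper-Scissors follows; the learning rates and the clip-and-renormalize step used to keep the policy on the simplex are assumptions rather than the exact constraint handling of the original PHC algorithm.

```python
import numpy as np

def phc_step(Q, pi, action, reward, alpha=0.1, delta=0.01):
    """One Policy Hill-Climbing update in a single-state game: a Q-update for
    the chosen base action, then a small step of the mixed strategy pi toward
    the greedy action of the current Q-table."""
    Q[action] += alpha * (reward - Q[action])        # stateless Q-learning update
    greedy = int(np.argmax(Q))
    step = np.full(len(pi), -delta / (len(pi) - 1))  # shrink non-greedy actions
    step[greedy] = delta                             # grow the greedy action
    pi = np.clip(pi + step, 0.0, 1.0)
    return Q, pi / pi.sum()                          # renormalize to a strategy
```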

87 The PHC strategy is history-dependent, so that reduction to an MDP is not possible for the Hyper-Q learner. [sent-219, score-0.209]

88 Given that PHC ignores opponent state, it should be a weak competitive player, and in fact it does much worse in average reward than IGA. [sent-221, score-0.463]

89 It is also interesting to note that Bayesian estimation once again clearly outperforms EMA estimation, and surprisingly, it also outperforms omniscient state knowledge. [sent-222, score-0.339]

90 Left plot uses Omniscient state estimation; right plot uses Bayesian estimation. [sent-248, score-0.171]

91 The average reward plots are uninteresting: as one would expect, each player’s average reward is close to zero. [sent-251, score-0.284]

92 However, it is possible that the players' greedy policies x̂(y) and ŷ(x) simultaneously become stationary, thereby enabling them to optimize against each other. [sent-255, score-0.113]

93 In examining the actual play, it does not converge to the Nash point (1/3, 1/3, 1/3), but it does appear to cycle amongst a small number of grid points with roughly zero average reward over the cycle for both players. [sent-256, score-0.284]

94 With grid discretization it scales badly, but with other function approximators it may become practical. [sent-259, score-0.132]

95 Significant improvements in opponent state estimation should be easy to obtain. [sent-263, score-0.448]

96 Acknowledgements: The author thanks Michael Littman for many helpful discussions; Irina Rish for insights into Bayesian state estimation; and Michael Bowling for assistance in implementing the PHC algorithm. [sent-267, score-0.089]

97 Efficient Nash computation in large population games with bounded influence. [sent-302, score-0.106]

98 Markov games as a framework for multi-agent reinforcement learning. [sent-307, score-0.131]

99 A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. [sent-318, score-0.206]

100 Tree based discretization for continuous state space reinforcement learning. [sent-341, score-0.268]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('opponent', 0.321), ('iga', 0.318), ('phc', 0.254), ('mixed', 0.245), ('player', 0.235), ('ema', 0.233), ('agents', 0.224), ('omniscient', 0.212), ('strategies', 0.211), ('strategy', 0.209), ('bellman', 0.154), ('mdp', 0.134), ('game', 0.131), ('payoffs', 0.129), ('agent', 0.121), ('actions', 0.118), ('reward', 0.112), ('opponents', 0.106), ('nash', 0.101), ('state', 0.089), ('infinitely', 0.085), ('exploration', 0.082), ('discretization', 0.081), ('policy', 0.073), ('online', 0.07), ('finite', 0.07), ('games', 0.067), ('reinforcement', 0.064), ('base', 0.064), ('history', 0.064), ('bowling', 0.064), ('infinitesimal', 0.064), ('bayes', 0.062), ('policies', 0.059), ('action', 0.057), ('stationary', 0.057), ('learner', 0.056), ('payoff', 0.056), ('convergence', 0.055), ('greedy', 0.054), ('grid', 0.051), ('dynamics', 0.05), ('multiagent', 0.047), ('smoothed', 0.046), ('exploring', 0.046), ('players', 0.044), ('observes', 0.044), ('bayesian', 0.044), ('yt', 0.042), ('vanishingly', 0.042), ('plot', 0.041), ('morgan', 0.04), ('play', 0.04), ('population', 0.039), ('effective', 0.039), ('wk', 0.038), ('convergent', 0.038), ('estimation', 0.038), ('stochastic', 0.037), ('littman', 0.037), ('ua', 0.037), ('ya', 0.037), ('dynamical', 0.035), ('transitions', 0.035), ('extensions', 0.034), ('proceedings', 0.034), ('continuous', 0.034), ('visited', 0.034), ('ascent', 0.034), ('kaufman', 0.034), ('qlearning', 0.034), ('converge', 0.033), ('states', 0.033), ('normal', 0.032), ('ni', 0.032), ('evolutionary', 0.031), ('difficult', 0.031), ('cumulative', 0.031), ('average', 0.03), ('sufficient', 0.029), ('legal', 0.029), ('tesauro', 0.029), ('cycle', 0.029), ('per', 0.029), ('broad', 0.028), ('hu', 0.028), ('significantly', 0.028), ('ibm', 0.028), ('hill', 0.028), ('uniform', 0.027), ('transient', 0.027), ('cycling', 0.027), ('environment', 0.026), ('examine', 0.026), ('kaufmann', 0.026), ('substantially', 0.026), ('preliminary', 0.026), ('considerations', 0.026), ('adaptive', 0.025), ('gradient', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

2 0.30754977 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo

Abstract: The so-called “experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A new experts algorithm is presented and analyzed in the context of repeated games. It is shown that asymptotically, under certain conditions, it performs as well as the best available expert. This algorithm is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. 1

3 0.29821751 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

4 0.27147353 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

5 0.20747018 19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unkown-abstract

6 0.164682 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

7 0.147773 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

8 0.14029106 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

9 0.13961048 78 nips-2003-Gaussian Processes in Reinforcement Learning

10 0.1035067 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

11 0.10212676 68 nips-2003-Eye Movements for Reward Maximization

12 0.095042326 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

13 0.094396912 158 nips-2003-Policy Search by Dynamic Programming

14 0.094225936 62 nips-2003-Envelope-based Planning in Relational MDPs

15 0.091768242 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

16 0.091140643 55 nips-2003-Distributed Optimization in Adaptive Networks

17 0.083363347 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

18 0.081841357 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

19 0.073727295 51 nips-2003-Design of Experiments via Information Theory

20 0.068837434 102 nips-2003-Large Scale Online Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.206), (1, 0.385), (2, -0.156), (3, -0.079), (4, 0.008), (5, 0.069), (6, -0.071), (7, 0.084), (8, -0.282), (9, -0.127), (10, 0.051), (11, 0.125), (12, -0.162), (13, -0.048), (14, -0.024), (15, -0.131), (16, 0.007), (17, 0.076), (18, 0.083), (19, -0.018), (20, 0.019), (21, -0.11), (22, -0.095), (23, -0.042), (24, 0.049), (25, -0.053), (26, -0.038), (27, -0.077), (28, 0.017), (29, -0.071), (30, -0.105), (31, -0.071), (32, -0.0), (33, -0.032), (34, 0.045), (35, 0.06), (36, 0.043), (37, 0.01), (38, -0.004), (39, -0.026), (40, 0.063), (41, -0.014), (42, -0.008), (43, 0.023), (44, -0.063), (45, 0.054), (46, 0.022), (47, -0.064), (48, 0.026), (49, -0.122)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95563531 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

2 0.81634218 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

3 0.79666644 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo

Abstract: The so-called “experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A new experts algorithm is presented and analyzed in the context of repeated games. It is shown that asymptotically, under certain conditions, it performs as well as the best available expert. This algorithm is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. 1

4 0.78648704 19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unkown-abstract

5 0.59398973 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

6 0.50671738 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

7 0.42495361 68 nips-2003-Eye Movements for Reward Maximization

8 0.38952932 44 nips-2003-Can We Learn to Beat the Best Stock

9 0.35917714 62 nips-2003-Envelope-based Planning in Relational MDPs

10 0.35305101 146 nips-2003-Online Learning of Non-stationary Sequences

11 0.34129769 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

12 0.3384898 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

13 0.33800322 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

14 0.32498363 78 nips-2003-Gaussian Processes in Reinforcement Learning

15 0.31808776 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems

16 0.31209424 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

17 0.31078687 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

18 0.29347414 158 nips-2003-Policy Search by Dynamic Programming

19 0.29211691 55 nips-2003-Distributed Optimization in Adaptive Networks

20 0.29147354 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.046), (11, 0.018), (29, 0.019), (30, 0.026), (35, 0.055), (53, 0.088), (64, 0.264), (69, 0.02), (71, 0.046), (76, 0.041), (85, 0.056), (91, 0.204), (99, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90673709 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

Author: Saori C. Tanaka, Kenji Doya, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, Shigeto Yamawaki

Abstract: To understand the brain mechanisms involved in reward prediction on different time scales, we developed a Markov decision task that requires prediction of both immediate and future rewards, and analyzed subjects’ brain activities using functional MRI. We estimated the time course of reward prediction and reward prediction error on different time scales from subjects' performance data, and used them as the explanatory variables for SPM analysis. We found topographic maps of different time scales in medial frontal cortex and striatum. The result suggests that different cortico-basal ganglia loops are specialized for reward prediction on different time scales. 1

same-paper 2 0.83244365 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

3 0.67120564 46 nips-2003-Clustering with the Connectivity Kernel

Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann

Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1

4 0.66138452 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

Author: Marc Toussaint

Abstract: We present a connectionist architecture that can learn a model of the relations between perceptions and actions and use this model for behavior planning. State representations are learned with a growing selforganizing layer which is directly coupled to a perception and a motor layer. Knowledge about possible state transitions is encoded in the lateral connectivity. Motor signals modulate this lateral connectivity and a dynamic field on the layer organizes a planning process. All mechanisms are local and adaptation is based on Hebbian ideas. The model is continuous in the action, perception, and time domain.

5 0.66103816 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

6 0.66047806 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

7 0.65537524 87 nips-2003-Identifying Structure across Pre-partitioned Data

8 0.64635718 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

9 0.64333773 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors

10 0.6374985 68 nips-2003-Eye Movements for Reward Maximization

11 0.62840635 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

12 0.62825495 107 nips-2003-Learning Spectral Clustering

13 0.62781972 55 nips-2003-Distributed Optimization in Adaptive Networks

14 0.62750548 73 nips-2003-Feature Selection in Clustering Problems

15 0.62213194 81 nips-2003-Geometric Analysis of Constrained Curves

16 0.62169391 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

17 0.62011921 78 nips-2003-Gaussian Processes in Reinforcement Learning

18 0.61557209 43 nips-2003-Bounded Invariance and the Formation of Place Fields

19 0.61389804 30 nips-2003-Approximability of Probability Distributions

20 0.61374897 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model