nips nips2003 nips2003-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 All learning is local: Multi-agent learning in global reward games Yu-Han Chang MIT CSAIL Cambridge, MA 02139 ychang@csail. [sent-1, score-0.63]
2 We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. [sent-7, score-0.801]
3 We take a different approach in this paper, presenting a simplifying abstraction and a reward filtering technique that allows computationally efficient and robust learning in large multi-agent environments where other methods may fail or become intractable. [sent-11, score-0.537]
4 In many multi-agent settings, our learning agent does not have a full view of the world. [sent-12, score-0.484]
5 Other agents may be far away or otherwise obscured. [sent-13, score-0.343]
6 At the very least, our learning agent usually does not have a complete representation of the internal states of the other agents. [sent-14, score-0.521]
7 This partial observability creates problems when the agent begins to learn about the world, since it cannot see how the other agents are manipulating the environment and thus it cannot ascertain the true world state. [sent-15, score-1.191]
8 A separate problem arises when we train multiple agents using a global reward signal. [sent-17, score-0.892]
9 This is often the case in cooperative games in which all the agents contribute towards attaining some common goal. [sent-18, score-0.396]
10 Even with full observability, the agents would need to overcome a credit assignment problem, since it may be difficult to ascertain which agents were responsible for creating good reward signals. [sent-19, score-1.259]
11 If we cannot even observe what the other agents are doing, how can we begin to reason about their role in obtaining the current reward? [sent-20, score-0.365]
12 Consider an agent in an MDP, learning to maximize a reward that is a function of its observable state and/or actions. [sent-21, score-1.078]
13 The effects of non-stationarity, partial observability, and global rewards can be thought of as replacing the true reward signal with an alternate signal that is a non-stationary function of the original reward. [sent-23, score-1.058]
14 This causes problems for an agent that is trying to use the collective “global” reward signal to learn an optimal policy. [sent-25, score-1.203]
15 Ideally the agent can recover the original “personal reward signal” and learn using that signal rather than the global reward signal. [sent-26, score-1.671]
16 Many external sources of reward (which could be regarded as noise) can be modeled as or approximated by a random Markov process, so this technique promises broad applicability. [sent-29, score-0.528]
17 This approach is more robust than trying to learn directly from the global reward, allowing agents to learn and converge faster to an optimal or near-optimal policy. [sent-30, score-0.629]
18 When the environment switches between modes, all the rewards may be altered. [sent-37, score-0.279]
19 For variation produced by the actions of other agents in the world, or for truly unobservable environmental changes, this technique would not work as well. [sent-39, score-0.444]
20 Rather than filtering the rewards as we will do, Ng et al. [sent-43, score-0.28]
21 [1999] show that a potential function can be used to shape the rewards without affecting the learned policy while possibly speeding up convergence. [sent-44, score-0.424]
22 McMahan et al. [2003] discuss learning in the scenario in which the opponent gets to choose the agent’s reward function. [sent-50, score-0.508]
23 The innovative aspect of our approach is to consider the reward signal as merely a signal that is correlated with our true learning signal. [sent-51, score-0.75]
24 We propose a model that captures the relationship between the true reward and the noisy rewards in a wide range of problems. [sent-52, score-0.774]
25 Thus, without assuming much additional domain knowledge, we can use filtering methods to recover the underlying true reward signal from the noisy observed global rewards. [sent-53, score-0.796]
26 3 Mathematical model The agent assumes that the world possesses one or more unobservable state variables that affect the global reward signal. [sent-54, score-1.255]
27 These unobservable states may include the presence of other agents or changes in the environment. [sent-55, score-0.429]
28 Each agent models the effect of these unobservable state variables on the global reward as an additive noise process b_t that evolves according to b_{t+1} = b_t + z_t, where z_t is a zero-mean Gaussian random variable with variance σ_w. [sent-56, score-1.527]
29 The global reward that it observes if it is in state i at time t is g_t = r(i) + b_t, where r is a vector containing the ideal training rewards r(i) received by the agent at state i. [sent-57, score-1.641]
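As a concrete illustration of this corruption process (a minimal sketch; the reward values, variance, and the random state-visiting scheme below are invented for illustration and are not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

sigma_w = 1.0                       # variance of the random-walk increments z_t (illustrative)
r = np.array([0.0, 20.0, 10.0])     # hypothetical true per-state rewards r(i)

b = 0.0                             # unobservable additive noise term b_t
for t in range(5):
    i = rng.integers(len(r))        # state visited at time t (chosen at random here)
    g = r[i] + b                    # observed global reward: g_t = r(i) + b_t
    print(f"t={t}  state={i}  r(i)={r[i]:5.1f}  g_t={g:7.2f}")
    b += rng.normal(0.0, np.sqrt(sigma_w))   # b_{t+1} = b_t + z_t, z_t ~ N(0, sigma_w)
```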
30 The standard model that describes such a linear system is: g_t = C x_t + v_t, v_t ∼ N(0, Σ_2); x_t = A x_{t−1} + w_t, w_t ∼ N(0, Σ_1). In our case, we desire estimates of x_t = [r_t b_t]^T. [sent-58, score-0.391]
31 In our case, we set Σ_2 = 0, since we assume no observation noise when we experience rewards; Σ_1(j, j) = 0 for j ≠ |S| + 1, since the rewards are fixed and do not evolve over time; and Σ_1(|S| + 1, |S| + 1) = σ_w, since the noise term evolves with variance σ_w. [sent-60, score-0.546]
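A minimal sketch of how these matrices could be assembled for this specialization (the function names are ours; nothing here goes beyond what the text states):

```python
import numpy as np

def make_model(n_states: int, sigma_w: float):
    """Build A, Sigma_1, Sigma_2 for the state vector x_t = [r(1..|S|), b_t]."""
    n = n_states + 1
    A = np.eye(n)                  # rewards are fixed; b_t follows a random walk
    Sigma1 = np.zeros((n, n))      # process noise is zero for the reward entries...
    Sigma1[-1, -1] = sigma_w       # ...and sigma_w for the evolving noise term b_t
    Sigma2 = 0.0                   # no observation noise in the basic model
    return A, Sigma1, Sigma2

def observation_row(n_states: int, i: int) -> np.ndarray:
    """Row C for a visit to state i: the observation is g_t = r(i) + b_t."""
    C = np.zeros(n_states + 1)
    C[i] = 1.0                     # picks out r(i)
    C[-1] = 1.0                    # adds the noise term b_t
    return C
```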
32 The agent applies the following causal Kalman filtering equations at each time step to obtain maximum likelihood estimates for b and the individual rewards r(i) for each state i given all previous observations. [sent-69, score-0.852]
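The numbered equations 1-5 are not reproduced in this extract, so the following is a standard causal Kalman predict/update step for the model above rather than a transcription of the paper's equations; it assumes the make_model and observation_row helpers sketched just above:

```python
import numpy as np

def kalman_step(x_hat, P, g, C, A, Sigma1, Sigma2):
    """One causal Kalman update after observing global reward g at the visited state.

    x_hat is the current estimate of [r(1..|S|), b]; P is its error covariance;
    C is the observation row for the visited state.
    """
    # Predict: rewards stay put, the noise term b drifts (random walk).
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Sigma1

    # Update: fold in the scalar observation g = C x + v.
    S = C @ P_pred @ C + Sigma2            # innovation variance (scalar)
    K = P_pred @ C / S                     # Kalman gain (one column)
    x_new = x_pred + K * (g - C @ x_pred)  # corrected state estimate
    P_new = P_pred - np.outer(K, C) @ P_pred
    return x_new, P_new
```

Because Σ_2 = 0 in the basic model, P should be initialized to something positive (e.g. the identity) so the innovation variance S stays nonzero.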
33 For the learning mechanism, we use a simple tabular Q-learning algorithm [Sutton and Barto, 1999], since we wish to focus our attention on the reward signal problem. [sent-76, score-0.616]
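For reference, the tabular update the agent would then apply, with the filtered estimate r̂(i) standing in for the raw global reward (standard Q-learning; α and γ denote the usual learning rate and discount factor and are not specified in this extract):

```latex
Q(i, a) \leftarrow Q(i, a) + \alpha \Big[ \hat{r}(i) + \gamma \max_{a'} Q(j, a') - Q(i, a) \Big]
```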
34 Actions move the agent N, S, E, or W, except in states 6 and 16, where any action takes the agent to states 10 and 18, respectively, as shown by the curved arrows in the figure at left. [sent-90, score-1.123]
35 The optimal policy is shown at center, where multiple arrows at one state denotes indifference between the possibilities. [sent-91, score-0.271]
36 A policy learned by our filtering agent is shown at right. [sent-92, score-0.641]
37 4 The filtering learning agent Like any good student, the filtering learning agent chooses to accept well-deserved praise from its teacher and ignore over-effusive rewards. [sent-93, score-0.993]
38 The question remains: How does an agent decide upon the relevance of the rewards it sees? [sent-95, score-0.699]
39 Using observations from previous states and actions, an agent can approach this question from two perspectives. [sent-97, score-0.495]
40 In the first, each time the agent visits a particular state i, it should gain a better sense of the evolution of the random variable b between its last visit and its current visit. [sent-98, score-0.62]
41 Secondly, given an estimate of b_t upon visiting state i at time t, it has a better idea of the value of b_{t+1} when it visits state i at time t + 1, since we assume b_t evolves slowly over time. [sent-100, score-0.546]
42 From initial state i_0, take some action a, transition to state i, and receive reward signal g_0. [sent-103, score-0.792]
43 Perform a Kalman update using equations 1-5 to compute the current vector of estimates x̂, which includes a component that is the reward estimate r̂(i_0), which will simply equal g_0 this time. [sent-106, score-0.558]
44 From the current state i at time t, take another action with some mix of exploration and exploitation; transition to state j, receiving reward signal g_t. [sent-108, score-0.894]
45 Perform a Kalman update using equations 1-5 to compute the current vector of estimates x̂, which includes a component that is the reward estimate r̂(i). [sent-111, score-0.558]
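Putting the steps together, here is a compact sketch of the filtering learning agent's main loop. It reuses the make_model, observation_row, and kalman_step helpers sketched above; the environment interface (reset/step), the epsilon-greedy exploration, and the parameter values are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def run_filtering_agent(env, n_states, n_actions, sigma_w=1.0,
                        alpha=0.1, gamma=0.9, epsilon=0.1, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    A, Sigma1, Sigma2 = make_model(n_states, sigma_w)
    x_hat = np.zeros(n_states + 1)     # estimates of [r(1..|S|), b]
    P = np.eye(n_states + 1)           # prior uncertainty over those estimates
    Q = np.zeros((n_states, n_actions))

    i = env.reset()
    for _ in range(steps):
        # Some mix of exploration and exploitation.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[i]))
        j, g = env.step(a)             # transition and the observed *global* reward

        # Kalman update: refresh the reward estimates and the noise estimate b.
        C = observation_row(n_states, i)
        x_hat, P = kalman_step(x_hat, P, g, C, A, Sigma1, Sigma2)
        r_hat = x_hat[i]               # filtered personal reward for state i

        # Q-learning on the filtered reward rather than the noisy global signal.
        Q[i, a] += alpha * (r_hat + gamma * np.max(Q[j]) - Q[i, a])
        i = j
    return Q, x_hat
```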
46 Figure 2: (Left) As the agent is attempting to learn, the reward signal value (y-axis) changes dramatically over time (x-axis) due to the noise term. [sent-129, score-1.159]
47 While the true range of rewards in this grid world domain only falls between 0 and 20, the noisy reward signal ranges from -10 to 250, as shown in the graph at left. [sent-130, score-1.185]
48 (Center) Given this noisy signal, the filtering agent is still able to learn the true underlying rewards, converging to the correct relative values over time, as shown in the middle graph. [sent-131, score-0.615]
49 (Right) The filtering learning agent (bold line) accrues higher rewards over time than the ordinary Q-learner (thin line), since it is able to converge to an optimal policy whereas the non-filtering Q-learner remains confused. [sent-132, score-1.018]
50 However, since our model assumes the agent only has access to a limited, local observation space within the true global state space, this computation remains feasible. [sent-135, score-0.661]
51 5 Empirical results If the world dynamics exactly match the linear model we provide to the Kalman filter, then this method will provably converge to the correct reward value estimates and find the optimal policy under conditions similar to those guaranteeing Q-learning’s eventual convergence. [sent-136, score-0.891]
52 The interesting question concerns situations in which the actual dynamics are clearly different from our model, and whether our filtering agent will still learn a good policy. [sent-138, score-0.59]
53 The agent is able to move North, South, East, or West, and most transitions give the agent zero reward, except all actions from state 6 move the agent directly to state 10 with a reward of 20, and all actions from state 16 move the agent directly to state 18 with a reward of 10. [sent-141, score-3.413]
54 Bumping into a wall costs the agent -1 in reward and does not move it. [sent-142, score-1.453]
55 That is, in each time step, a single agent receives its true reward plus some noise term that evolves as a Markov random process. [sent-146, score-1.163]
56 To achieve this, we simply add a noise term to the grid world domain given in Figure 1. [sent-147, score-0.384]
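For completeness, a sketch of one such noisy grid world that can be plugged into the run_filtering_agent loop above. The 5x5 layout and column-wise state numbering are assumptions chosen to be consistent with the special states mentioned in Figure 1 (6 -> 10 with reward 20, 16 -> 18 with reward 10, -1 for wall bumps); they are not given explicitly in this extract:

```python
import numpy as np

class NoisyGridWorld:
    """Grid world of Figure 1 plus the additive random-walk noise term b_t."""

    def __init__(self, sigma_w=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.sigma_w = sigma_w
        self.b = 0.0
        self.special = {6: (10, 20.0), 16: (18, 10.0)}   # state -> (target, reward)
        self.state = 1                                    # states numbered 1..25

    def reset(self):
        self.state = 1
        return self.state - 1                             # 0-indexed for the learner

    def step(self, action):                               # 0=N, 1=S, 2=E, 3=W
        s = self.state
        if s in self.special:                             # any action triggers the jump
            nxt, r = self.special[s]
        else:
            col, row = divmod(s - 1, 5)                   # assumed column-wise numbering
            drow, dcol = [(-1, 0), (1, 0), (0, 1), (0, -1)][action]
            nrow, ncol = row + drow, col + dcol
            if 0 <= nrow < 5 and 0 <= ncol < 5:
                nxt, r = ncol * 5 + nrow + 1, 0.0
            else:
                nxt, r = s, -1.0                          # bumped into a wall
        g = r + self.b                                    # observed global reward
        self.b += self.rng.normal(0.0, np.sqrt(self.sigma_w))
        self.state = nxt
        return nxt - 1, g
```

For example, Q, x_hat = run_filtering_agent(NoisyGridWorld(), n_states=25, n_actions=4) would run the filtering learner on this domain.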
57 As shown in Figure 2, an agent acting in this domain will receive a large range of reward values due to the evolving noise term. [sent-148, score-1.114]
58 In the example given, sometimes this value ranges as high as 250 even though the maximum reward in the grid world is only 20. [sent-149, score-0.67]
59 Figure 3: (Left) Filtering agents are able to distinguish their personal rewards from the global reward noise, and thus able to learn optimal policies and maximize their average reward over time in a ten-agent grid-world domain. [sent-159, score-1.81]
60 (Right) In contrast, ordinary Q-learning agents do not process the global reward signal and can become confused as the environment changes around them. [sent-160, score-1.083]
61 Graphs show average rewards (y-axis) within 1000-period windows for each of the 10 agents in a typical run of 10000 time periods (x-axis). [sent-161, score-0.609]
62 The noise term alone contributes 230 to the reward signal! [sent-162, score-0.59]
63 A standard Q-learning agent does not stand a chance at learning anything useful using this reward signal. [sent-163, score-0.966]
64 However, the filtering agent can recover the true reward signal from this noisy signal and use that to learn. [sent-164, score-1.207]
65 Figure 2 shows that the filtering agent can learn the underlying reward signals, converging to these values relatively quickly. [sent-165, score-1.046]
66 The observant reader may note that the learned rewards do not match the true rewards specified by the grid world. [sent-167, score-0.615]
67 Instead of mostly 0 rewards at each state, the agent has concluded that most states produce a reward of -4. [sent-169, score-1.218]
68 Correspondingly, state 6 now produces a reward of about 16 instead of 20. [sent-170, score-0.569]
69 Depending on the initial guesses of our algorithm, the estimates for the rewards may be biased. [sent-173, score-0.314]
70 If most of the initial guesses for the rewards underestimated the true reward, then the learned value will be correspondingly lower than the actual true value. [sent-174, score-0.395]
71 To further test our filtering technique, we next evaluate its performance in a domain that does not conform to our noise model perfectly, but which is still a single agent system. [sent-176, score-0.632]
72 Instead of an external reward term that evolves according to a Gaussian noise process, we adjust the noise manually, introducing positive and negative swings in the reward signal values at arbitrary times. [sent-177, score-1.352]
73 The most interesting case occurs when the domain noise is actually caused by other agents learning in the environment. [sent-179, score-0.543]
74 If there are enough other agents in the world, then the noise they collectively generate may actually tend towards Gaussian noise. [sent-181, score-0.429]
75 Here we focus on smaller cases where there are 6 or 10 agents operating in the environment. [sent-182, score-0.343]
76 We modify the grid world domain to include multiple simultaneously-acting agents, whose actions do not interfere with each other, but whose reward signal now consists of the sum of all the agents’ personal rewards, as given in the basic single agent grid world of Figure 1. [sent-183, score-1.628]
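A brief sketch of how the shared signal could be formed in this multi-agent variant (hypothetical helper; it assumes one NoisyGridWorld copy per agent with sigma_w=0, so that the only "noise" any learner sees is the other agents' personal rewards):

```python
def multiagent_step(envs, actions):
    """Each agent acts in its own copy of the grid world; every agent then
    observes the same global reward: the sum of all personal rewards."""
    results = [env.step(a) for env, a in zip(envs, actions)]
    next_states = [s for s, _ in results]
    g = sum(r for _, r in results)
    return next_states, g
```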
77 (Right) Graph shows average rewards (y-axis) in 1000-period windows as filtering (bold line) and ordinary (thin line) agents try to learn good policies for acting as network nodes. [sent-189, score-0.728]
78 The filtering agent is able to learn a better policy, resulting in higher network performance (global reward). [sent-190, score-0.532]
79 Graph shows the average for each type of agent over 10 trial runs of 100000 time periods (x-axis) each. [sent-191, score-0.483]
80 Three of the 10 agents converge to a suboptimal policy that produces slightly lower average rewards. [sent-193, score-0.534]
81 However, this artifact is largely due to our choice of exploration rate, rather than a large error in the estimated reward values. [sent-194, score-0.482]
82 Approximately half of the agents find the optimal policy, while the other half are still exploring and learning. [sent-196, score-0.375]
83 An interesting phenomenon occurs when these other agents finally find the optimal policy and begin receiving higher rewards. [sent-197, score-0.549]
84 Suddenly the performance drops drastically for the agents who had found the optimal policy first. [sent-198, score-0.527]
85 When the other agents learn an optimal policy, they begin affecting the global reward, contributing some positive amount rather than a consistent zero. [sent-200, score-0.538]
86 This changes the world dynamics for the agents who had already learned the optimal policy and causes them to “unlearn” their good behavior. [sent-201, score-0.728]
87 The unstable dynamics of the Q-learners could be avoided if the agents had full observability and we could learn using the joint actions of all the agents, as in the work of Claus and Boutilier [1998]. [sent-202, score-0.502]
88 However, since our premise is that agents have only a limited view of the world, the Q-learning agents will only exhibit convergence to the optimal policy if they converge to the optimal policy simultaneously. [sent-203, score-1.117]
89 This may take a prohibitively long time, especially as the number of agents grows. [sent-204, score-0.343]
90 Mobilized ad-hoc networking provides an interesting real-world environment that illustrates the importance of reward filtering due to its high degree of partial observability and a reward signal that depends on the global state. [sent-206, score-1.328]
91 The nodes are limited to having only local knowledge of their immediate neighboring grid locations (rather than the numbered state locations as in the original grid world), and thus do not know their absolute location on the grid. [sent-212, score-0.332]
92 They are trained using a global reward signal that is a measure of total network performance, and their actions are limited functions that map their local state to N, S, E, W movements. [sent-213, score-0.82]
93 Source nodes move around randomly, and in our example here, there are two sources and eight mobile agent nodes in a 4x4 grid. [sent-216, score-0.662]
94 This setup is shown in Figure 4, and the graph shows a comparison of an ordinary Q-learner and the filtering learner, plotting the increase in global rewards over time as the agents learn to perform their task as intermediate network nodes. [sent-217, score-0.822]
95 In all the above work, we have assumed that the reward signal is deterministic – each state-action pair produces only a single reward value. [sent-221, score-1.1]
96 There are some domains in which we’d like to model the reward as being stochastic, such as the multi-armed bandit problem. [sent-222, score-0.535]
97 However, allowing some variance would give the model an observation noise term that could reflect the stochasticity of the reward signal. [sent-225, score-0.642]
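In the notation of the mathematical model above, that would amount to keeping the same linear system but allowing a nonzero observation variance (σ_r below is our own symbol for the reward's intrinsic variance):

```latex
g_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, \Sigma_2), \qquad \Sigma_2 = \sigma_r^2 > 0
```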
98 However, since we cannot guarantee an exact estimate of the reward values when the model is not an exact representation of the world, the agent may make the wrong policy decision sometimes. [sent-227, score-1.092]
99 In the majority of cases, the estimates are good enough to lead the agent to learning a good policy. [sent-229, score-0.575]
100 Policy invariance under reward transformations: theory and application to reward shaping. [sent-274, score-0.964]
wordName wordTfidf (topN-words)
[('reward', 0.482), ('agent', 0.458), ('agents', 0.343), ('ltering', 0.241), ('rewards', 0.241), ('kalman', 0.202), ('policy', 0.152), ('bt', 0.117), ('world', 0.112), ('signal', 0.108), ('domain', 0.088), ('state', 0.087), ('noise', 0.086), ('gt', 0.077), ('observability', 0.077), ('grid', 0.076), ('learn', 0.074), ('global', 0.067), ('personal', 0.064), ('evolves', 0.064), ('szita', 0.056), ('move', 0.055), ('barto', 0.055), ('actions', 0.052), ('unobservable', 0.049), ('collective', 0.049), ('chang', 0.049), ('claus', 0.048), ('mcmahan', 0.048), ('networking', 0.048), ('wolpert', 0.048), ('reinforcement', 0.048), ('xt', 0.048), ('sutton', 0.048), ('mobile', 0.045), ('pt', 0.045), ('ordinary', 0.045), ('choi', 0.044), ('auer', 0.041), ('estimates', 0.041), ('fairly', 0.04), ('nodes', 0.04), ('et', 0.039), ('converge', 0.039), ('correspondingly', 0.039), ('environment', 0.038), ('states', 0.037), ('ascertain', 0.037), ('craft', 0.037), ('mobilized', 0.037), ('tumer', 0.037), ('update', 0.035), ('kt', 0.035), ('qt', 0.034), ('dynamics', 0.033), ('optimal', 0.032), ('converging', 0.032), ('csail', 0.032), ('guesses', 0.032), ('learned', 0.031), ('wt', 0.03), ('environments', 0.029), ('games', 0.029), ('perfectly', 0.029), ('learner', 0.029), ('mdp', 0.029), ('credit', 0.029), ('numbered', 0.029), ('stochasticity', 0.029), ('action', 0.028), ('bandit', 0.027), ('graph', 0.027), ('modes', 0.026), ('discount', 0.026), ('true', 0.026), ('learning', 0.026), ('partial', 0.026), ('student', 0.026), ('visit', 0.026), ('domains', 0.026), ('noisy', 0.025), ('observable', 0.025), ('good', 0.025), ('time', 0.025), ('visits', 0.024), ('ho', 0.024), ('approached', 0.024), ('cooperative', 0.024), ('evolve', 0.024), ('sources', 0.024), ('lter', 0.024), ('limited', 0.024), ('markov', 0.024), ('boutilier', 0.023), ('observation', 0.023), ('ng', 0.023), ('bold', 0.022), ('external', 0.022), ('term', 0.022), ('begin', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
2 0.34856591 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura
Abstract: When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1
3 0.34452319 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting (where the agents only observe their own payoffs and the joint actions). 1
4 0.31202704 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
Author: Saori C. Tanaka, Kenji Doya, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, Shigeto Yamawaki
Abstract: To understand the brain mechanisms involved in reward prediction on different time scales, we developed a Markov decision task that requires prediction of both immediate and future rewards, and analyzed subjects’ brain activities using functional MRI. We estimated the time course of reward prediction and reward prediction error on different time scales from subjects' performance data, and used them as the explanatory variables for SPM analysis. We found topographic maps of different time scales in medial frontal cortex and striatum. The result suggests that different cortico-basal ganglia loops are specialized for reward prediction on different time scales. 1
5 0.27291846 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
Author: David C. Parkes, Satinder P. Singh
Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total longterm value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1
6 0.27147353 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
7 0.22094963 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.21307616 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
9 0.21277717 68 nips-2003-Eye Movements for Reward Maximization
10 0.16962156 55 nips-2003-Distributed Optimization in Adaptive Networks
11 0.1557364 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?
12 0.14301601 158 nips-2003-Policy Search by Dynamic Programming
13 0.13990283 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
14 0.13861027 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
15 0.13144247 62 nips-2003-Envelope-based Planning in Relational MDPs
16 0.11334947 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
17 0.10703656 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
18 0.094549179 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
19 0.088874787 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
20 0.084054537 42 nips-2003-Bounded Finite State Controllers
topicId topicWeight
[(0, -0.252), (1, 0.534), (2, -0.16), (3, -0.089), (4, -0.057), (5, 0.096), (6, -0.007), (7, 0.086), (8, -0.171), (9, -0.165), (10, 0.097), (11, 0.116), (12, -0.079), (13, -0.163), (14, -0.04), (15, 0.013), (16, 0.026), (17, -0.009), (18, -0.099), (19, 0.014), (20, 0.052), (21, 0.129), (22, 0.043), (23, 0.06), (24, 0.039), (25, 0.085), (26, 0.09), (27, 0.106), (28, -0.01), (29, 0.094), (30, 0.051), (31, 0.075), (32, -0.02), (33, 0.112), (34, -0.06), (35, -0.023), (36, -0.003), (37, 0.024), (38, 0.041), (39, 0.032), (40, -0.04), (41, -0.023), (42, 0.063), (43, 0.007), (44, -0.026), (45, 0.022), (46, -0.002), (47, 0.006), (48, -0.014), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.97889978 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
2 0.78865403 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting (where the agents only observe their own payoffs and the joint actions). 1
3 0.74079835 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
Author: Saori C. Tanaka, Kenji Doya, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, Shigeto Yamawaki
Abstract: To understand the brain mechanisms involved in reward prediction on different time scales, we developed a Markov decision task that requires prediction of both immediate and future rewards, and analyzed subjects’ brain activities using functional MRI. We estimated the time course of reward prediction and reward prediction error on different time scales from subjects' performance data, and used them as the explanatory variables for SPM analysis. We found topographic maps of different time scales in medial frontal cortex and striatum. The result suggests that different cortico-basal ganglia loops are specialized for reward prediction on different time scales. 1 Intro du ction In our daily life, we make decisions based on the prediction of rewards on different time scales; immediate and long-term effects of an action are often in conflict, and biased evaluation of immediate or future outcome can lead to pathetic behaviors. Lesions in the central serotonergic system result in impulsive behaviors in humans [1], and animals [2, 3], which can be attributed to deficits in reward prediction on a long time scale. Damages in the ventral part of medial frontal cortex (MFC) also cause deficits in decision-making that requires assessment of future outcomes [4-6]. A possible mechanism underlying these observations is that different brain areas are specialized for reward prediction on different time scales, and that the ascending serotonergic system activates those specialized for predictions in longer time scales [7]. The theoretical framework of temporal difference (TD) learning [8] successfully explains reward-predictive activities of the midbrain dopaminergic system as well as those of the cortex and the striatum [9-13]. In TD learning theory, the predicted amount of future reward starting from a state s(t) is formulated as the “value function” V(t) = E[r(t + 1) + γ r(t + 2) + γ 2r(t + 3) + …] (1) and learning is based on the TD error δ(t) = r(t) + γ V(t) – V(t - 1). (2) The ‘discount factor’ γ controls the time scale of prediction; while only the immediate reward r(t + 1) is considered with γ = 0, rewards in the longer future are taken into account with γ closer to 1. In order to test the above hypothesis [7], we developed a reinforcement learning task which requires a large value of discount factor for successful performance, and analyzed subjects’ brain activities using functional MRI. In addition to conventional block-design analysis, a novel model-based regression analysis revealed topographic representation of prediction time scale with in the cortico-basal ganglia loops. 2 2.1 Methods Markov Decision Task In the Markov decision task (Fig. 1), markers on the corners of a square present four states, and the subject selects one of two actions by pressing a button (a1 = left button, a2 = right button) (Fig. 1A). The action determines both the amount of reward and the movement of the marker (Fig. 1B). In the REGULAR condition, the next trial is started from the marker position at the end of the previous trial. Therefore, in order to maximize the reward acquired in a long run, the subject has to select an action by taking into account both the immediate reward and the future reward expected from the subsequent state. The optimal behavior is to receive small negative rewards at states s 2, s3, and s4 to obtain a large positive reward at state s1 (Fig. 1C). In the RANDOM condition, next trial is started from a random marker position so that the subject has to consider only immediate reward. 
Thus, the optimal behavior is to collect a larger reward at each state (Fig. 1D). In the baseline condition (NO condition), the reward is always zero. In order to learn the optimal behaviors, the discount factor γ has to be larger than 0.3425 in REGULAR condition, while it can be arbitrarily small in RANDOM condition. 2.2 fMRI imaging Eighteen healthy, right-handed volunteers (13 males and 5 females), gave informed consent to take part in the study, with the approval of the ethics and safety committees of ATR and Hiroshima University. A 0 Time 1.0 2.0 2.5 3.0 100 C B +r 2 s2 s1 REGULAR condition s2 -r 1 -r 2 +r 1 s1 100 D RANDOM condition +r 2 s2 s1 -r 1 +r 1 -r 2 -r 1 s4 +r 2 4.0 (s) -r 1 s3 a1 a2 r1 = 20 10 yen r2 = 100 10 yen +r 1 -r 1 s4 -r 1 -r 1 s3 s4 -r 1 s3 Fig. 1. (A) Sequence of stimulus and response events in the Markov decision task. First, one of four squares representing present state turns green (0s). As the fixation point turns green (1s), the subject presses either the right or left button within 1 second. After 1s delay, the green square changes its position (2s), and then a reward for the current action is presented by a number (2.5s) and a bar graph showing cumulative reward during the block is updated (3.0s). One trial takes four seconds. Subjects performed five trials in the NO condition, 32 trials in the RANDOM condition, five trials in the NO condition, and 32 trials in the REGULAR condition in one block. They repeated four blocks; thus, the entire experiment consisted of 312 trials, taking about 20 minutes. (B) The rule of the reward and marker movement. (C) In the REGULAR condition, the optimal behavior is to receive small negative rewards –r 1 (-10, -20, or -30 yen) at states s2, s3, and s4 to obtain a large positive reward +r2 (90, 100, or 110 yen) at state s1. (D) In the RANDOM condition, the next trial is started from random state. Thus, the optimal behavior is to select a larger reward at each state. A 1.5-Tesla scanner (Marconi, MAGNEX ECLIPSE, Japan) was used to acquire both structural T1-weighted images (TR = 12 s, TE = 450 ms, flip angle = 20 deg, matrix = 256 × 256, FoV = 256 mm, thickness = 1 mm, slice gap = 0 mm ) and T2*-weighted echo planar images (TR = 4 s, TE = 55 msec, flip angle = 90 deg, 38 transverse slices, matrix = 64 × 64, FoV = 192 mm, thickness = 4 mm, slice gap = 0 mm, slice gap = 0 mm) with blood oxygen level-dependent (BOLD) contrast. 2.3 Data analysis The data were preprocessed and analyzed with SPM99 (Friston et al., 1995; Wellcome Department of Cognitive Neurology, London, UK). The first three volumes of images were discarded to avoid T1 equilibrium effects. The images were realigned to the first image as a reference, spatially normalized with respect to the Montreal Neurological Institute EPI template, and spatially smoothed with a Gaussian kernel (8 mm, full-width at half-maximum). A RANDOM condition action larger reward Fig. 2. The selected action of a representative single subject (solid line) and the group average ratio of selecting optimal action (dashed line) in (A) RANDOM and (B) REGULAR conditions. smaller reward 1 32 64 96 128 96 128 trial REGULAR condition B action optimal nonoptimal 1 32 64 trial Images of parameter estimates for the contrast of interest were created for each subject. These were then used for a second-level group analysis using a one-sample t-test across the subjects (random effects analysis). We conducted two types of analysis. 
One was block design analysis using three boxcar regressors convolved with a hemodynamic response function as the reference waveform for each condition (RANDOM, REGULAR, and NO). The other was multivariate regression analysis using explanatory variables, representing the time course of the reward prediction V(t) and reward prediction error δ(t) estimated from subjects’ performance data (described below), in addition to three regressors representing the condition of the block. 2.4 Estimation of predicted reward V(t) and prediction error δ(t) The time course of reward prediction V(t) and reward prediction error δ(t) were estimated from each subject’s performance data, i.e. state s(t), action a(t), and reward r(t), as follows. If the subject starts from a state s(t) and comes back to the same state after k steps, the expected cumulative reward V(t) should satisfy the consistency condition V(t) = r(t + 1) + γ r(t + 2) + … + γ k-1 r(t + k) + γ kV(t). (3) Thus, for each time t of the data file, we calculated the weighted sum of the rewards acquired until the subject returned to the same state and estimated the value function for that episode as r ( t + 1) + γ r ( t + 2 ) + ... + γ k −1r ( t + k ) . ˆ (t ) = V 1− γ k (4) The estimate of the value function V(t) at time t was given by the average of all previous episodes from the same state as at time t V (t ) = 1 L L ∑ Vˆ ( t ) , l (5) l =1 where {t1, …, tL} are the indices of time visiting the same state as s(t), i.e. s(t1) = … = s(tL) = s(t). The TD error was given by the difference between the actual reward r(t) and the temporal difference of the value function V(t) according to equation (2). Assuming that different brain areas are involved in reward prediction on different time scales, we varied the discount factor γ as 0, 0.3, 0.6, 0.8, 0.9, and 0.99. Fig. 3. (A) In REGULAR vs. RANDOM comparison, significant activation was observed in DLPFC ((x, y, z) = (46, 45, 9), peak t = 4.06) (p < 0.001 uncorrected). (B) In RANDOM vs. REGULAR comparison, significant activation was observed in lateral OFC ((x, y, z) = (-32, 9, -21), peak t = 4.90) (p < 0.001 uncorrected). 3 3.1 R e sul t s Behavioral results Figure 2 summarizes the learning performance of a representative single subject (solid line) and group average (dashed line) during fMRI measurement. Fourteen subjects successfully learned to take larger immediate rewards in the RANDOM condition (Fig. 2A) and a large positive reward at s1 after small negative rewards at s2, s3 and s4 in the REGULAR condition (Fig. 2B). 3.2 Block-design analysis In REGULAR vs. RANDOM contrast, we observed a significant activation in the dorsolateral prefrontal cortex (DLPFC) (Fig. 3A) (p < 0.001 uncorrected). In RANDOM vs. REGULAR contrast, we observed a significant activation in lateral orbitofrontal cortex (lOFC) (Fig. 3B) (p < 0.001 uncorrected). The result of block-design analysis suggests differential involvement of neural pathways in reward prediction on long and short time scales. The result in RANDOM vs. REGULAR contrast was consistent with previous studies that the OFC is involved in reward prediction within a short delay and reward outcome [14-20]. 3.3 Regression analysis We observed significant correlation with reward prediction V(t) in the MFC, DLPFC (all γ ), ventromedial insula (small γ ), dorsal striatum, amygdala, hippocampus, and parahippocampal gyrus (large γ ) (p < 0.001 uncorrected) (Fig. 4A). 
We also found significant correlation with reward prediction error δ(t) in the IPC, PMd, cerebellum (all γ ), ventral striatum (small γ ), and lateral OFC (large γ ) (p < 0.001 uncorrected) (Fig. 4B). As we changed the time scale parameter γ of reward prediction, we found rostro-caudal maps of correlation to V(t) in MFC with increasing γ. Fig. 4. Voxels with a significant correlation (p < 0.001 uncorrected) with reward prediction V(t) and prediction error δ(t) are shown in different colors for different settings of the time scale parameter (γ = 0 in red, γ = 0.3 in orange, γ = 0.6 in yellow, γ = 0.8 in green, γ = 0.9 in cyan, and γ = 0.99 in blue). Voxels correlated with two or more regressors are shown by a mosaic of colors. (A) Significant correlation with reward prediction V(t) was observed in the MFC, DLPFC, dorsal striatum, insula, and hippocampus. Note the anterior-ventral to posterior-dorsal gradient with the increase in γ in the MFC. (B) Significant correlation with reward prediction error δ(t) on γ = 0 was observed in the ventral striatum. 4 D i s c u ss i o n In the MFC, anterior and ventral part was involved in reward prediction V(t) on shorter time scales (0 ≤ γ ≤ 0.6), whereas posterior and dorsal part was involved in reward prediction V(t) on longer time scales (0.6 ≤ γ ≤ 0.99). The ventral striatum involved in reward prediction error δ(t) on shortest time scale (γ = 0), while the dorsolateral striatum correlated with reward prediction V(t) on longer time scales (0.9 ≤ γ ≤ 0.99). These results are consistent with the topographic organization of fronto-striatal connection; the rostral part of the MFC project to the ventral striatum, whereas the dorsal and posterior part of the cingulate cortex project to the dorsolateral striatum [21]. In the MFC and the striatum, no significant difference in activity was observed in block-design analysis while we did find graded maps of activities with different values of γ. A possible reason is that different parts of the MFC and the striatum are concurrently involved with reward prediction on different time scales, regardless of the task context. Activities of the DLPFC and lOFC, which show significant differences in block-design analysis (Fig. 3), may be regulated according to the necessity for the task; From these results, we propose the following mechanism of reward prediction on different time scales. The parallel cortico-basal ganglia loops are responsible for reward prediction on various time scales. The ‘limbic loop’ via the ventral striatum specializes in immediate reward prediction, whereas the ‘cognitive and motor loop’ via the dorsal striatum specialises in future reward prediction. Each loop learns to predict rewards on its specific time scale. To perform an optimal action under a given time scale, the output of the loop with an appropriate time scale is used for actual action selection. Previous studies in brain damages and serotonergic functions suggest that the MFC and the dorsal raphe, which are reciprocally connected [22, 23], play an important role in future reward prediction. The cortico-cortico projections from the MFC, or the serotonergic projections from the dorsal raphe to the cortex and the striatum may be involved in the modulation of these parallel loops. In present study, using a novel regression analysis based on subjects’ performance data and reinforcement learning model, we revealed the maps of time scales in reward prediction, which could not be found by conventional block-design analysis. 
Future studies using this method under pharmacological manipulation of the serotonergic system would clarify the role of serotonin in regulating the time scale of reward prediction.

Acknowledgments

We thank Nicolas Schweighofer, Kazuyuki Samejima, Masahiko Haruno, Hiroshi Imamizu, Satomi Higuchi, Toshinori Yoshioka, and Mitsuo Kawato for helpful discussions and technical advice.

References

[1] Rogers, R.D., et al. (1999) Dissociable deficits in the decision-making cognition of chronic amphetamine abusers, opiate abusers, patients with focal damage to prefrontal cortex, and tryptophan-depleted normal volunteers: evidence for monoaminergic mechanisms. Neuropsychopharmacology 20(4):322-339.
[2] Evenden, J.L. & Ryan, C.N. (1996) The pharmacology of impulsive behaviour in rats: the effects of drugs on response choice with varying delays of reinforcement. Psychopharmacology (Berl) 128(2):161-170.
[3] Mobini, S., et al. (2000) Effects of central 5-hydroxytryptamine depletion on sensitivity to delayed and probabilistic reinforcement. Psychopharmacology (Berl) 152(4):390-397.
[4] Bechara, A., et al. (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1-3):7-15.
[5] Bechara, A., Tranel, D. & Damasio, H. (2000) Characterization of the decision-making deficit of patients with ventromedial prefrontal cortex lesions. Brain 123:2189-2202.
[6] Mobini, S., et al. (2002) Effects of lesions of the orbitofrontal cortex on sensitivity to delayed and probabilistic reinforcement. Psychopharmacology (Berl) 160(3):290-298.
[7] Doya, K. (2002) Metalearning and neuromodulation. Neural Networks 15(4-6):495-506.
[8] Sutton, R.S. & Barto, A.G. (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
[9] Houk, J.C., Adams, J.L. & Barto, A.G. (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J.C. Houk, J.L. Davis & D.G. Beiser (eds.), Models of Information Processing in the Basal Ganglia, pp. 249-270. Cambridge, MA: MIT Press.
[10] Schultz, W., Dayan, P. & Montague, P.R. (1997) A neural substrate of prediction and reward. Science 275(5306):1593-1599.
[11] Doya, K. (2000) Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr Opin Neurobiol 10(6):732-739.
[12] Berns, G.S., et al. (2001) Predictability modulates human brain response to reward. J Neurosci 21(8):2793-2798.
[13] O'Doherty, J.P., et al. (2003) Temporal difference models and reward-related learning in the human brain. Neuron 38(2):329-337.
[14] Koepp, M.J., et al. (1998) Evidence for striatal dopamine release during a video game. Nature 393(6682):266-268.
[15] Rogers, R.D., et al. (1999) Choosing between small, likely rewards and large, unlikely rewards activates inferior and orbital prefrontal cortex. J Neurosci 19(20):9029-9038.
[16] Elliott, R., Friston, K.J. & Dolan, R.J. (2000) Dissociable neural responses in human reward systems. J Neurosci 20(16):6159-6165.
[17] Breiter, H.C., et al. (2001) Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron 30(2):619-639.
[18] Knutson, B., et al. (2001) Anticipation of increasing monetary reward selectively recruits nucleus accumbens. J Neurosci 21(16):RC159.
[19] O'Doherty, J.P., et al. (2002) Neural responses during anticipation of a primary taste reward. Neuron 33(5):815-826.
[20] Pagnoni, G., et al. (2002) Activity in human ventral striatum locked to errors of reward prediction. Nat Neurosci 5(2):97-98.
[21] Haber, S.N., et al. (1995) The orbital and medial prefrontal circuit through the primate basal ganglia. J Neurosci 15(7 Pt 1):4851-4867.
[22] Celada, P., et al. (2001) Control of dorsal raphe serotonergic neurons by the medial prefrontal cortex: Involvement of serotonin-1A, GABA(A), and glutamate receptors. J Neurosci 21(24):9917-9929.
[23] Martin-Ruiz, R., et al. (2001) Control of serotonergic function in medial prefrontal cortex by serotonin-2A receptors through a glutamate-dependent mechanism. J Neurosci 21(24):9856-9866.
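To make the estimation procedure of Section 2.4 concrete, the following sketch computes the episode-wise estimate V̂(t) from equation (4), the state-wise average V(t) from equation (5), and a TD error of the standard form δ(t) = r(t) + γ V(t) − V(t−1), which we assume corresponds to equation (2) of the paper (that equation is not reproduced above). The variable names, the episode segmentation at the first return to the same state, and the NaN handling are our own assumptions, not the authors' analysis code.

    import numpy as np

    def estimate_value_and_td_error(states, rewards, gamma):
        # states[t]  : discrete state label s(t), t = 0..T-1
        # rewards[t] : reward r(t) delivered at step t
        # gamma      : discount factor, 0 <= gamma < 1
        T = len(states)
        v_hat = np.full(T, np.nan)

        # Equation (4): discount-sum the rewards until the subject returns to
        # the same state, then normalize by (1 - gamma^k).
        for t in range(T):
            for k in range(1, T - t):
                if states[t + k] == states[t]:
                    discounted = sum(gamma ** (j - 1) * rewards[t + j]
                                     for j in range(1, k + 1))
                    v_hat[t] = discounted / (1.0 - gamma ** k)
                    break  # the episode ends at the first return to s(t)

        # Equation (5): average V_hat over all visits (up to time t) to the
        # same state as s(t).
        V = np.full(T, np.nan)
        for t in range(T):
            same = [v_hat[u] for u in range(t + 1)
                    if states[u] == states[t] and not np.isnan(v_hat[u])]
            if same:
                V[t] = float(np.mean(same))

        # Assumed TD error: delta(t) = r(t) + gamma * V(t) - V(t - 1)
        delta = np.full(T, np.nan)
        for t in range(1, T):
            if not (np.isnan(V[t]) or np.isnan(V[t - 1])):
                delta[t] = rewards[t] + gamma * V[t] - V[t - 1]

        return v_hat, V, delta

    # The analysis repeats this for several time scales, e.g.:
    # for gamma in (0.0, 0.3, 0.6, 0.8, 0.9, 0.99):
    #     v_hat, V, delta = estimate_value_and_td_error(states, rewards, gamma)

The resulting V(t) and δ(t) time courses for each γ would then serve as the explanatory variables in the regression analysis of Section 3.3.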
4 0.72724336 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura
Abstract: When we model higher order functions, such as learning and memory, we face the difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose a novel method for estimating hidden variables of a learning agent, such as connection weights, from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from an observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply a particle filter to estimate internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1
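As a rough illustration of the idea in this abstract (not the authors' model or code), the sketch below uses a bootstrap particle filter to track a Q-learning agent's hidden action values and an unknown learning rate from its observed choices and rewards. The two-armed bandit setting, the softmax likelihood with a known inverse temperature, and all numerical values are our own assumptions.

    import numpy as np

    def particle_filter_q_learning(actions, rewards, n_particles=2000, beta=3.0):
        # actions : observed choices a(t) in {0, 1} (two-armed bandit assumed)
        # rewards : observed rewards r(t)
        # beta    : assumed softmax inverse temperature (treated as known)
        rng = np.random.default_rng(0)
        T = len(actions)
        alpha = rng.uniform(0.0, 1.0, n_particles)   # candidate learning rates
        Q = np.zeros((n_particles, 2))               # each particle's Q-values
        alpha_hat = np.zeros(T)

        for t in range(T):
            # Likelihood of the observed action under each particle's softmax policy.
            logits = beta * Q
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            w = p[:, actions[t]]
            w /= w.sum()

            # Posterior mean of the learning rate given data up to time t.
            alpha_hat[t] = np.sum(w * alpha)

            # Resample particles in proportion to their weights.
            idx = rng.choice(n_particles, size=n_particles, p=w)
            alpha, Q = alpha[idx], Q[idx]

            # Propagate: each particle applies its own Q-learning update, with a
            # little jitter on alpha so the filter does not collapse.
            Q[:, actions[t]] += alpha * (rewards[t] - Q[:, actions[t]])
            alpha = np.clip(alpha + rng.normal(0.0, 0.01, n_particles), 0.0, 1.0)

        return alpha_hat

The same scheme extends to other metaparameters (e.g. a discount factor or softmax temperature) by adding them to the particle state.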
5 0.69546205 68 nips-2003-Eye Movements for Reward Maximization
Author: Nathan Sprague, Dana Ballard
Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. If there are several candidate eye movements, the one with the highest expected value is chosen. The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. Simulations show our protocol is superior to a simple round robin scheduling mechanism. 1
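A minimal sketch of the selection rule described in this abstract (our own formulation and names, not the authors' model): each candidate fixation target is scored by the expected cost that would be incurred if its task variable were left uncertain, and the highest-scoring target is chosen.

    def choose_fixation(candidates):
        # candidates: dict mapping a fixation target to the expected cost of
        # leaving its task variable uncertain; the value of an eye movement is
        # taken to be that avoided cost, and the highest-valued target wins.
        return max(candidates, key=candidates.get)

    # Hypothetical sidewalk-navigation targets and costs (illustrative numbers).
    targets = {"sidewalk_heading": 0.4, "obstacle_distance": 1.7, "litter_location": 0.9}
    print(choose_fixation(targets))  # -> "obstacle_distance"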
6 0.68120092 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
7 0.64090824 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
8 0.53935367 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
9 0.50075841 55 nips-2003-Distributed Optimization in Adaptive Networks
10 0.46070081 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
11 0.4518165 78 nips-2003-Gaussian Processes in Reinforcement Learning
12 0.44145167 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
13 0.39469525 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
14 0.37280998 158 nips-2003-Policy Search by Dynamic Programming
15 0.36218885 62 nips-2003-Envelope-based Planning in Relational MDPs
16 0.35968718 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?
17 0.34534565 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
18 0.34330624 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
19 0.31384471 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
20 0.29721949 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
simIndex simValue paperId paperTitle
2 0.87554675 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
Author: Ryan C. Kelly, Tai Sing Lee
Abstract: Decoding is a strategy that allows us to assess the amount of information neurons can provide about certain aspects of the visual scene. In this study, we develop a method based on Bayesian sequential updating and the particle filtering algorithm to decode the activity of V1 neurons in awake monkeys. A distinction in our method is the use of Volterra kernels to filter the particles, which live in a high dimensional space. This parametric Bayesian decoding scheme is compared to the optimal linear decoder and is shown to work consistently better than the linear optimal decoder. Interestingly, our results suggest that for decoding in real time, spike trains of as few as 10 independent but similar neurons would be sufficient for decoding a critical scene variable in a particular class of visual stimuli. The reconstructed variable can predict the neural activity about as well as the actual signal with respect to the Volterra kernels. 1
3 0.85395074 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
4 0.84123152 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
5 0.83974385 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
6 0.83955461 68 nips-2003-Eye Movements for Reward Maximization
7 0.83908409 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.8389402 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
9 0.83375227 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
10 0.83349615 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
11 0.82848507 158 nips-2003-Policy Search by Dynamic Programming
12 0.8277548 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
13 0.82543218 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
14 0.82430243 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
15 0.82080311 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
16 0.819965 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
17 0.81969446 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
18 0.81955206 124 nips-2003-Max-Margin Markov Networks
19 0.81674761 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
20 0.81615198 3 nips-2003-AUC Optimization vs. Error Rate Minimization