nips nips2003 nips2003-68 knowledge-graph by maker-knowledge-mining

68 nips-2003-Eye Movements for Reward Maximization


Source: pdf

Author: Nathan Sprague, Dana Ballard

Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. If there are several candidate eye movements, the one with the highest expected value is chosen. The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. Simulations show our protocol is superior to a simple round robin scheduling mechanism. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. [sent-5, score-1.263]

2 However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. [sent-6, score-0.978]

3 The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. [sent-7, score-1.665]

4 The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. [sent-8, score-0.947]

5 A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. [sent-9, score-0.71]

6 If there are several candidate eye movements, the one with the highest expected value is chosen. [sent-10, score-0.554]

7 The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. [sent-11, score-0.506]

8 Simulations show our protocol is superior to a simple round robin scheduling mechanism. [sent-12, score-0.424]

9 1 Introduction This paper introduces a new framework for understanding the scheduling of human eye movements. [sent-13, score-0.67]

10 The human eye is characterized by a small, high resolution fovea. [sent-14, score-0.543]

11 The importance of foveal vision means that fast ballistic eye movements called saccades are made at a rate of approximately three per second to direct gaze to relevant areas of the visual field. [sent-15, score-1.014]

12 Since the location of the fovea provides a powerful clue to what information the visual system is processing, understanding the scheduling and targeting of eye movements is key to understanding the organization of human vision. [sent-16, score-1.074]

13 The recent advent of portable eye-trackers has made it possible to study eye movements in everyday behaviors. [sent-17, score-0.779]

14 These studies show that behaviors such as driving [1, 2] or navigating a city sidewalk [3] involve rapid alternating saccades to different targets, indicative of competing perceptual demands. [sent-18, score-0.751]

15 In contrast, our underlying premise is that much of routine human behavior can be understood in the framework of reward maximization. [sent-22, score-0.287]

16 In other words, humans choose actions by trading off the cost of the actions versus their benefits. [sent-23, score-0.217]

17 One obvious way of modeling eye movement selection is to use a reinforcement learning strategy directly. [sent-26, score-0.673]

18 However, standard reinforcement learning algorithms are best suited to handling actions that have direct consequences for a task. [sent-27, score-0.233]

19 Actions such as eye movements are more difficult to put in a reinforcement learning framework because they have indirect consequences: they do not change the state of the environment; they serve only to obtain information. [sent-28, score-1.044]

20 We show a way of overcoming this difficulty while preserving the notion of reward maximization in the scheduling of eye movements. [sent-29, score-0.721]

21 The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are relevant to behavior. [sent-30, score-0.936]

22 A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. [sent-31, score-0.71]

23 If there are several candidate eye movements, the one with the highest potential loss is chosen. [sent-32, score-0.58]

24 The agent is faced with multiple simultaneous goals including walking along a sidewalk, picking up litter, and avoiding obstacles. [sent-34, score-0.236]

25 He must schedule simulated eye movements so as to maximize his reward across the set of goals. [sent-35, score-0.889]

26 We model eye movements as abstract sensory actions that serve to retrieve task relevant information from the environment. [sent-36, score-1.042]

27 Our focus is on temporal scheduling; we are not concerned with the spatial targeting of eye movements. [sent-37, score-0.523]

28 The purpose of this paper is to recast the question of how eye movements are scheduled, and to propose a possible answer. [sent-38, score-0.803]

29 For the purpose of modeling human performance it is assumed that each behavior has the ability to direct the eye, perform appropriate visual processing to retrieve the information necessary for performance of the behavior’s task, and choose an appropriate course of action. [sent-43, score-0.28]

30 As long as only one goal is active at a time the behavior based approach is straightforward: the appropriate behavior is put in control and has all the machinery necessary to pursue the goal. [sent-44, score-0.261]

31 In the following sections we will describe how physical control is arbitrated, and building on that framework, how eye movements are arbitrated. [sent-47, score-0.827]

32 Our approach to designing behaviors is to model each behavior’s task as a Markov decision process and then find good policies using reinforcement learning. [sent-48, score-0.356]

33 An MDP is described by a 4-tuple (S, A, T, R), where S is the state space, A is the action space, and T(s, a, s′) is the transition function that gives the probability of arriving in state s′ when action a is taken in state s. [sent-49, score-0.637]

34 The reward function R(s, a) denotes the expected one-step payoff for taking action a in state s. [sent-50, score-0.411]

35 The action-value function Q(s, a) denotes the expected discounted return if action a is taken in state s and the optimal policy is followed thereafter. [sent-54, score-0.341]

36 If Q(s, a) is known then the learning agent can behave optimally by always choosing arg maxa Q(s, a). [sent-55, score-0.253]
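
To make the MDP notation above concrete, the following minimal sketch (illustrative values only, not code from the paper) tabulates Q(s, a) for a toy four-state, three-action problem and acts greedily by choosing arg maxa Q(s, a):

```python
import numpy as np

# Toy tabular Q-function: 4 states x 3 actions. The numbers are made up for
# illustration; the paper instead learns Q over continuous two-dimensional
# states with reinforcement learning.
Q = np.array([
    [0.2, 1.5, 0.1],   # state s1
    [0.7, 0.3, 0.9],   # state s2
    [1.1, 0.4, 0.2],   # state s3
    [0.5, 0.6, 1.3],   # state s4
])

def greedy_action(state_index: int) -> int:
    """Behave optimally with respect to a known Q: arg max_a Q(s, a)."""
    return int(np.argmax(Q[state_index]))

print(greedy_action(0))  # -> 1
```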

37 Here we assume that the behaviors share an action space. [sent-59, score-0.318]

38 The theoretical foundations of value based continuous state reinforcement learning are not as well established as for the discrete state case. [sent-63, score-0.364]

39 For reasons of space this paper will not include a complete description of the training procedure used to obtain the Q-functions for the sidewalk task. [sent-66, score-0.424]

40 3 A Composite Task: Sidewalk Navigation The components of the sidewalk navigation task are to stay on the sidewalk, avoid obstacles, and pick up litter. [sent-68, score-0.524]

41 Our sidewalk navigation model has three behaviors: sidewalk following, obstacle avoidance, and litter collection. [sent-70, score-1.246]

42 These behaviors share an action space composed of three actions: 15° right turn, 15° left turn, and no turn (medium gray, dark gray, and light gray arrows in Figure 1). [sent-71, score-0.361]

43 During the sidewalk navigation task the virtual human walks forward at a steady rate of 1. [sent-72, score-0.631]

44 Every 300ms a new action is selected according to the action selection mechanism summarized in Equation (1). [sent-74, score-0.304]
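
Equation (1) itself is not reproduced among the extracted sentences, so the sketch below assumes one common arbitration rule: every 300 ms, sum each behavior's Q-values over the shared three-action space and take the argmax. Behavior names and Q-values are hypothetical.

```python
import numpy as np

# Hypothetical arbitration over the shared action space (15-degree left turn,
# no turn, 15-degree right turn). Summing per-behavior Q-values is one
# plausible reading of the arbitration mechanism; it is not quoted from the
# paper.
ACTIONS = ["turn_left_15", "no_turn", "turn_right_15"]

def select_action(q_values_per_behavior):
    """q_values_per_behavior: one array of Q(s_b, .) values per behavior."""
    combined = np.sum(q_values_per_behavior, axis=0)
    return ACTIONS[int(np.argmax(combined))]

q_oa = np.array([0.9, 0.2, 0.1])   # obstacle avoidance
q_sf = np.array([0.1, 0.8, 0.3])   # sidewalk following
q_lc = np.array([0.2, 0.3, 0.7])   # litter collection
print(select_action([q_oa, q_sf, q_lc]))  # -> "no_turn"
```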

45 Each of the three behaviors has a two dimensional state space. [sent-75, score-0.307]

46 For obstacle avoidance the state space comprises the distance and angle, relative to the agent, to the nearest obstacle. [sent-76, score-0.375]

47 The litter collection behavior uses the same parameterization for the nearest litter item. [sent-77, score-0.514]

48 All behaviors use the log of distance. [sent-79, score-0.214]
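
The two-dimensional states described above (distance and angle to the nearest obstacle or litter item, with distance taken in log space) could be computed roughly as follows; the exact encoding is not fully specified in the extracted sentences, so this sketch is an assumption.

```python
import math

# Hypothetical 2-D state for obstacle avoidance / litter collection:
# (log distance, signed relative angle) to the nearest target.
def relative_state(agent_xy, agent_heading, target_xy):
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) - agent_heading
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to [-pi, pi]
    return (math.log(distance + 1e-6), angle)

print(relative_state((0.0, 0.0), 0.0, (3.0, 1.0)))
```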

49 Figures a)-c) show maxa Q(s, a) for the three behaviors: a) obstacle avoidance, b) sidewalk following, and c) litter collection. [sent-120, score-0.757]

50 The agent receives two units of reward for every item of litter collected, one unit for every time step he remains on the sidewalk, and four units for every time step he does not collide with an obstacle. [sent-124, score-0.565]
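
The per-step reward structure just described can be written directly as a small function; the signature and argument names below are hypothetical.

```python
def step_reward(litter_collected: int, on_sidewalk: bool, collided: bool) -> float:
    """Per-time-step reward as described above: two units per litter item,
    one unit for staying on the sidewalk, four units for avoiding collision."""
    reward = 2.0 * litter_collected
    if on_sidewalk:
        reward += 1.0
    if not collided:
        reward += 4.0
    return reward

assert step_reward(litter_collected=1, on_sidewalk=True, collided=False) == 7.0
```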

51 The behaviors use simple sensory routines to retrieve the relevant state information from the environment. [sent-126, score-0.435]

52 The sidewalk following behavior searches for pixels at the border of the sidewalk and the grass, and finds the most prominent line using a Hough transform. [sent-127, score-0.938]

53 The litter collection routine uses color based matching to find the location of litter items. [sent-128, score-0.452]

54 The obstacle avoidance routine refers to the world model directly to compute a rough depth map of the area ahead, and from that extracts the position of the nearest obstacle. [sent-129, score-0.253]

55 It allows us to represent the consequences of not having the most recent information from an eye movement. [sent-134, score-0.521]

56 With this information the behaviors may treat their state estimates as continuous random variables with known probability distributions. [sent-137, score-0.363]

57 The other useful property of the Kalman filter is that it is able to propagate state estimates in the absence of sensory information. [sent-138, score-0.212]
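
A minimal one-dimensional Kalman filter illustrates both properties mentioned above: the estimate is propagated and its variance grows while a behavior is denied perception, and the variance collapses once an observation arrives. Parameter values are made up for illustration.

```python
# Minimal 1-D Kalman filter sketch (illustrative parameters, not the paper's).
class Kalman1D:
    def __init__(self, x0, p0, process_var, obs_var):
        self.x, self.p = x0, p0            # state estimate and its variance
        self.q, self.r = process_var, obs_var

    def predict(self, motion=0.0):
        """Propagate the estimate with no measurement: uncertainty grows."""
        self.x += motion
        self.p += self.q

    def update(self, z):
        """Incorporate an observation z: uncertainty shrinks."""
        k = self.p / (self.p + self.r)     # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)

kf = Kalman1D(x0=2.0, p0=0.5, process_var=0.1, obs_var=0.05)
for _ in range(3):
    kf.predict()          # three 300 ms steps without perception
print(round(kf.p, 2))     # 0.8: variance has grown
kf.update(z=1.8)          # the behavior finally receives a fixation
print(round(kf.p, 3))     # 0.047: variance collapses toward the observation noise
```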

58 In order to simulate the fact that only one area of the visual field may be foveated, only one behavior is allowed access to perception during each 300ms time step. [sent-140, score-0.242]

59 Since the agent does not have perfectly up to date state information, he must select the best action given his current estimates of the state. [sent-143, score-0.498]

60 A reasonable way of selecting an action under uncertainty is to select the action with the highest expected return. [sent-144, score-0.426]

61 Selecting the action with the highest expected return does not guarantee that the agent will choose the best action for the true state of the environment. [sent-148, score-0.684]
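
One way to realize "select the action with the highest expected return" under a Gaussian state estimate is to approximate the expectation by sampling, as sketched below with a made-up one-dimensional Q-function; this illustrates the idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(state, action):
    # Placeholder Q-function; the paper's Q-functions are learned over 2-D states.
    return -abs(state - action)

def expected_q(mean, var, action, n_samples=1000):
    samples = rng.normal(mean, np.sqrt(var), size=n_samples)
    return float(np.mean([q(s, action) for s in samples]))

def select_action(mean, var, actions=(-1.0, 0.0, 1.0)):
    """Choose the action with the highest expected return under the belief."""
    return max(actions, key=lambda a: expected_q(mean, var, a))

print(select_action(mean=0.2, var=0.5))  # picks the action best in expectation
```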

62 Whenever the agent chooses an action that is sub-optimal for the true state of the environment, he can expect to lose some return. [sent-149, score-0.497]

63 The term on the left-hand side of the minus sign in Equation (3) expresses the expected return that the agent would receive if he were able to act with knowledge of the true state of the environment. [sent-151, score-0.413]

64 The term on the right expresses the expected return if the agent is forced to choose an action based on his state estimate. [sent-152, score-0.547]

65 The total expected loss does not help to select which of the behaviors should be given access to perception. [sent-155, score-0.364]

66 To make this selection, the loss value needs to be broken down into the losses associated with the uncertainty for each particular behavior b: lossb = E[ max_a ( Qb(sb, a) + Σ_{i∈B, i≠b} QE_i(si, a) ) ] - Σ_{i∈B} QE_i(si, aE). [sent-156, score-0.259]

67 The value on the left is the expected return if sb were known, but the other state variables were not. [sent-158, score-0.264]

68 Given that the Q functions are known, and that the Kalman filters provide distributions over the state variables, it is straightforward to estimate lossb for each behavior b by sampling. [sent-161, score-0.252]
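
A sketch of that sampling procedure is given below, with one-dimensional states, Gaussian Kalman beliefs, and toy Q-functions (all hypothetical). QE[b][a] is the expected Q-value of behavior b under its belief, aE is the action chosen from the estimates alone, and the behavior with the largest estimated loss is given the next eye movement.

```python
import numpy as np

rng = np.random.default_rng(1)
ACTIONS = (0, 1, 2)

def expected_q(q, belief, a, n=500):
    mean, var = belief
    return float(np.mean([q(s, a) for s in rng.normal(mean, np.sqrt(var), n)]))

def schedule_eye_movement(q_funcs, beliefs, n=500):
    # QE[b][a]: expected Q of behavior b under its Gaussian belief.
    qe = {b: {a: expected_q(q_funcs[b], beliefs[b], a) for a in ACTIONS}
          for b in q_funcs}
    a_e = max(ACTIONS, key=lambda a: sum(qe[b][a] for b in q_funcs))
    base = sum(qe[b][a_e] for b in q_funcs)   # return from acting on estimates
    losses = {}
    for b in q_funcs:
        mean, var = beliefs[b]
        samples = rng.normal(mean, np.sqrt(var), n)
        # Expected return if s_b were known but the other states were not.
        best_if_b_known = np.mean([
            max(q_funcs[b](s, a) + sum(qe[i][a] for i in q_funcs if i != b)
                for a in ACTIONS)
            for s in samples])
        losses[b] = best_if_b_known - base
    return max(losses, key=losses.get), losses

q_funcs = {"obstacle": lambda s, a: -abs(s - a),
           "sidewalk": lambda s, a: -0.5 * abs(s - a),
           "litter":   lambda s, a: -0.2 * abs(s - a)}
beliefs = {"obstacle": (1.0, 2.0), "sidewalk": (1.0, 0.1), "litter": (1.0, 0.1)}
print(schedule_eye_movement(q_funcs, beliefs)[0])  # typically "obstacle"
```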

69 This value is then used to select which behavior will make an eye movement. [sent-162, score-0.602]

70 Figure 2 gives an example of several steps of the sidewalk task, the associated eye movements, and the state estimates. [sent-163, score-1.031]

71 The eye movements are allocated to reduce the uncertainty where it has the greatest potential negative consequences for reward. [sent-164, score-0.876]

72 For example, the agent fixates the obstacle as he draws close to it, and shifts perception to the other two behaviors when the obstacle has been safely passed. [sent-165, score-0.653]

73 For example if the obstacle avoidance behavior sees two obstacles it will initialize a filter for each. [sent-169, score-0.366]

74 However, only the single closest object is used to determine the state for the purpose of action selection and scheduling eye movements. [sent-170, score-0.928]

75 Figure 2 (rows labeled OA, SF, LC over TIME): a) An overhead view of the virtual agent during seven time steps of the sidewalk navigation task. [sent-171, score-0.72]

76 The rays projecting from the agent represent eye movements; gray rays correspond to obstacle avoidance, black rays to sidewalk following, and white rays to litter collection. [sent-173, score-1.617]

77 When present, the black regions correspond to the 90% confidence bounds after an eye movement has been made. [sent-178, score-0.545]

78 The second and third rows show the corresponding information for the sidewalk following and litter collection tasks. [sent-179, score-0.636]

79 5 Results In order to test the effectiveness of the loss minimization approach, we compare it to two alternative scheduling mechanisms: round robin, which sequentially rotates through the three behaviors, and random, which makes a uniform random selection on each time step. [sent-180, score-0.362]
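
The two baseline schedulers are simple to state in code; the sketch below is a hypothetical harness, not the authors' simulation.

```python
import itertools
import random

BEHAVIORS = ["obstacle_avoidance", "sidewalk_following", "litter_collection"]

def round_robin():
    """Rotate through the behaviors, one per 300 ms time step."""
    return itertools.cycle(BEHAVIORS)

def random_scheduler(seed=0):
    """Pick a behavior uniformly at random on each time step."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(BEHAVIORS)

rr = round_robin()
print([next(rr) for _ in range(4)])  # OA, SF, LC, OA, ...
```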

80 Round robin might be expected to perform well in this task, because it is optimal in terms of minimizing long waits across the three behaviors. [sent-181, score-0.218]

81 The average reward per time step is 0.034 higher for the loss minimization scheduling than for the round robin scheduling. [sent-187, score-0.5]

82 The first is that the reward scale for this task does not start at zero: when taking completely random actions the agent receives an average of 4. [sent-189, score-0.409]

83 The second factor to consider is the sheer number of eye movements that a human makes over the course of a day: a conservative estimate is 150,000. [sent-193, score-0.838]

84 The average benefit of properly scheduling a single eye movement may be small, but the cumulative benefit is enormous. [sent-194, score-0.672]

85 Figure 3 (x-axis: percent of eye movements blocked, 0%, 33%, 66%): Comparison of loss minimization scheduling to round robin and random strategies. [sent-201, score-1.307]

86 In the 33% and 66% conditions the corresponding percentage of eye movements are randomly blocked, and no sensory input is allowed. [sent-203, score-0.815]

87 037 indicates the average reward received when all three behaviors are given access to perception at each time step. [sent-206, score-0.376]

88 To make this point more concrete, notice that over a period of one hour of sidewalk navigation the agent will lose around 370 units of reward if he uses round robin instead of the loss minimization approach. [sent-208, score-1.242]

89 In the currency of reward this is equal to 92 additional collisions with obstacles, 184 missed litter items, or two additional minutes spent off the sidewalk. [sent-209, score-0.322]

90 6 Related Work The action selection mechanism from Equation (2) is essentially a continuous state version of the Q-MDP algorithm for finding approximate solutions to POMDPs [16]. [sent-212, score-0.319]

91 The idea behind the Q-MDP algorithm is to first solve the underlying MDP, and then choose actions according to arg maxa Σs bel(s) Q(s, a), where bel(s) is the probability that the system is in state s and Q(s, a) is the optimal value function for the underlying MDP. [sent-214, score-0.274]
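
A toy version of Q-MDP action selection over a discrete belief state, with made-up numbers, is sketched below; the paper's own formulation replaces the discrete belief with Kalman-filter distributions over continuous states.

```python
import numpy as np

# Solve the underlying MDP for Q(s, a) (assumed known here), then choose
# arg max_a  sum_s bel(s) * Q(s, a).
Q = np.array([[1.0, 0.2],
              [0.1, 0.8],
              [0.4, 0.4]])          # 3 states x 2 actions (toy values)
bel = np.array([0.6, 0.3, 0.1])     # belief over states, sums to 1

def qmdp_action(bel, Q):
    return int(np.argmax(bel @ Q))

print(qmdp_action(bel, Q))  # -> 0
```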

92 In this work the Kalman filters serve precisely the role of maintaining a continuous belief state, and the problem of reducing uncertainty is handled through the separate mechanism of choosing eye movements to minimize loss. [sent-216, score-0.915]

93 The gaze control system introduced in [17] also addresses the problem of perceptual arbitration in the face of multiple goals. [sent-217, score-0.27]

94 7 Discussion and Conclusions Any system for controlling competing visuo-motor behaviors that all require access to a sensor such as the human eye faces a resource allocation problem. [sent-219, score-0.848]

95 Reward can be maximized by allocating gaze to the behavior that stands to lose the most. [sent-222, score-0.262]

96 As the simulations show, the performance of the algorithm is superior both to the round robin protocol and to a random allocation strategy. [sent-223, score-0.34]

97 It is possible for humans to examine locations in the visual scene without overt eye movements. [sent-224, score-0.609]

98 Finally, although the expected loss protocol is developed for eye movements, the computational strategy is very general and extends to any situation where there are multiple active behaviors that must compete for information gathering sensors. [sent-226, score-0.872]

99 The coordination of eye, head, and hand movements in a natural task. [sent-238, score-0.295]

100 When uncertainty matters: the selection of rapid goal-directed movements [abstract]. [sent-253, score-0.391]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eye', 0.484), ('sidewalk', 0.424), ('movements', 0.295), ('litter', 0.212), ('behaviors', 0.184), ('agent', 0.183), ('robin', 0.174), ('action', 0.134), ('scheduling', 0.127), ('state', 0.123), ('obstacle', 0.121), ('reward', 0.11), ('avoidance', 0.101), ('kalman', 0.093), ('round', 0.092), ('reinforcement', 0.092), ('rochester', 0.092), ('behavior', 0.09), ('actions', 0.081), ('arbitration', 0.077), ('qe', 0.077), ('sprague', 0.077), ('gaze', 0.076), ('inf', 0.075), ('loss', 0.07), ('visual', 0.07), ('navigation', 0.065), ('angle', 0.063), ('ae', 0.061), ('movement', 0.061), ('uncertainty', 0.06), ('human', 0.059), ('sb', 0.057), ('lose', 0.057), ('si', 0.056), ('humans', 0.055), ('obstacles', 0.054), ('rays', 0.05), ('serve', 0.05), ('control', 0.048), ('qi', 0.048), ('virtual', 0.048), ('policies', 0.045), ('perception', 0.044), ('expected', 0.044), ('allocation', 0.043), ('maxa', 0.043), ('perceptual', 0.043), ('gray', 0.043), ('return', 0.04), ('resource', 0.04), ('allocating', 0.039), ('ballard', 0.039), ('lossb', 0.039), ('saccades', 0.039), ('targeting', 0.039), ('inaccurate', 0.038), ('access', 0.038), ('minimization', 0.037), ('consequences', 0.037), ('retrieve', 0.037), ('sensory', 0.036), ('selection', 0.036), ('task', 0.035), ('targets', 0.034), ('sarsa', 0.034), ('bel', 0.034), ('humanoid', 0.034), ('active', 0.033), ('lter', 0.031), ('protocol', 0.031), ('routines', 0.031), ('dana', 0.031), ('distance', 0.03), ('perfect', 0.03), ('estimates', 0.03), ('units', 0.03), ('dopamine', 0.028), ('blocked', 0.028), ('routine', 0.028), ('select', 0.028), ('arg', 0.027), ('navigating', 0.027), ('goals', 0.027), ('continuous', 0.026), ('vision', 0.026), ('highest', 0.026), ('multiple', 0.026), ('guided', 0.024), ('demands', 0.024), ('environment', 0.024), ('relevant', 0.024), ('purpose', 0.024), ('dence', 0.023), ('propagate', 0.023), ('expresses', 0.023), ('bene', 0.023), ('environmental', 0.023), ('composite', 0.023), ('handling', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 68 nips-2003-Eye Movements for Reward Maximization

Author: Nathan Sprague, Dana Ballard

Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. If there are several candidate eye movements, the one with the highest expected value is chosen. The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. Simulations show our protocol is superior to a simple round robin scheduling mechanism. 1

2 0.21277717 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

3 0.1802702 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura

Abstract: When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1

4 0.13205567 67 nips-2003-Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina

Author: Matthias H. Hennig, Florentin Wörgötter

Abstract: Even under perfect fixation the human eye is under steady motion (tremor, microsaccades, slow drift). The “dynamic” theory of vision [1, 2] states that eye-movements can improve hyperacuity. According to this theory, eye movements are thought to create variable spatial excitation patterns on the photoreceptor grid, which will allow for better spatiotemporal summation at later stages. We reexamine this theory using a realistic model of the vertebrate retina by comparing responses of a resting and a moving eye. The performance of simulated ganglion cells in a hyperacuity task is evaluated by ideal observer analysis. We find that in the central retina eye-micromovements have no effect on the performance. Here optical blurring limits vernier acuity. In the retinal periphery however, eye-micromovements clearly improve performance. Based on ROC analysis, our predictions are quantitatively testable in electrophysiological and psychophysical experiments. 1

5 0.12604265 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

Author: Georgios Theocharous, Leslie P. Kaelbling

Abstract: Recent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.

6 0.10501403 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

7 0.10248181 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

8 0.10212676 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

9 0.098320507 78 nips-2003-Gaussian Processes in Reinforcement Learning

10 0.097699046 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

11 0.095240779 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors

12 0.093246028 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

13 0.092405766 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

14 0.08747422 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

15 0.076599486 62 nips-2003-Envelope-based Planning in Relational MDPs

16 0.075834796 55 nips-2003-Distributed Optimization in Adaptive Networks

17 0.072754793 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

18 0.072572723 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

19 0.071173742 158 nips-2003-Policy Search by Dynamic Programming

20 0.065498404 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.195), (1, 0.275), (2, -0.041), (3, -0.036), (4, -0.057), (5, 0.014), (6, -0.002), (7, -0.012), (8, -0.118), (9, -0.1), (10, 0.044), (11, 0.051), (12, 0.009), (13, 0.033), (14, 0.068), (15, 0.152), (16, 0.007), (17, 0.003), (18, -0.047), (19, 0.05), (20, -0.041), (21, -0.038), (22, 0.116), (23, -0.014), (24, 0.045), (25, 0.028), (26, 0.076), (27, 0.04), (28, 0.005), (29, 0.105), (30, 0.059), (31, 0.037), (32, -0.05), (33, -0.062), (34, -0.123), (35, -0.011), (36, -0.11), (37, -0.012), (38, -0.028), (39, -0.011), (40, -0.112), (41, -0.042), (42, -0.007), (43, -0.036), (44, 0.057), (45, -0.053), (46, -0.053), (47, -0.012), (48, -0.115), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95534819 68 nips-2003-Eye Movements for Reward Maximization

Author: Nathan Sprague, Dana Ballard

Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. If there are several candidate eye movements, the one with the highest expected value is chosen. The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. Simulations show our protocol is superior to a simple round robin scheduling mechanism. 1

2 0.67810786 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

3 0.65370744 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

Author: Marc Toussaint

Abstract: We present a connectionist architecture that can learn a model of the relations between perceptions and actions and use this model for behavior planning. State representations are learned with a growing selforganizing layer which is directly coupled to a perception and a motor layer. Knowledge about possible state transitions is encoded in the lateral connectivity. Motor signals modulate this lateral connectivity and a dynamic field on the layer organizes a planning process. All mechanisms are local and adaptation is based on Hebbian ideas. The model is continuous in the action, perception, and time domain.

4 0.60756421 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

Author: Saori C. Tanaka, Kenji Doya, Go Okada, Kazutaka Ueda, Yasumasa Okamoto, Shigeto Yamawaki

Abstract: To understand the brain mechanisms involved in reward prediction on different time scales, we developed a Markov decision task that requires prediction of both immediate and future rewards, and analyzed subjects’ brain activities using functional MRI. We estimated the time course of reward prediction and reward prediction error on different time scales from subjects' performance data, and used them as the explanatory variables for SPM analysis. We found topographic maps of different time scales in medial frontal cortex and striatum. The result suggests that different cortico-basal ganglia loops are specialized for reward prediction on different time scales. 1 Intro du ction In our daily life, we make decisions based on the prediction of rewards on different time scales; immediate and long-term effects of an action are often in conflict, and biased evaluation of immediate or future outcome can lead to pathetic behaviors. Lesions in the central serotonergic system result in impulsive behaviors in humans [1], and animals [2, 3], which can be attributed to deficits in reward prediction on a long time scale. Damages in the ventral part of medial frontal cortex (MFC) also cause deficits in decision-making that requires assessment of future outcomes [4-6]. A possible mechanism underlying these observations is that different brain areas are specialized for reward prediction on different time scales, and that the ascending serotonergic system activates those specialized for predictions in longer time scales [7]. The theoretical framework of temporal difference (TD) learning [8] successfully explains reward-predictive activities of the midbrain dopaminergic system as well as those of the cortex and the striatum [9-13]. In TD learning theory, the predicted amount of future reward starting from a state s(t) is formulated as the “value function” V(t) = E[r(t + 1) + γ r(t + 2) + γ 2r(t + 3) + …] (1) and learning is based on the TD error δ(t) = r(t) + γ V(t) – V(t - 1). (2) The ‘discount factor’ γ controls the time scale of prediction; while only the immediate reward r(t + 1) is considered with γ = 0, rewards in the longer future are taken into account with γ closer to 1. In order to test the above hypothesis [7], we developed a reinforcement learning task which requires a large value of discount factor for successful performance, and analyzed subjects’ brain activities using functional MRI. In addition to conventional block-design analysis, a novel model-based regression analysis revealed topographic representation of prediction time scale with in the cortico-basal ganglia loops. 2 2.1 Methods Markov Decision Task In the Markov decision task (Fig. 1), markers on the corners of a square present four states, and the subject selects one of two actions by pressing a button (a1 = left button, a2 = right button) (Fig. 1A). The action determines both the amount of reward and the movement of the marker (Fig. 1B). In the REGULAR condition, the next trial is started from the marker position at the end of the previous trial. Therefore, in order to maximize the reward acquired in a long run, the subject has to select an action by taking into account both the immediate reward and the future reward expected from the subsequent state. The optimal behavior is to receive small negative rewards at states s 2, s3, and s4 to obtain a large positive reward at state s1 (Fig. 1C). In the RANDOM condition, next trial is started from a random marker position so that the subject has to consider only immediate reward. 
Thus, the optimal behavior is to collect a larger reward at each state (Fig. 1D). In the baseline condition (NO condition), the reward is always zero. In order to learn the optimal behaviors, the discount factor γ has to be larger than 0.3425 in REGULAR condition, while it can be arbitrarily small in RANDOM condition. 2.2 fMRI imaging Eighteen healthy, right-handed volunteers (13 males and 5 females), gave informed consent to take part in the study, with the approval of the ethics and safety committees of ATR and Hiroshima University. A 0 Time 1.0 2.0 2.5 3.0 100 C B +r 2 s2 s1 REGULAR condition s2 -r 1 -r 2 +r 1 s1 100 D RANDOM condition +r 2 s2 s1 -r 1 +r 1 -r 2 -r 1 s4 +r 2 4.0 (s) -r 1 s3 a1 a2 r1 = 20 10 yen r2 = 100 10 yen +r 1 -r 1 s4 -r 1 -r 1 s3 s4 -r 1 s3 Fig. 1. (A) Sequence of stimulus and response events in the Markov decision task. First, one of four squares representing present state turns green (0s). As the fixation point turns green (1s), the subject presses either the right or left button within 1 second. After 1s delay, the green square changes its position (2s), and then a reward for the current action is presented by a number (2.5s) and a bar graph showing cumulative reward during the block is updated (3.0s). One trial takes four seconds. Subjects performed five trials in the NO condition, 32 trials in the RANDOM condition, five trials in the NO condition, and 32 trials in the REGULAR condition in one block. They repeated four blocks; thus, the entire experiment consisted of 312 trials, taking about 20 minutes. (B) The rule of the reward and marker movement. (C) In the REGULAR condition, the optimal behavior is to receive small negative rewards –r 1 (-10, -20, or -30 yen) at states s2, s3, and s4 to obtain a large positive reward +r2 (90, 100, or 110 yen) at state s1. (D) In the RANDOM condition, the next trial is started from random state. Thus, the optimal behavior is to select a larger reward at each state. A 1.5-Tesla scanner (Marconi, MAGNEX ECLIPSE, Japan) was used to acquire both structural T1-weighted images (TR = 12 s, TE = 450 ms, flip angle = 20 deg, matrix = 256 × 256, FoV = 256 mm, thickness = 1 mm, slice gap = 0 mm ) and T2*-weighted echo planar images (TR = 4 s, TE = 55 msec, flip angle = 90 deg, 38 transverse slices, matrix = 64 × 64, FoV = 192 mm, thickness = 4 mm, slice gap = 0 mm, slice gap = 0 mm) with blood oxygen level-dependent (BOLD) contrast. 2.3 Data analysis The data were preprocessed and analyzed with SPM99 (Friston et al., 1995; Wellcome Department of Cognitive Neurology, London, UK). The first three volumes of images were discarded to avoid T1 equilibrium effects. The images were realigned to the first image as a reference, spatially normalized with respect to the Montreal Neurological Institute EPI template, and spatially smoothed with a Gaussian kernel (8 mm, full-width at half-maximum). A RANDOM condition action larger reward Fig. 2. The selected action of a representative single subject (solid line) and the group average ratio of selecting optimal action (dashed line) in (A) RANDOM and (B) REGULAR conditions. smaller reward 1 32 64 96 128 96 128 trial REGULAR condition B action optimal nonoptimal 1 32 64 trial Images of parameter estimates for the contrast of interest were created for each subject. These were then used for a second-level group analysis using a one-sample t-test across the subjects (random effects analysis). We conducted two types of analysis. 
One was block design analysis using three boxcar regressors convolved with a hemodynamic response function as the reference waveform for each condition (RANDOM, REGULAR, and NO). The other was multivariate regression analysis using explanatory variables, representing the time course of the reward prediction V(t) and reward prediction error δ(t) estimated from subjects’ performance data (described below), in addition to three regressors representing the condition of the block. 2.4 Estimation of predicted reward V(t) and prediction error δ(t) The time course of reward prediction V(t) and reward prediction error δ(t) were estimated from each subject’s performance data, i.e. state s(t), action a(t), and reward r(t), as follows. If the subject starts from a state s(t) and comes back to the same state after k steps, the expected cumulative reward V(t) should satisfy the consistency condition V(t) = r(t + 1) + γ r(t + 2) + … + γ k-1 r(t + k) + γ kV(t). (3) Thus, for each time t of the data file, we calculated the weighted sum of the rewards acquired until the subject returned to the same state and estimated the value function for that episode as  r ( t + 1) + γ r ( t + 2 ) + ... + γ k −1r ( t + k )  . ˆ (t ) =  V 1− γ k (4) The estimate of the value function V(t) at time t was given by the average of all previous episodes from the same state as at time t V (t ) = 1 L L ∑ Vˆ ( t ) , l (5) l =1 where {t1, …, tL} are the indices of time visiting the same state as s(t), i.e. s(t1) = … = s(tL) = s(t). The TD error was given by the difference between the actual reward r(t) and the temporal difference of the value function V(t) according to equation (2). Assuming that different brain areas are involved in reward prediction on different time scales, we varied the discount factor γ as 0, 0.3, 0.6, 0.8, 0.9, and 0.99. Fig. 3. (A) In REGULAR vs. RANDOM comparison, significant activation was observed in DLPFC ((x, y, z) = (46, 45, 9), peak t = 4.06) (p < 0.001 uncorrected). (B) In RANDOM vs. REGULAR comparison, significant activation was observed in lateral OFC ((x, y, z) = (-32, 9, -21), peak t = 4.90) (p < 0.001 uncorrected). 3 3.1 R e sul t s Behavioral results Figure 2 summarizes the learning performance of a representative single subject (solid line) and group average (dashed line) during fMRI measurement. Fourteen subjects successfully learned to take larger immediate rewards in the RANDOM condition (Fig. 2A) and a large positive reward at s1 after small negative rewards at s2, s3 and s4 in the REGULAR condition (Fig. 2B). 3.2 Block-design analysis In REGULAR vs. RANDOM contrast, we observed a significant activation in the dorsolateral prefrontal cortex (DLPFC) (Fig. 3A) (p < 0.001 uncorrected). In RANDOM vs. REGULAR contrast, we observed a significant activation in lateral orbitofrontal cortex (lOFC) (Fig. 3B) (p < 0.001 uncorrected). The result of block-design analysis suggests differential involvement of neural pathways in reward prediction on long and short time scales. The result in RANDOM vs. REGULAR contrast was consistent with previous studies that the OFC is involved in reward prediction within a short delay and reward outcome [14-20]. 3.3 Regression analysis We observed significant correlation with reward prediction V(t) in the MFC, DLPFC (all γ ), ventromedial insula (small γ ), dorsal striatum, amygdala, hippocampus, and parahippocampal gyrus (large γ ) (p < 0.001 uncorrected) (Fig. 4A). 
We also found significant correlation with reward prediction error δ(t) in the IPC, PMd, cerebellum (all γ ), ventral striatum (small γ ), and lateral OFC (large γ ) (p < 0.001 uncorrected) (Fig. 4B). As we changed the time scale parameter γ of reward prediction, we found rostro-caudal maps of correlation to V(t) in MFC with increasing γ. Fig. 4. Voxels with a significant correlation (p < 0.001 uncorrected) with reward prediction V(t) and prediction error δ(t) are shown in different colors for different settings of the time scale parameter (γ = 0 in red, γ = 0.3 in orange, γ = 0.6 in yellow, γ = 0.8 in green, γ = 0.9 in cyan, and γ = 0.99 in blue). Voxels correlated with two or more regressors are shown by a mosaic of colors. (A) Significant correlation with reward prediction V(t) was observed in the MFC, DLPFC, dorsal striatum, insula, and hippocampus. Note the anterior-ventral to posterior-dorsal gradient with the increase in γ in the MFC. (B) Significant correlation with reward prediction error δ(t) on γ = 0 was observed in the ventral striatum. 4 D i s c u ss i o n In the MFC, anterior and ventral part was involved in reward prediction V(t) on shorter time scales (0 ≤ γ ≤ 0.6), whereas posterior and dorsal part was involved in reward prediction V(t) on longer time scales (0.6 ≤ γ ≤ 0.99). The ventral striatum involved in reward prediction error δ(t) on shortest time scale (γ = 0), while the dorsolateral striatum correlated with reward prediction V(t) on longer time scales (0.9 ≤ γ ≤ 0.99). These results are consistent with the topographic organization of fronto-striatal connection; the rostral part of the MFC project to the ventral striatum, whereas the dorsal and posterior part of the cingulate cortex project to the dorsolateral striatum [21]. In the MFC and the striatum, no significant difference in activity was observed in block-design analysis while we did find graded maps of activities with different values of γ. A possible reason is that different parts of the MFC and the striatum are concurrently involved with reward prediction on different time scales, regardless of the task context. Activities of the DLPFC and lOFC, which show significant differences in block-design analysis (Fig. 3), may be regulated according to the necessity for the task; From these results, we propose the following mechanism of reward prediction on different time scales. The parallel cortico-basal ganglia loops are responsible for reward prediction on various time scales. The ‘limbic loop’ via the ventral striatum specializes in immediate reward prediction, whereas the ‘cognitive and motor loop’ via the dorsal striatum specialises in future reward prediction. Each loop learns to predict rewards on its specific time scale. To perform an optimal action under a given time scale, the output of the loop with an appropriate time scale is used for actual action selection. Previous studies in brain damages and serotonergic functions suggest that the MFC and the dorsal raphe, which are reciprocally connected [22, 23], play an important role in future reward prediction. The cortico-cortico projections from the MFC, or the serotonergic projections from the dorsal raphe to the cortex and the striatum may be involved in the modulation of these parallel loops. In present study, using a novel regression analysis based on subjects’ performance data and reinforcement learning model, we revealed the maps of time scales in reward prediction, which could not be found by conventional block-design analysis. 
Future studies using this method under pharmacological manipulation of the serotonergic system would clarify the role of serotonin in regulating the time scale of reward prediction. Acknowledgments We thank Nicolas Schweighofer, Kazuyuki Samejima, Masahiko Haruno, Hiroshi Imamizu, Satomi Higuchi, Toshinori Yoshioka, and Mitsuo Kawato for helpful discussions and technical advice. References [1] Rogers, R.D., et al. (1999) Dissociable deficits in the decision-making cognition of chronic amphetamine abusers, opiate abusers, patients with focal damage to prefrontal cortex, and tryptophan-depleted normal volunteers: evidence for monoaminergic mechanisms. Neuropsychopharmacology 20(4):322-339. [2] Evenden, J.L. & Ryan, C.N. (1996) The pharmacology of impulsive behaviour in rats: the effects of drugs on response choice with varying delays of reinforcement. Psychopharmacology (Berl) 128(2):161-170. [3] Mobini, S., et al. (2000) Effects of central 5-hydroxytryptamine depletion on sensitivity to delayed and probabilistic reinforcement. Psychopharmacology (Berl) 152(4):390-397. [4] Bechara, A., et al. (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1-3):7-15. [5] Bechara, A., Tranel, D. & Damasio, H. (2000) Characterization of the decision-making deficit of patients with ventromedial prefrontal cortex lesions. Brain 123:2189-2202. [6] Mobini, S., et al. (2002) Effects of lesions of the orbitofrontal cortex on sensitivity to delayed and probabilistic reinforcement. Psychopharmacology (Berl) 160(3):290-298. [7] Doya, K. (2002) 15(4-6):495-506. Metalearning and neuromodulation. Neural Netw [8] Sutton, R.S., Barto, A. G. (1998) Reinforcement learning. Cambridge, MA: MIT press. [9] Houk, J.C., Adams, J.L. & Barto, A.G., A model of how the basal ganglia generate and use neural signals that predict reinforcement, in Models of information processing in the basal ganglia, J.C. Houk, J.L. Davis, and D.G. Beiser, Editors. 1995, MIT Press: Cambridge, Mass. p. 249-270. [10] Schultz, W., Dayan, P. & Montague, P.R. (1997) A neural substrate of prediction and reward. Science 275(5306):1593-1599. [11] Doya, K. (2000) Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr Opin Neurobiol 10(6):732-739. [12] Berns, G.S., et al. (2001) Predictability modulates human brain response to reward. J Neurosci 21(8):2793-2798. [13] O'Doherty, J.P., et al. (2003) Temporal difference models and reward-related learning in the human brain. Neuron 38(2):329-337. [14] Koepp, M.J., et al. (1998) Evidence for striatal dopamine release during a video game. Nature 393(6682):266-268. [15] Rogers, R.D., et al. (1999) Choosing between small, likely rewards and large, unlikely rewards activates inferior and orbital prefrontal cortex. J Neurosci 19(20):9029-9038. [16] Elliott, R., Friston, K.J. & Dolan, R.J. (2000) Dissociable neural responses in human reward systems. J Neurosci 20(16):6159-6165. [17] Breiter, H.C., et al. (2001) Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron 30(2):619-639. [18] Knutson, B., et al. (2001) Anticipation of increasing monetary reward selectively recruits nucleus accumbens. J Neurosci 21(16):RC159. [19] O'Doherty, J.P., et al. (2002) Neural responses during anticipation of a primary taste reward. Neuron 33(5):815-826. [20] Pagnoni, G., et al. (2002) Activity in human ventral striatum locked to errors of reward prediction. Nat Neurosci 5(2):97-98. [21] Haber, S.N., et al. 
(1995) The orbital and medial prefrontal circuit through the primate basal ganglia. J Neurosci 15(7 Pt 1):4851-4867. [22] Celada, P., et al. (2001) Control of dorsal raphe serotonergic neurons by the medial prefrontal cortex: Involvement of serotonin-1A, GABA(A), and glutamate receptors. J Neurosci 21(24):9917-9929. [23] Martin-Ruiz, R., et al. (2001) Control of serotonergic function in medial prefrontal cortex by serotonin-2A receptors through a glutamate-dependent mechanism. J Neurosci 21(24):9856-9866.

5 0.59733039 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura

Abstract: When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1

6 0.49079815 67 nips-2003-Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina

7 0.48011118 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

8 0.4520514 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

9 0.42945188 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

10 0.41312686 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors

11 0.39174888 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

12 0.38209286 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

13 0.38072515 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

14 0.37818855 62 nips-2003-Envelope-based Planning in Relational MDPs

15 0.36549571 43 nips-2003-Bounded Invariance and the Formation of Place Fields

16 0.36438993 78 nips-2003-Gaussian Processes in Reinforcement Learning

17 0.35833228 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

18 0.35655433 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

19 0.35498592 158 nips-2003-Policy Search by Dynamic Programming

20 0.32536891 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.045), (11, 0.024), (29, 0.022), (30, 0.025), (35, 0.062), (53, 0.086), (69, 0.016), (71, 0.042), (76, 0.039), (85, 0.091), (86, 0.276), (91, 0.155), (99, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81267536 68 nips-2003-Eye Movements for Reward Maximization

Author: Nathan Sprague, Dana Ballard

Abstract: Recent eye tracking studies in natural tasks suggest that there is a tight link between eye movements and goal directed motor actions. However, most existing models of human eye movements provide a bottom up account that relates visual attention to attributes of the visual scene. The purpose of this paper is to introduce a new model of human eye movements that directly ties eye movements to the ongoing demands of behavior. The basic idea is that eye movements serve to reduce uncertainty about environmental variables that are task relevant. A value is assigned to an eye movement by estimating the expected cost of the uncertainty that will result if the movement is not made. If there are several candidate eye movements, the one with the highest expected value is chosen. The model is illustrated using a humanoid graphic figure that navigates on a sidewalk in a virtual urban environment. Simulations show our protocol is superior to a simple round robin scheduling mechanism. 1

2 0.61425734 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

3 0.6121279 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

Author: Maneesh Sahani

Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1

4 0.60377151 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

5 0.60093033 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

6 0.59396899 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

7 0.59079552 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

8 0.58995581 107 nips-2003-Learning Spectral Clustering

9 0.58711529 55 nips-2003-Distributed Optimization in Adaptive Networks

10 0.58408821 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

11 0.58401614 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

12 0.58394712 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

13 0.58384156 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

14 0.58368659 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

15 0.58353817 158 nips-2003-Policy Search by Dynamic Programming

16 0.58320296 113 nips-2003-Learning with Local and Global Consistency

17 0.58285183 73 nips-2003-Feature Selection in Clustering Problems

18 0.58271956 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

19 0.58141965 30 nips-2003-Approximability of Probability Distributions

20 0.58048546 81 nips-2003-Geometric Analysis of Constrained Curves