nips nips2005 nips2005-26 knowledge-graph by maker-knowledge-mining

26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity


Source: pdf

Author: Samuel M. McClure, Mark S. Gilzenrat, Jonathan D. Cohen

Abstract: We propose a model by which dopamine (DA) and norepinepherine (NE) combine to alternate behavior between relatively exploratory and exploitative modes. The model is developed for a target detection task for which there is extant single neuron recording data available from locus coeruleus (LC) NE neurons. An exploration-exploitation trade-off is elicited by regularly switching which of the two stimuli are rewarded. DA functions within the model to change synaptic weights according to a reinforcement learning algorithm. Exploration is mediated by the state of LC firing, with higher tonic and lower phasic activity producing greater response variability. The opposite state of LC function, with lower baseline firing rate and greater phasic responses, favors exploitative behavior. Changes in LC firing mode result from combined measures of response conflict and reward rate, where response conflict is monitored using models of anterior cingulate cortex (ACC). Increased long-term response conflict and decreased reward rate, which occurs following reward contingency switch, favors the higher tonic state of LC function and NE release. This increases exploration, and facilitates discovery of the new target.

1 Introduction

A central problem in reinforcement learning is determining how to adaptively move between exploitative and exploratory behaviors in changing environments. We propose a set of neurophysiologic mechanisms whose interaction may mediate this behavioral shift. Empirical work on the midbrain dopamine (DA) system has suggested that this system is particularly well suited for guiding exploitative behaviors. This hypothesis has been reified by a number of studies showing that a temporal difference (TD) learning algorithm accounts for activity in these neurons in a wide variety of behavioral tasks [1,2]. DA release is believed to encode a reward prediction error signal that acts to change synaptic weights relevant for producing behaviors [3]. Through learning, this allows neural pathways to predict future expected reward through the relative strength of their synaptic connections [1]. Decision-making procedures based on these value estimates are necessarily greedy. Including reward bonuses for exploratory choices supports non-greedy actions [4] and accounts for additional data derived from DA neurons [5]. We show that combining a DA learning algorithm with models of response conflict detection [6] and NE function [7] produces an effective annealing procedure for alternating between exploration and exploitation. NE neurons within the LC alternate between two firing modes [8]. In the first mode, known as the phasic mode, NE neurons fire at a low baseline rate but have relatively robust phasic responses to behaviorally salient stimuli. The second mode, called the tonic mode, is associated with a higher baseline firing and absent or attenuated phasic responses. The effects of NE on efferent areas are modulatory in nature, and are well captured as a change in the gain of efferent inputs so that neuronal responses are potentiated in the presence of NE [9]. Thus, in phasic mode, the LC provides transient facilitation in processing, time-locked to the presence of behaviorally salient information in motor or decision areas. Conversely, in tonic mode, higher overall LC discharge rate increases gain generally and hence increases the probability of arbitrary responding. Consistent with this account, for periods when NE neurons are in the phasic mode, monkey performance is nearly perfect.
However, when NE neurons are in the tonic mode, performance is more erratic, with increased response times and error rate [8]. These findings have led to a recent characterization of the LC as a dynamic temporal filter, adjusting the system's relative responsivity to salient and irrelevant information [8]. In this way, the LC is ideally positioned to mediate the shift between exploitative and exploratory behavior. The parameters that underlie changes in LC firing mode remain largely unexplored. Based on data from a target detection task by Aston-Jones and colleagues [10], we propose that LC firing mode is determined in part by measures of response conflict and reward rate as calculated by the ACC and OFC, respectively [8]. Together, the ACC and OFC are the principal sources of cortical input to the LC [8]. Activity in the ACC is known, largely through human neuroimaging experiments, to change in accord with response conflict [6]. In brief, relatively equal activity in competing behavioral responses (reflecting uncertainty) produces high conflict. Low conflict results when one behavioral response predominates. We propose that increased long-term response conflict biases the LC towards a tonic firing mode. Increased conflict necessarily follows changes in reward contingency. As the previously rewarded target no longer produces reward, there will be a relative increase in response ambiguity and hence conflict. This relationship between conflict and LC firing is analogous to other modeling work [11], which proposes that increased tonic firing reflects increased environmental uncertainty. As a final component to our model, we hypothesize that the OFC maintains an ongoing estimate of reward rate, and that this estimate of reward rate also influences LC firing mode. As reward rate increases, we assume that the OFC tends to bias the LC in favor of phasic firing to target stimuli. We have aimed to fix model parameters based on previous work using simpler networks. We use parameters derived primarily from a previous model of the LC by Gilzenrat and colleagues [7]. Integration of response conflict by the ACC and its influence on LC firing was borrowed from unpublished work by Gilzenrat and colleagues in which they fit human behavioral data in a diminishing utilities task. Given this approach, we interpret our observed improvement in model performance with combined NE and DA function as validation of a mechanism for automatically switching between exploitative and exploratory action selection.

2 Go/No-Go Task and Core Model

We have modeled an experiment in which monkeys performed a target detection task [10]. In the task, monkeys were shown either a vertical bar or a horizontal bar and were required to make or omit a motor response appropriately. Initially, the vertical bar was the target stimulus and correctly responding was rewarded with a squirt of fruit juice (r = 1 in the model). Responding to the non-target horizontal stimulus resulted in time-out punishment (r = -0.1; Figure 1A). No responses to either the target or non-target gave zero reward. After the monkeys had fully acquired the task, the experimenters periodically switched the reward contingency such that the previously rewarded stimulus (target) became the distractor, and vice versa. Following such reversals, LC neurons were observed to change from emitting phasic bursts of firing to the target, to tonic firing following the switch, and slowly back to phasic firing for the new target as the new response criterion was obtained [10].
Figure 1: Task and model design. (A) Responses were required for targets in order to obtain reward. Responses to distractors resulted in a minor punishment. No responses gave zero reward. (B) In the model, vertical and horizontal bar inputs (I1 and I2) fed to integrator neurons (X1 and X2), which then drove response units (Y1 and Y2). Responses were made if Y1 or Y2 crossed a threshold while input units were active.

We have previously modeled this task [7,12] with a three-layer connectionist network in which two input units, I1 and I2, corresponding to the vertical and horizontal bars, drive two mutually inhibitory integrator units, X1 and X2. The integrator units subsequently feed two response units, Y1 and Y2 (Figure 1B). Responses are made whenever output from Y1 or Y2 crosses a threshold level of activity, θ. Relatively weak cross connections from each input unit to the opposite integrator unit (I1 to X2 and I2 to X1) are intended to model stimulus similarity. Both the integrator and response units were modeled as noisy, leaky accumulators:

$\dot{X}_i = -X_i + w_{X_i I_i} I_i + w_{X_i I_j} I_j - w_{X_i X_j} f(X_j) + \xi_i$   (1)
$\dot{Y}_i = -Y_i + w_{Y_i X_i} f(X_i) - w_{Y_i Y_j} f(Y_j) + \xi_i$
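
To make the network dynamics concrete, the sketch below simulates the two-layer accumulator circuit of Eq. 1 with a threshold readout. This is not the authors' code: the Euler step size, noise level, weight values, and the logistic form of the response function f (standing in for the unspecified equation 3) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, gain=1.0, bias=1.0):
    """Assumed logistic response function; its gain is what LC/NE modulates later in the model."""
    return 1.0 / (1.0 + np.exp(-gain * (x - bias)))

# Illustrative parameters (not the paper's fitted values)
w_in, w_cross = 1.0, 0.3     # strong direct and weak cross input -> integrator weights
w_XX, w_YY = 1.0, 1.0        # mutual inhibition within each layer
w_YX = np.array([1.0, 1.0])  # integrator -> response weights (learned via DA in the full model)
dt, noise_sd, theta = 0.1, 0.05, 0.5

def run_trial(stimulus, n_steps=200):
    """Present input `stimulus` (0 or 1); return the index of the response unit that
    crosses threshold theta while the input is active, or None for no response."""
    I = np.zeros(2)
    I[stimulus] = 1.0
    X, Y = np.zeros(2), np.zeros(2)
    for _ in range(n_steps):
        fX, fY = f(X), f(Y)
        # Eq. 1: leaky, noisy accumulation with weak cross-talk and mutual inhibition
        dX = -X + w_in * I + w_cross * I[::-1] - w_XX * fX[::-1] + rng.normal(0, noise_sd, 2)
        dY = -Y + w_YX * fX - w_YY * fY[::-1] + rng.normal(0, noise_sd, 2)
        X += dt * dX
        Y += dt * dY
        out = f(Y)
        if out.max() > theta:
            return int(out.argmax())
    return None

print(run_trial(stimulus=0))
```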

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We propose a model by which dopamine (DA) and norepinepherine (NE) combine to alternate behavior between relatively exploratory and exploitative modes. [sent-7, score-0.326]

2 The model is developed for a target detection task for which there is extant single neuron recording data available from locus coeruleus (LC) NE neurons. [sent-8, score-0.205]

3 Exploration is mediated by the state of LC firing, with higher tonic and lower phasic activity producing greater response variability. [sent-11, score-0.626]

4 The opposite state of LC function, with lower baseline firing rate and greater phasic responses, favors exploitative behavior. [sent-12, score-0.664]

5 Changes in LC firing mode result from combined measures of response conflict and reward rate, where response conflict is monitored using models of anterior cingulate cortex (ACC). [sent-13, score-1.888]

6 Increased long-term response conflict and decreased reward rate, which occurs following reward contingency switch, favors the higher tonic state of LC function and NE release. [sent-14, score-1.312]

7 1 Introduction A central problem in reinforcement learning is determining how to adaptively move between exploitative and exploratory behaviors in changing environments. [sent-16, score-0.214]

8 Empirical work on the midbrain dopamine (DA) system has suggested that this system is particularly well suited for guiding exploitative behaviors. [sent-18, score-0.202]

9 This hypothesis has been reified by a number of studies showing that a temporal difference (TD) learning algorithm accounts for activity in these neurons in a wide variety of behavioral tasks [1,2]. [sent-19, score-0.223]

10 DA release is believed to encode a reward prediction error signal that acts to change synaptic weights relevant for producing behaviors [3]. [sent-20, score-0.342]

11 Through learning, this allows neural pathways to predict future expected reward through the relative strength of their synaptic connections [1]. [sent-21, score-0.28]

12 Including reward bonuses for exploratory choices supports non-greedy actions [4] and accounts for additional data derived from DA neurons [5]. [sent-23, score-0.45]

13 We show that combining a DA learning algorithm with models of response conflict detection [6] and NE function [7] produces an effective annealing procedure for alternating between exploration and exploitation. [sent-24, score-0.715]

14 NE neurons within the LC alternate between two firing modes [8]. [sent-25, score-0.321]

15 In the first mode, known as the phasic mode, NE neurons fire at a low baseline rate but have relatively robust phasic responses to behaviorally salient stimuli. [sent-26, score-0.678]

16 The second mode, called the tonic mode, is associated with a higher baseline firing and absent or attenuated phasic responses. [sent-27, score-0.605]

17 The effects of NE on efferent areas are modulatory in nature, and are well captured as a change in the gain of efferent inputs so that neuronal responses are potentiated in the presence of NE [9]. [sent-28, score-0.237]

18 Thus, in phasic mode, the LC provides transient facilitation in processing, time-locked to the presence of behaviorally salient information in motor or decision areas. [sent-29, score-0.234]

19 Conversely, in tonic mode, higher overall LC discharge rate increases gain generally and hence increases the probability of arbitrary responding. [sent-30, score-0.301]

20 Consistent with this account, for periods when NE neurons are in the phasic mode, monkey performance is nearly perfect. [sent-31, score-0.275]

21 However, when NE neurons are in the tonic mode, performance is more erratic, with increased response times and error rate [8]. [sent-32, score-0.495]

22 In this way, the LC is ideally positioned to mediate the shift between exploitative and exploratory behavior. [sent-34, score-0.215]

23 The parameters that underlie changes in LC firing mode remain largely unexplored. [sent-35, score-0.4]

24 Based on data from a target detection task by Aston-Jones and colleagues [10], we propose that LC firing mode is determined in part by measures of response conflict and reward rate as calculated by the ACC and OFC, respectively [8]. [sent-36, score-1.464]

25 Activity in the ACC is known, largely through human neuroimaging experiments, to change in accord with response conflict [6]. [sent-38, score-0.616]

26 In brief, relatively equal activity in competing behavioral responses (reflecting uncertainty) produces high conflict. [sent-39, score-0.282]

27 We propose that increased long-term response conflict biases the LC towards a tonic firing mode. [sent-41, score-1.062]

28 As the previously rewarded target no longer produces reward, there will be a relative increase in response ambiguity and hence conflict. [sent-43, score-0.324]

29 This relationship between conflict and LC firing is analogous to other modeling work [11], which proposes that increased tonic firing reflects increased environmental uncertainty. [sent-44, score-1.164]

30 As a final component to our model, we hypothesize that the OFC maintains an ongoing estimate in reward rate, and that this estimate of reward rate also influences LC firing mode. [sent-45, score-0.86]

31 As reward rate increases, we assume that the OFC tends to bias the LC in favor of phasic firing to target stimuli. [sent-46, score-0.818]

32 Integration of response conflict by the ACC and its influence on LC firing was borrowed from unpublished work by Gilzenrat and colleagues in which they fit human behavioral data in a diminishing utilities task. [sent-49, score-0.999]

33 Given this approach, we interpret our observed improvement in model performance with combined NE and DA function as validation of a mechanism for automatically switching between exploitative and exploratory action selection. [sent-50, score-0.266]

34 2 Go/No-Go Task and Core Model We have modeled an experiment in which monkeys performed a target detection task [10]. [sent-51, score-0.172]

35 In the task, monkeys were shown either a vertical bar or a horizontal bar and were required to make or omit a motor response appropriately. [sent-52, score-0.357]

36 Initially, the vertical bar was the target stimulus and correctly responding was rewarded with a squirt of fruit juice (r=1 in the model). [sent-53, score-0.298]

37 After the monkeys had fully acquired the task, the experimenters periodically switched the reward contingency such that the previously rewarded stimulus (target) became the distractor, and vice versa. [sent-57, score-0.502]

38 Following such reversals, LC neurons were observed to change from emitting phasic bursts of firing to the target, to tonic firing following the switch, and slowly back to phasic firing for the new target as the new response criteria was obtained [10]. [sent-58, score-1.58]

39 (B) In the model, vertical and horizontal bar inputs (I1 and I 2 ) fed to integrator neurons (X1 and X2 ) which then drove response units (Y1 and Y2 ). [sent-63, score-0.646]

40 Responses were made if Y 1 or Y2 crossed a threshold while input units were active. [sent-64, score-0.259]

41 We have previously modeled this task [7,12] with a three-layer connectionist network in which two input units, I1 and I 2 , corresponding to the vertical and horizontal bars, drive two mutually inhibitory integrator units, X1 and X2 . [sent-65, score-0.253]

42 The integrator units subsequently feed two response units, Y1 and Y2 (Figure 1B). [sent-66, score-0.464]

43 Relatively weak cross connections from each input unit to the opposite integrator unit (I1 to X2 and I 2 to X1 ) are intended to model stimulus similarity. [sent-68, score-0.304]

44 Both the integrator and response units were modeled as noisy, leaky accumulators: $\dot{X}_i = -X_i + w_{X_i I_i} I_i + w_{X_i I_j} I_j - w_{X_i X_j} f(X_j) + \xi_i$ (1) and $\dot{Y}_i = -Y_i + w_{Y_i X_i} f(X_i) - w_{Y_i Y_j} f(Y_j) + \xi_i$. [sent-69, score-0.464]

45 Response units, Y, were given a positive bias, b, and integrator units were unbiased. [sent-75, score-0.286]

46 Simulation of stimulus presentations involved setting one of the input units to a value of 1. [sent-80, score-0.229]

47 Activations of I1 and I2 were alternated and 20 units of model time were allowed between presentations for the integrator and response units to relax to baseline levels of activity. [sent-82, score-0.706]

48 After 50 presentations of I 1 and I 2 the reward contingencies were switched; the model was run through 6 such blocks and reversals. [sent-84, score-0.39]
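
As a rough illustration of the block structure described above, the sketch below alternates presentations of the two inputs and reverses the rewarded target after each block of 50 presentations; reward values follow the task description (r = 1 for a response to the target, r = -0.1 for a response to the distractor, 0 otherwise). The stand-in responder and all parameter choices are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_trial(stimulus):
    """Stand-in responder; replace with the accumulator model sketched earlier."""
    return stimulus if rng.random() < 0.8 else None

def run_experiment(n_blocks=6, presentations_per_block=50):
    """Alternate I1/I2 presentations and reverse the reward contingency after each block."""
    target, rewards = 0, []           # the vertical bar (input 0) is the initial target
    for _ in range(n_blocks):
        for t in range(presentations_per_block):
            stimulus = t % 2          # alternate presentations of I1 and I2
            responded = run_trial(stimulus) is not None
            if not responded:
                r = 0.0               # omitted responses earn nothing
            elif stimulus == target:
                r = 1.0               # correct response to the target
            else:
                r = -0.1              # time-out punishment for responding to the distractor
            rewards.append(r)
            # ...DA weight update (Eq. 5) and the LC updates (Eq. 12 and the C rule) go here...
        target = 1 - target           # reward-contingency reversal
    return rewards

print(np.mean(run_experiment()))
```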

49 The response during each stimulus presentation was determined by which of the two response units first crossed a threshold of output activity (i. [sent-85, score-0.798]

50 f(Y1 ) > θ), or was a no response if neither unit crossed threshold. [sent-87, score-0.273]

51 A reward unit, r, was included that had activity 0 except at the end of each stimulus presentation when its activity was set equal to the obtained reward outcome. [sent-89, score-0.847]

52 Inhibitory inputs from the response units served as measures of expected reward. [sent-90, score-0.366]

53 The output of dopamine neurons was used to update the weights along the pathway from integrator to response units. [sent-92, score-0.233]

54 Thus, at the end of every stimulus presentation, the weights between response units and DA neurons were updated according to $w_{\delta Y_i}(t+1) = w_{\delta Y_i}(t) + \lambda\,\delta(t)\,Z(Y_i)$ (5), where the learning rate, λ, was set to 0. [sent-94, score-0.509]

55 This learning rule allowed the weights to converge to the expected reward for selecting each of the two actions. [sent-96, score-0.315]
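
A minimal sketch of the update in Eq. 5, assuming the DA signal δ(t) is a reward-prediction error computed as the obtained reward minus the expectation carried by the inhibitory inputs from the response units (as described above), and that Z(Y_i) marks which response unit crossed threshold. The variable names and the exact form of the prediction are assumptions rather than the paper's code.

```python
import numpy as np

lam = 0.1  # learning rate lambda; the actual value is truncated in the extracted text

def dopamine_update(w_dY, r, Z):
    """Eq. 5: w_dY[i] <- w_dY[i] + lambda * delta(t) * Z(Y_i).

    w_dY : weights from the response units onto the DA unit; after learning they come to
           encode the expected reward for selecting each action
    r    : reward delivered at the end of the stimulus presentation
    Z    : per-response-unit eligibility, e.g. 1 for the unit that crossed threshold, else 0
    """
    delta = r - np.dot(w_dY, Z)      # prediction error: obtained minus expected reward
    return w_dY + lam * delta * Z, delta

w_dY = np.zeros(2)
w_dY, delta = dopamine_update(w_dY, r=1.0, Z=np.array([1.0, 0.0]))
print(w_dY, delta)
```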

56 Weights between integrator and response units were updated using the same rule. [sent-97, score-0.501]

57 8, sufficient activity never accumulated in the response units to allow discovery of new reward contingencies. [sent-102, score-0.727]

58 As the model learned, the weights along the target pathway obtained a maximum value while those along the distractor pathway obtained a minimum value. [sent-103, score-0.262]

59 The only way the model was able to obtain the new target was by noise pushing the new target response unit above threshold. [sent-105, score-0.363]

60 However, this also resulted in a high rate of responding to non-target stimuli even after learning. [sent-108, score-0.161]

61 In order to reduce responding to the distractor, the threshold had to be raised, which also increased the time required to adapt following reward reversals. [sent-109, score-0.507]

62 The network was initialized with equal preference for responding to input 1 or 2, and generally acquired the initial target faster than after reversals (see Figure 2B). [sent-110, score-0.212]

63 (A) DA neurons, δ , modulated weights from integrator to response units in order to modulate the probability of responding to each input. [sent-117, score-0.582]

64 (B) The model successfully increases and decreases responding to inputs 1 and 2 as reward contingencies reverse. [sent-118, score-0.491]

65 However, the model is unable to simultaneously obtain the new response quickly and maintain a low error rate once the response is learned. [sent-119, score-0.434]

66 When threshold is relatively low (left plot), the model adapts quickly but makes frequent responses to the distractor. [sent-120, score-0.2]

67 At higher threshold, responses are correctly omitted to the distractor, but the model acquires the new response slowly. [sent-121, score-0.282]

68 Previous work has shown that these equations, with simple modifications, capture the fundamental aspects of tonic and phasic mode activity in the LC [7]. [sent-124, score-0.570]

69 The model included two inputs to the LC from the integrator units (X1 and X2 ) with modifiable weights. [sent-127, score-0.338]

70 In order to change firing mode, h can be modified so that the dynamics of u depend entirely on the state of the LC or so that the dynamics are independent of state. [sent-131, score-0.277]

71 0, the model is appropriately dampened and can burst sharply and return to a relatively low baseline level of activity (phasic mode). [sent-135, score-0.202]

72 When C is small, the LC receives a fixed level of inhibition, which simultaneously reduces bursting activity and increases baseline activity (tonic mode) [7]. [sent-136, score-0.282]

73 The primary function of the LC in the model is to modify the gain, g, of the response function of the integrator and response units as in equation 3. [sent-137, score-0.707]
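
Equation 3 is not reproduced in this extract; the sketch below assumes the gain-modulated logistic used in related LC models, in which higher NE release raises the gain g and sharpens the response function. Treat the functional form and the parameter values as assumptions.

```python
import numpy as np

def response_function(x, g=1.0, b=1.0):
    """Assumed form of the response function: logistic with multiplicative gain g and bias b."""
    return 1.0 / (1.0 + np.exp(-g * (x - b)))

x = np.linspace(-2.0, 4.0, 7)
print(response_function(x, g=0.5))  # low gain (tonic-like): shallow, noise-dominated responding
print(response_function(x, g=3.0))  # high gain (phasic-like): sharp, stimulus-locked responding
```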

74 The value of C was updated after every trial by measures of response conflict and reward rate. [sent-140, score-0.995]

75 Response conflict was calculated as a normalized measure of the energy in the response units during the trial. [sent-141, score-0.723]

76 The conflict during the trial, K, was computed as a normalized function of the product Y1 · Y2 (Eq. 10), which correctly measures energy since Y1 and Y2 are connected with weight –1. [sent-144, score-0.5]
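
The exact normalization in Eq. 10 is not recoverable from the extracted text; the sketch below only illustrates the underlying idea, a Hopfield-style energy over the two mutually inhibitory response units (connection weight -1) that is large when both units are co-active. The normalization step is deliberately omitted.

```python
import numpy as np

def trial_conflict(y1, y2):
    """Energy-style conflict for one trial, given response-unit time courses y1 and y2.
    With a -1 connection between Y1 and Y2, the Hopfield energy is proportional to y1 * y2,
    so conflict is high when both response units are active at once.  The normalization
    used in Eq. 10 is omitted here because it is not recoverable from the extracted text."""
    return float(np.mean(y1 * y2))

t = np.linspace(0.0, 1.0, 50)
print(trial_conflict(t, t))         # both units co-active: high conflict
print(trial_conflict(t, 0.05 * t))  # one unit dominates: low conflict
```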

77 Based on previous work [8], we let conflict modify C separately based on a short-term, KS, and long-term, KL, measure. [sent-147, score-0.448]

78 We let short- and long-term conflict have opposing effects on the firing mode of the LC. [sent-153, score-0.811]

79 When short-term conflict increases, the LC is biased towards phasic firing (increased C). [sent-156, score-0.838]

80 However, when long-term conflict increases this is taken to indicate that the current decision strategy is not working. [sent-158, score-0.446]

81 Therefore, increased long-term conflict biases the LC to the tonic mode so as to increase response volatility. [sent-159, score-0.962]

82 (A) The full model includes a conflict detection unit, K, and a reward rate measure, R, which combine to modify activity in the LC. [sent-161, score-0.933]

83 The LC modifies the gain in the integrator and response units. [sent-162, score-0.372]

84 (B) The benefit of including the LC in the model is insignificant when the response threshold is regularly crossed by noise alone, and hence when the error rate is high. [sent-163, score-0.406]

85 (C) However, when the threshold is greater and error rate lower, NE dramatically improves the rate at which the new reward contingencies are learned after reversal. [sent-164, score-0.514]

86 Reward rate, R, was updated at the end of every trial according to $R(T+1) = (1 - \lambda_R)\,R(T) + \lambda_R\,r$ (12), where r is the reward earned on the Tth trial. [sent-165, score-0.376]
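
Equation 12 is a standard exponential moving average of obtained reward; a one-line sketch follows (the value of the decay parameter is an assumption).

```python
lam_R = 0.1  # assumed value of lambda_R

def update_reward_rate(R, r):
    """Eq. 12: R(T+1) = (1 - lambda_R) * R(T) + lambda_R * r."""
    return (1.0 - lam_R) * R + lam_R * r

R = 0.0
for r in [1.0, 1.0, -0.1, 0.0, 1.0]:
    R = update_reward_rate(R, r)
print(R)
```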

87 Increased reward rate was assumed to bias the LC to phasic firing. [sent-166, score-0.507]

88 Reward rate, short-term conflict, and long-term conflict updated C according to $C = \sigma(K_S)\,(1 - \sigma(K_L))\,\sigma(R)$. [sent-168, score-0.448]
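
A sketch of this combination rule, reading the garbled symbols in the extracted equation as sigmoids applied to short-term conflict, long-term conflict, and reward rate; the sigmoid shape and parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x, slope=4.0, center=0.5):
    """Assumed squashing function mapping each quantity into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-slope * (x - center)))

def update_C(K_S, K_L, R):
    """C = sigma(K_S) * (1 - sigma(K_L)) * sigma(R): transient conflict and a high reward
    rate push the LC toward phasic mode (C near 1), while sustained long-term conflict
    pushes it toward tonic mode (C near 0), favoring exploration after a reversal."""
    return sigmoid(K_S) * (1.0 - sigmoid(K_L)) * sigmoid(R)

print(update_C(K_S=0.8, K_L=0.2, R=0.9))  # exploit: relatively high C, phasic-like LC
print(update_C(K_S=0.8, K_L=0.9, R=0.1))  # explore: low C, tonic-like LC
```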

89 While it is impossible to compare the output of our model with monkey behavior, we can make the qualitative assertion that, as with monkeys, our NE-based annealing model allows for high accuracy (and high threshold) decision-making while preserving adaptability to changes in reward contingencies. [sent-175, score-0.397]

90 In order to better demonstrate this improvement, we fit single exponential curves to the plots of probability of accurately responding to the new target by trial number (as in Figure 3B,C). [sent-176, score-0.227]

91 As can be seen, the model with NE-mediated annealing maintains a relatively fast discovery time even as the threshold becomes relatively large. [sent-178, score-0.22]

92 5 Discussion We have demonstrated that a model incorporating behavioral and learning effects previously ascribed to DA and NE produces an adaptive mechanism for switching between exploratory and exploitative decision-making. [sent-180, score-0.337]

93 Our model uses measures of response conflict and reward rate to modify LC firing mode, and hence to change network dynamics in favor of more or less volatile behavior. [sent-181, score-1.336]

94 The primary limitation is that the model varies between more or less volatile action selection only over the range of reward relevant to our studied task. [sent-184, score-0.353]

95 Model parameters could be altered on a task-by-task basis to correct this; however, a more general scheme may be accomplished with a mean reward learning algorithm [13]. [sent-185, score-0.28]

96 It has previously been argued that DA neurons may actually emit an average reward TD error [14]. [sent-186, score-0.375]

97 This change may require allowing both short- and long-term reward rate to control the LC firing mode (Eq. [sent-187, score-0.757]

98 As the number of alternatives increases, rapid learning something akin to reward bonuses [4,5]. [sent-191, score-0.314]

99 (1997) Conditioned responses of monkey locus coeruleus neurons anticipate acquisition of discriminative behavior in a vigilance task. [sent-273, score-0.267]

100 (2002) Long-term reward prediction in TD models of the dopamine system. [sent-299, score-0.362]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lc', 0.516), ('conflict', 0.411), ('reward', 0.28), ('firing', 0.25), ('da', 0.204), ('response', 0.178), ('phasic', 0.177), ('integrator', 0.152), ('mode', 0.15), ('tonic', 0.139), ('units', 0.134), ('exploitative', 0.12), ('activity', 0.104), ('responding', 0.083), ('dopamine', 0.082), ('responses', 0.076), ('acc', 0.074), ('cohen', 0.072), ('neurons', 0.071), ('ofc', 0.068), ('reversals', 0.068), ('exploratory', 0.065), ('threshold', 0.065), ('target', 0.061), ('crossed', 0.06), ('gilzenrat', 0.06), ('trial', 0.059), ('increased', 0.057), ('monkeys', 0.057), ('ks', 0.054), ('stimulus', 0.054), ('schizophrenia', 0.051), ('rate', 0.05), ('behavioral', 0.048), ('distractor', 0.048), ('locus', 0.048), ('exploration', 0.046), ('pathway', 0.045), ('coeruleus', 0.045), ('rajkowski', 0.045), ('volatile', 0.045), ('gain', 0.042), ('contingencies', 0.041), ('presentations', 0.041), ('baseline', 0.039), ('rewarded', 0.038), ('updated', 0.037), ('dayan', 0.037), ('modify', 0.037), ('unit', 0.035), ('weights', 0.035), ('increases', 0.035), ('bar', 0.035), ('adhd', 0.034), ('bonuses', 0.034), ('efferent', 0.034), ('wyi', 0.034), ('significantly', 0.034), ('annealing', 0.034), ('colleagues', 0.031), ('el', 0.031), ('discovery', 0.031), ('relatively', 0.031), ('measures', 0.03), ('mediate', 0.03), ('behaviorally', 0.03), ('diminishing', 0.03), ('disorders', 0.03), ('montague', 0.03), ('switching', 0.029), ('reinforcement', 0.029), ('model', 0.028), ('resulted', 0.028), ('td', 0.028), ('greater', 0.028), ('salient', 0.027), ('biases', 0.027), ('change', 0.027), ('utilities', 0.027), ('interplay', 0.027), ('monkey', 0.027), ('vertical', 0.027), ('regularly', 0.025), ('switched', 0.025), ('tth', 0.025), ('inhibitory', 0.025), ('presentation', 0.025), ('horizontal', 0.025), ('cognitive', 0.024), ('improvement', 0.024), ('defined', 0.024), ('contingency', 0.024), ('fit', 0.024), ('previously', 0.024), ('inputs', 0.024), ('produces', 0.023), ('detection', 0.023), ('vx', 0.023), ('ne', 0.022), ('adapt', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

Author: Samuel M. McClure, Mark S. Gilzenrat, Jonathan D. Cohen


2 0.27154493 141 nips-2005-Norepinephrine and Neural Interrupts

Author: Peter Dayan, Angela J. Yu

Abstract: Experimental data indicate that norepinephrine is critically involved in aspects of vigilance and attention. Previously, we considered the function of this neuromodulatory system on a time scale of minutes and longer, and suggested that it signals global uncertainty arising from gross changes in environmental contingencies. However, norepinephrine is also known to be activated phasically by familiar stimuli in welllearned tasks. Here, we extend our uncertainty-based treatment of norepinephrine to this phasic mode, proposing that it is involved in the detection and reaction to state uncertainty within a task. This role of norepinephrine can be understood through the metaphor of neural interrupts. 1

3 0.23692279 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

Author: Yael Niv, Nathaniel D. Daw, Peter Dayan

Abstract: Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine. 1

4 0.11063558 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

5 0.082883224 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1

6 0.077885695 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions

7 0.077560559 109 nips-2005-Learning Cue-Invariant Visual Responses

8 0.076918781 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

9 0.070040256 153 nips-2005-Policy-Gradient Methods for Planning

10 0.069910631 149 nips-2005-Optimal cue selection strategy

11 0.069668829 78 nips-2005-From Weighted Classification to Policy Search

12 0.066524133 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models

13 0.058879044 99 nips-2005-Integrate-and-Fire models with adaptation are good enough

14 0.057446815 181 nips-2005-Spiking Inputs to a Winner-take-all Network

15 0.054686073 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

16 0.054051142 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

17 0.049273051 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity

18 0.048586745 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

19 0.047828138 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

20 0.047171924 118 nips-2005-Learning in Silicon: Timing is Everything


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, -0.156), (2, 0.129), (3, 0.032), (4, -0.014), (5, 0.112), (6, 0.003), (7, -0.181), (8, -0.103), (9, 0.145), (10, 0.049), (11, -0.152), (12, -0.03), (13, -0.104), (14, -0.094), (15, 0.023), (16, 0.117), (17, 0.001), (18, 0.066), (19, -0.256), (20, 0.05), (21, 0.2), (22, 0.061), (23, 0.051), (24, -0.062), (25, 0.039), (26, -0.206), (27, 0.031), (28, -0.056), (29, -0.015), (30, -0.042), (31, -0.176), (32, -0.176), (33, -0.07), (34, -0.048), (35, 0.017), (36, -0.06), (37, 0.005), (38, -0.012), (39, 0.013), (40, 0.016), (41, -0.052), (42, -0.065), (43, 0.032), (44, -0.119), (45, -0.055), (46, -0.012), (47, -0.094), (48, -0.083), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96585262 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

Author: Samuel M. McClure, Mark S. Gilzenrat, Jonathan D. Cohen


2 0.87905431 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

Author: Yael Niv, Nathaniel D. Daw, Peter Dayan

Abstract: Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine. 1

3 0.76041311 141 nips-2005-Norepinephrine and Neural Interrupts

Author: Peter Dayan, Angela J. Yu

Abstract: Experimental data indicate that norepinephrine is critically involved in aspects of vigilance and attention. Previously, we considered the function of this neuromodulatory system on a time scale of minutes and longer, and suggested that it signals global uncertainty arising from gross changes in environmental contingencies. However, norepinephrine is also known to be activated phasically by familiar stimuli in welllearned tasks. Here, we extend our uncertainty-based treatment of norepinephrine to this phasic mode, proposing that it is involved in the detection and reaction to state uncertainty within a task. This role of norepinephrine can be understood through the metaphor of neural interrupts. 1

4 0.3697511 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

Author: Rory Sayres, David Ress, Kalanit Grill-spector

Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1

5 0.35010642 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

6 0.34878641 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

7 0.33744407 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

8 0.29276994 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

9 0.28623003 149 nips-2005-Optimal cue selection strategy

10 0.27573586 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions

11 0.24490269 153 nips-2005-Policy-Gradient Methods for Planning

12 0.24040513 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

13 0.22569953 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches

14 0.22451574 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

15 0.22315311 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

16 0.22278704 203 nips-2005-Visual Encoding with Jittering Eyes

17 0.22132462 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

18 0.21699736 109 nips-2005-Learning Cue-Invariant Visual Responses

19 0.20494218 121 nips-2005-Location-based activity recognition

20 0.2024125 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.026), (10, 0.045), (27, 0.034), (31, 0.054), (34, 0.046), (39, 0.03), (55, 0.024), (57, 0.012), (60, 0.347), (65, 0.085), (69, 0.054), (73, 0.013), (88, 0.067), (91, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89368093 134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1

Author: Jim Wielaard, Paul Sajda

Abstract: Based on a large scale spiking neuron model of the input layers 4Cα and β of macaque, we identify neural mechanisms for the observed contrast dependent receptive field size of V1 cells. We observe a rich variety of mechanisms for the phenomenon and analyze them based on the relative gain of excitatory and inhibitory synaptic inputs. We observe an average growth in the spatial extent of excitation and inhibition for low contrast, as predicted from phenomenological models. However, contrary to phenomenological models, our simulation results suggest this is neither sufficient nor necessary to explain the phenomenon.

same-paper 2 0.85152537 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

Author: Samuel M. McClure, Mark S. Gilzenrat, Jonathan D. Cohen

Abstract: We propose a model by which dopamine (DA) and norepinepherine (NE) combine to alternate behavior between relatively exploratory and exploitative modes. The model is developed for a target detection task for which there is extant single neuron recording data available from locus coeruleus (LC) NE neurons. An exploration-exploitation trade-off is elicited by regularly switching which of the two stimuli are rewarded. DA functions within the model to change synaptic weights according to a reinforcement learning algorithm. Exploration is mediated by the state of LC firing, with higher tonic and lower phasic activity producing greater response variability. The opposite state of LC function, with lower baseline firing rate and greater phasic responses, favors exploitative behavior. Changes in LC firing mode result from combined measures of response conflict and reward rate, where response conflict is monitored using models of anterior cingulate cortex (ACC). Increased long-term response conflict and decreased reward rate, which occurs following reward contingency switch, favors the higher tonic state of LC function and NE release. This increases exploration, and facilitates discovery of the new target. 1 In t rod u ct i on A central problem in reinforcement learning is determining how to adaptively move between exploitative and exploratory behaviors in changing environments. We propose a set of neurophysiologic mechanisms whose interaction may mediate this behavioral shift. Empirical work on the midbrain dopamine (DA) system has suggested that this system is particularly well suited for guiding exploitative behaviors. This hypothesis has been reified by a number of studies showing that a temporal difference (TD) learning algorithm accounts for activity in these neurons in a wide variety of behavioral tasks [1,2]. DA release is believed to encode a reward prediction error signal that acts to change synaptic weights relevant for producing behaviors [3]. Through learning, this allows neural pathways to predict future expected reward through the relative strength of their synaptic connections [1]. Decision-making procedures based on these value estimates are necessarily greedy. Including reward bonuses for exploratory choices supports non-greedy actions [4] and accounts for additional data derived from DA neurons [5]. We show that combining a DA learning algorithm with models of response conflict detection [6] and NE function [7] produces an effective annealing procedure for alternating between exploration and exploitation. NE neurons within the LC alternate between two firing modes [8]. In the first mode, known as the phasic mode, NE neurons fire at a low baseline rate but have relatively robust phasic responses to behaviorally salient stimuli. The second mode, called the tonic mode, is associated with a higher baseline firing and absent or attenuated phasic responses. The effects of NE on efferent areas are modulatory in nature, and are well captured as a change in the gain of efferent inputs so that neuronal responses are potentiated in the presence of NE [9]. Thus, in phasic mode, the LC provides transient facilitation in processing, time-locked to the presence of behaviorally salient information in motor or decision areas. Conversely, in tonic mode, higher overall LC discharge rate increases gain generally and hence increases the probability of arbitrary responding. Consistent with this account, for periods when NE neurons are in the phasic mode, monkey performance is nearly perfect. 
However, when NE neurons are in the tonic mode, performance is more erratic, with increased response times and error rate [8]. These findings have led to a recent characterization of the LC as a dynamic temporal filter, adjusting the system's relative responsivity to salient and irrelevant information [8]. In this way, the LC is ideally positioned to mediate the shift between exploitative and exploratory behavior.

The parameters that underlie changes in LC firing mode remain largely unexplored. Based on data from a target detection task by Aston-Jones and colleagues [10], we propose that LC firing mode is determined in part by measures of response conflict and reward rate as calculated by the ACC and orbitofrontal cortex (OFC), respectively [8]. Together, the ACC and OFC are the principal sources of cortical input to the LC [8]. Activity in the ACC is known, largely through human neuroimaging experiments, to change in accord with response conflict [6]. In brief, relatively equal activity in competing behavioral responses (reflecting uncertainty) produces high conflict. Low conflict results when one behavioral response predominates. We propose that increased long-term response conflict biases the LC towards a tonic firing mode. Increased conflict necessarily follows changes in reward contingency. As the previously rewarded target no longer produces reward, there will be a relative increase in response ambiguity and hence conflict. This relationship between conflict and LC firing is analogous to other modeling work [11], which proposes that increased tonic firing reflects increased environmental uncertainty. As a final component to our model, we hypothesize that the OFC maintains an ongoing estimate of reward rate, and that this estimate also influences LC firing mode. As reward rate increases, we assume that the OFC tends to bias the LC in favor of phasic firing to target stimuli.

We have aimed to fix model parameters based on previous work using simpler networks. We use parameters derived primarily from a previous model of the LC by Gilzenrat and colleagues [7]. Integration of response conflict by the ACC and its influence on LC firing were borrowed from unpublished work by Gilzenrat and colleagues in which they fit human behavioral data in a diminishing-utilities task. Given this approach, we interpret our observed improvement in model performance with combined NE and DA function as validation of a mechanism for automatically switching between exploitative and exploratory action selection.

2 Go-No-Go Task and Core Model

We have modeled an experiment in which monkeys performed a target detection task [10]. In the task, monkeys were shown either a vertical bar or a horizontal bar and were required to make or omit a motor response appropriately. Initially, the vertical bar was the target stimulus, and responding correctly was rewarded with a squirt of fruit juice (r = 1 in the model). Responding to the non-target horizontal stimulus resulted in a timeout punishment (r = -0.1; Figure 1A). Omitting a response to either the target or the non-target gave zero reward. After the monkeys had fully acquired the task, the experimenters periodically switched the reward contingency such that the previously rewarded stimulus (target) became the distractor, and vice versa. Following such reversals, LC neurons were observed to change from emitting phasic bursts of firing to the target, to tonic firing after the switch, and then slowly back to phasic firing for the new target as the new response criterion was acquired [10].
Figure 1: Task and model design. (A) Responses were required for targets in order to obtain reward. Responses to distractors resulted in a minor punishment. No responses gave zero reward. (B) In the model, vertical and horizontal bar inputs (I1 and I2) fed to integrator neurons (X1 and X2), which then drove response units (Y1 and Y2). Responses were made if Y1 or Y2 crossed a threshold while input units were active.

We have previously modeled this task [7,12] with a three-layer connectionist network in which two input units, I1 and I2, corresponding to the vertical and horizontal bars, drive two mutually inhibitory integrator units, X1 and X2. The integrator units subsequently feed two response units, Y1 and Y2 (Figure 1B). Responses are made whenever output from Y1 or Y2 crosses a threshold level of activity, θ. Relatively weak cross connections from each input unit to the opposite integrator unit (I1 to X2 and I2 to X1) are intended to model stimulus similarity. Both the integrator and response units were modeled as noisy, leaky accumulators: Ẋ_i = …
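The accumulator equation is cut off at this point in the listing. As a hedged reconstruction rather than the authors' exact parameterization, a standard noisy leaky-accumulator form consistent with the architecture just described (mutually inhibitory integrators, weak cross inputs, a sigmoid activation whose gain NE can modulate) would be:

```latex
% Sketch only: the weights w_*, time constant \tau, gain g, bias b, and noise terms
% are placeholders inferred from the description above, not the paper's fitted values.
\begin{align}
\tau \dot{X}_1 &= -X_1 + w_I I_1 + w_c I_2 - w_X f(X_2) + \xi_1(t) \\
\tau \dot{X}_2 &= -X_2 + w_I I_2 + w_c I_1 - w_X f(X_1) + \xi_2(t) \\
\tau \dot{Y}_i &= -Y_i + w_Y f(X_i) + \eta_i(t), \qquad
  f(x) = \frac{1}{1 + e^{-g\,(x - b)}}
\end{align}
```

In this reading, a response is emitted whenever Y1 or Y2 exceeds the threshold θ while an input is active, matching the description in Figure 1; NE is assumed here to act through the gain parameter g.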

3 0.7732445 29 nips-2005-Analyzing Coupled Brain Sources: Distinguishing True from Spurious Interaction

Author: Guido Nolte, Andreas Ziehe, Frank Meinecke, Klaus-Robert Müller

Abstract: When trying to understand the brain, it is of fundamental importance to analyse (e.g. from EEG/MEG measurements) which parts of the cortex interact with each other in order to infer more accurate models of brain activity. Common techniques like Blind Source Separation (BSS) can estimate brain sources and single out artifacts by using the underlying assumption of source signal independence. However, physiologically interesting brain sources typically interact, so BSS will, by construction, fail to characterize them properly. Noting that there are truly interacting sources and signals that only seemingly interact due to effects of volume conduction, this work aims to contribute by distinguishing these effects. For this, a new BSS technique is proposed that uses anti-symmetrized cross-correlation matrices and subsequent diagonalization. The resulting decomposition consists of the truly interacting brain sources and suppresses any spurious interaction stemming from volume conduction. Our new concept of interacting source analysis (ISA) is successfully demonstrated on MEG data.
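The key observation, that purely instantaneous volume conduction contributes only symmetrically to time-lagged cross-statistics, can be sketched as below. This simplified example anti-symmetrizes lagged cross-covariance matrices and omits the joint-diagonalization step of the actual ISA method; the toy mixing setup is an assumption for illustration only.

```python
import numpy as np

def antisymmetrized_lagged_covariances(X, lags=(1, 2, 3)):
    """X: (channels, samples). For each lag tau, return the anti-symmetric part of the
    lagged cross-covariance C(tau). Instantaneous (volume-conduction-like) mixing of
    independent sources yields only symmetric C(tau), so a non-zero anti-symmetric
    part points to genuinely time-lagged interactions between sources."""
    X = X - X.mean(axis=1, keepdims=True)
    out = []
    for tau in lags:
        C = X[:, :-tau] @ X[:, tau:].T / (X.shape[1] - tau)
        out.append(0.5 * (C - C.T))
    return out

# Toy example: source s2 is a delayed copy of s1 (a true interaction at lag 3); s3 is
# independent. All three are mixed instantaneously into 5 "sensors".
rng = np.random.default_rng(2)
s1 = rng.normal(size=5000)
s2 = np.roll(s1, 3) + 0.1 * rng.normal(size=5000)
s3 = rng.normal(size=5000)
A = rng.normal(size=(5, 3))                       # instantaneous mixing matrix
sensors = A @ np.vstack([s1, s2, s3])
for tau, C_anti in zip((1, 2, 3), antisymmetrized_lagged_covariances(sensors)):
    print(f"lag {tau}: norm of anti-symmetric part = {np.linalg.norm(C_anti):.3f}")
```

Only the lag carrying the true interaction produces a sizeable anti-symmetric part; in the full method, jointly diagonalizing these matrices is what recovers the interacting source subspace.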

4 0.6043151 138 nips-2005-Non-Local Manifold Parzen Windows

Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent

Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
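For orientation, the sketch below implements only the local Manifold Parzen baseline that this paper extends: one Gaussian per training point, with covariance aligned to the locally estimated tangent directions plus isotropic noise. The non-local part, where a learned function predicts the tangent plane so that distant examples can inform the estimate, is not implemented here, and all parameter choices are illustrative assumptions.

```python
import numpy as np

def manifold_parzen_logpdf(train, queries, k=10, d=1, sigma2=0.01):
    """Log-density of a mixture with one Gaussian per training point. Each covariance
    keeps the top-d principal directions of the point's k nearest neighbours (the
    estimated local tangent) and adds isotropic noise sigma2 in all directions."""
    n, D = train.shape
    logps = np.empty((len(queries), n))
    for i, x in enumerate(train):
        idx = np.argsort(np.sum((train - x) ** 2, axis=1))[1:k + 1]
        _, S, Vt = np.linalg.svd((train[idx] - x) / np.sqrt(k), full_matrices=False)
        C = sigma2 * np.eye(D)
        for j in range(min(d, len(S))):
            C += (S[j] ** 2) * np.outer(Vt[j], Vt[j])
        sign, logdet = np.linalg.slogdet(C)
        diff = queries - x
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(C), diff)
        logps[:, i] = -0.5 * (maha + logdet + D * np.log(2 * np.pi))
    return np.logaddexp.reduce(logps, axis=1) - np.log(n)   # equal mixture weights 1/n

# Toy example: training points near a circle (a 1-D manifold in 2-D); a query on the
# circle should receive a much higher log-density than one well off it.
rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, 200)
train = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(200, 2))
print("on-manifold: ", manifold_parzen_logpdf(train, np.array([[1.0, 0.0]]))[0])
print("off-manifold:", manifold_parzen_logpdf(train, np.array([[0.5, 0.5]]))[0])
```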

5 0.5053578 141 nips-2005-Norepinephrine and Neural Interrupts

Author: Peter Dayan, Angela J. Yu

Abstract: Experimental data indicate that norepinephrine is critically involved in aspects of vigilance and attention. Previously, we considered the function of this neuromodulatory system on a time scale of minutes and longer, and suggested that it signals global uncertainty arising from gross changes in environmental contingencies. However, norepinephrine is also known to be activated phasically by familiar stimuli in well-learned tasks. Here, we extend our uncertainty-based treatment of norepinephrine to this phasic mode, proposing that it is involved in the detection and reaction to state uncertainty within a task. This role of norepinephrine can be understood through the metaphor of neural interrupts.

6 0.45323297 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

7 0.44365567 181 nips-2005-Spiking Inputs to a Winner-take-all Network

8 0.44054246 109 nips-2005-Learning Cue-Invariant Visual Responses

9 0.4356342 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions

10 0.40648407 203 nips-2005-Visual Encoding with Jittering Eyes

11 0.40055901 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

12 0.39492136 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models

13 0.39116943 183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity

14 0.38777554 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

15 0.38472804 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

16 0.37740606 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

17 0.37552896 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing

18 0.37237149 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

19 0.3714335 99 nips-2005-Integrate-and-Fire models with adaptation are good enough

20 0.37047404 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception