nips nips2000 nips2000-43 knowledge-graph by maker-knowledge-mining

43 nips-2000-Dopamine Bonuses


Source: pdf

Author: Sham Kakade, Peter Dayan

Abstract: Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. [sent-8, score-0.566]

2 However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. [sent-9, score-0.213]

3 We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. [sent-10, score-1.156]

4 We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. [sent-11, score-0.231]

5 1 Introduction Much evidence suggests that dopamine cells in the primate midbrain play an important role in reward and action learning. [sent-12, score-0.671]

6 Electrophysiological studies support a theory that DA cells signal a global prediction error for summed future reward in appetitive conditioning tasks (Montague et al, 1996; Schultz et al, 1997), in the form of a temporal difference prediction error term. [sent-13, score-0.893]

7 Appetitive prediction learning is associated with classical conditioning, the task of learning which stimuli are associated with reward; appetitive action learning is associated with instrumental conditioning, the task of learning actions that result in reward delivery. [sent-15, score-0.708]

8 The computational role of dopamine in reward learning is controversial for two main reasons (Ikemoto & Panksepp, 1999; Redgrave et al, 1999). [sent-16, score-0.596]

9 First, stimuli that are not associated with reward prediction are known to activate the dopamine system persistently, including in particular stimuli that are novel and salient, or that physically resemble other stimuli that do predict reward (Schultz, 1998). [sent-17, score-1.216]

10 Second, dopamine release is associated with a set of motor effects, such as species- and stimulus-specific approach behaviors, that seem either irrelevant or detrimental to the delivery of reward. [sent-18, score-0.353]

11 In this paper, we study this apparently anomalous activation of the DA system, suggesting that it multiplexes information about bonuses, potentially including exploration bonuses (Sutton, 1990; Dayan & Sejnowski, 1996) and shaping bonuses (Ng et al, 1999), on top of reward prediction errors. [sent-20, score-1.633]

12 These responses are associated with the unconditional effects of DA, and are part of an attentional system. [sent-21, score-0.283]

13 Figure 1: Activity of individual DA neurons - though substantial data suggest the homogeneous character of these responses (Schultz, 1998). [sent-73, score-0.081]

14 Adapted from Schultz et al (1990, 1992, & 1993) and Jacobs et al (1997). [sent-78, score-0.182]

15 2 DA Activity Figure 1 shows three different types of dopamine responses that have been observed by Schultz et al and Jacobs et al. [sent-79, score-0.437]

16 Figures 1A;B show the response to a conditioned stimulus that becomes predictive of reward (CS+). [sent-80, score-0.494]

17 Here, in early trials (figure 1A), there is no response, or only a weak one, to the CS+, but a strong response just after the time of delivery of the reward. [sent-81, score-0.282]

18 In later trials (figure 1B), after learning is complete (but before overtraining), the DA cells are activated in response to the stimulus, and fire at background rates to the reward. [sent-82, score-0.24]

19 Indeed, if the reward is omitted, there is depression of DA activity at just the time at which, in early trials, the reward used to excite the cells. [sent-83, score-0.615]

20 Under the model, the cells report the temporal difference (TD) error for reward, ie the difference between the amount of reward that is delivered and the amount that is expected. [sent-85, score-0.547]

21 Let r(t) be the amount of reward received at time t and v(t) be the prediction of the sum total (undiscounted) reward to be delivered in a trial after time t: v(t) ≈ Σ_{τ≥0} r(τ + t).  (1) [sent-86, score-0.954]

22 The TD component of the dopamine activity is the prediction error: δ(t) = r(t) + v(t + 1) − v(t)  (2), which uses r(t) + v(t + 1) as an estimate of Σ_{τ≥0} r(τ + t), so that the TD error is an estimate of Σ_{τ≥0} r(τ + t) − v(t). [sent-87, score-0.422]
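
To make equation 2 concrete, here is a minimal sketch of how the undiscounted TD errors over one trial could be computed, assuming a tabular value function over within-trial timesteps; the arrays and example values below are illustrative, not taken from the paper:

```python
import numpy as np

def td_errors(r, v):
    """Undiscounted TD prediction errors delta(t) = r(t) + v(t+1) - v(t).

    r : rewards r(t) over one trial
    v : predictions v(t); padded with 0 after the trial ends, matching v(t_end) = 0
    """
    v_pad = np.append(v, 0.0)
    return r + v_pad[1:] - v_pad[:-1]

# Toy trial: reward of 1 at t = 5; after learning, a CS presented at t = 2 predicts it.
r = np.zeros(10); r[5] = 1.0
v = np.zeros(10); v[2:6] = 1.0
print(td_errors(r, v))   # delta is positive at the step into the predictive state
                         # and zero at the fully predicted reward
```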

23 An MDP consists of states, actions, transition probabilities between states under the chosen action, and the rewards associated with these transitions. [sent-90, score-0.065]

24 The goal of the subject solving an MDP is to find a policy (a choice of actions in each state) so as to optimize the sum total reward it receives. [sent-91, score-0.338]
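
For readers unfamiliar with the formalism, a toy tabular MDP and value iteration might look like the sketch below; the transition matrix, rewards and discount factor are invented for illustration (the paper itself works with undiscounted within-trial sums):

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[a, s, s'] are transition probabilities,
# R[a, s] is the expected reward for taking action a in state s.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
gamma = 0.9                        # discounted here so the iteration converges

V = np.zeros(3)
for _ in range(100):               # value iteration
    Q = R + gamma * (P @ V)        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
    V = Q.max(axis=0)
policy = Q.argmax(axis=0)          # greedy policy: best action in each state
print(V, policy)
```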

25 Figures 1C;D show that reporting a prediction error for reward does not exhaust the behavioral repertoire of the DA cells. [sent-93, score-0.456]

26 The dominant effect is that there is a phasic activation of dopamine cells followed by a phasic inhibition, both locked to the stimulus. [sent-95, score-0.598]

27 These novelty responses decrease over trials, but quite slowly for very salient stimuli (Schultz, 1998). [sent-96, score-0.425]

28 In some cases, particularly in early trials of appetitive learning (figure 1A, top), there seems to be little or no phasic inhibition of the cells following the activation. [sent-97, score-0.402]

29 Figure 1D shows what happens when a stimulus (door−) that resembles a reward-predicting stimulus (door+) is presented without reinforcement. [sent-98, score-0.326]

30 Again a phasic increase over baseline followed by a depression is seen (lower 1D). [sent-99, score-0.166]

31 However, unlike the case in figure 1B, there is no persistent reward prediction, since if a reward is subsequently delivered (unexpectedly), the cells become active (not shown) (Schultz, 1998). [sent-100, score-0.776]

32 3 Multiplexing and reward distortion The most critical issue is whether it is possible to reconcile the behavior of the DA cells seen in figures 1C;D with the putative computational role of DA in terms of reporting prediction error for reward. [sent-101, score-0.63]

33 Intuitively, these apparently anomalous responses are benign, that is, they do not interfere with the end point of normal reward learning, provided that they sum to zero over a trial. [sent-102, score-0.533]

34 If we sum the prediction error terms from equation 2, starting from the time of stimulus onset at t = 1, we get Σ_{t≥1} δ(t) = v(t_end) − v(1) + Σ_{t≥1} r(t), where t_end is the time at the end of the trial. [sent-104, score-0.352]

35 Assuming that v(t_end) = 0 and v(1) = 0, ie that the monkey confines its reward predictions to within a trial, we can see that any additional influences on δ(t) that sum to 0 preserve the predicted sum of future rewards. [sent-105, score-0.325]
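
A quick numerical check of this telescoping argument, with made-up reward and prediction vectors (nothing here is from the paper's simulations):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(10)                       # arbitrary within-trial rewards
v = np.zeros(10)
v[1:9] = rng.random(8)                   # arbitrary predictions, with v(1) = v(t_end) = 0

delta = r + np.append(v[1:], 0.0) - v    # delta(t) = r(t) + v(t+1) - v(t)
print(np.isclose(delta.sum(), r.sum()))  # True: the v terms telescope away

# A bonus b(t) that sums to zero over the trial leaves the total unchanged:
b = np.array([1.0, -1.0] + [0.0] * 8)
print(np.isclose((delta + b).sum(), r.sum()))   # still True
```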

36 From figure 1, this seems true of the majority of the extra responses, ie anomalous activation is canceled by anomalous inhibition, though it is not true of the uncancelled DA responses shown in figure 1A (upper). [sent-106, score-0.487]

37 Altogether, DA activity can still be used to learn predictions and choose actions - although it should not strictly be referred to solely in terms of prediction error for reward. [sent-107, score-0.24]

38 Apart from the issue of anomalous activation that is not canceled (figure 1A, upper), this leaves open two key questions: what drives the extra DA responses, and what effects do they have? [sent-108, score-0.297]

39 4 Novelty and Bonuses Three very different sorts of bonuses have been considered in reinforcement learning: novelty, shaping, and exploration bonuses. [sent-110, score-0.766]

40 The presence of the first two of these is suggested by the responses in figure 1. [sent-111, score-0.081]

41 Bonuses modify the reward signals and so change the course of learning. [sent-112, score-0.298]

42 They are mostly used to guide exploration of the world, and are typically heuristic ways of addressing the computationally intractable exploration-exploitation dilemma. [sent-113, score-0.092]

43 Figure 2: Activity of the DA system given novelty bonuses. [sent-116, score-0.648]

44 The plots show different aspects of the TD error δ as a function of time t within a trial (first three plots in each row) or as a function of number T of trials (last two). [sent-117, score-0.531]

45 Upper) A novelty signal was applied for just the first timestep of the stimulus and decayed hyperbolically with trial number as 1/T. [sent-118, score-0.735]

46 Lower) A novelty signal was applied for the first two timesteps of the stimulus and now decayed exponentially as exp(−0.3T) to demonstrate that the precise form of decay is irrelevant. [sent-119, score-0.548]

47 We first consider a novelty bonus, which we take as a model for uncancelled anomalous activity. [sent-123, score-0.373]

48 A novelty bonus is a value that is added to states or state-action pairs associated with their unfamiliarity - novelty is made intrinsically rewarding. [sent-124, score-0.765]

49 This is computationally reasonable, at least in moderation, and indeed it has become standard practice in reinforcement learning to use optimistic initial values for states to encourage systems to plan to get to novel or unfamiliar states. [sent-125, score-0.1]

50 In TD terms, this is like replacing the true environmental reward r(t) at time t with r(t) → r(t) + n(x(t), T), where x(t) is the state at time t and n(x(t), T) is the novelty of this state in trial T (an index we generally suppress). [sent-126, score-0.752]
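
A sketch of the kind of simulation behind figure 2: a tabular TD(0) learner over within-trial timesteps, with a novelty bonus on the first stimulus timesteps that decays across trials (the caption notes the precise decay form is irrelevant). The trial length, learning rate, the two-timestep bonus, and the choice to clamp the pre-stimulus prediction at 0 (because the stimulus arrives unpredictably) are our own modelling assumptions, meant only to reproduce the qualitative pattern described here:

```python
import numpy as np

n_steps, n_trials, lr = 10, 25, 0.3
v = np.zeros(n_steps + 1)                  # v[s]: prediction s timesteps after stimulus onset
deltas = np.zeros((n_trials, n_steps))     # TD error per within-trial transition, per trial

for trial in range(1, n_trials + 1):
    bonus = 1.0 / trial                    # novelty n(x, T), decaying across trials
    for s in range(n_steps):               # transition s -> s + 1
        r = bonus if s + 1 in (1, 2) else 0.0   # bonus on the first two stimulus timesteps
        delta = r + v[s + 1] - v[s]
        if s > 0:                          # v[0] (pre-stimulus) stays 0: nothing available
            v[s] += lr * delta             # before the stimulus can predict its onset
        deltas[trial - 1, s] = delta

print(deltas[[0, 4, 24], :4])              # early trial: purely positive burst; later trials:
                                           # a learned negative transient follows, then both decay
```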

51 Here, a novel stimulus is presented for 25 trials without there being any reward consequences. [sent-128, score-0.606]

52 The effect is just a positive signal which decreases over time. [sent-129, score-0.097]

53 Learning has no effect on this, since the stimulus cannot predict away a novelty signal that lasts only a single timestep. [sent-130, score-0.545]

54 The lower plots in figure 2 show that it is possible to get partial apparent cancellation through learning, if the novelty signal is applied for the first two timesteps of a stimulus (for instance if the novelty signal is calculated relatively slowly). [sent-131, score-0.943]

55 In this case, the initial effect is just a positive signal (leftmost graph); TD learning then gives it a negative transient after a few trials (second plot); and finally, as the novelty signal decays to 0, the effect goes away (third plot). [sent-132, score-0.618]

56 The right-hand plots show how δ(t) behaves across trials. [sent-133, score-0.091]

57 The depression of the DA signal comes from the decay of the novelty bonuses. [sent-135, score-0.383]

58 Novelty bonuses are true bonuses in the sense that they actually distort the reward function. [sent-136, score-0.999]

59 This property makes them useful, for instance, in actually distorting the optimal policy in Markov decision problems to ensure that exploration is planned and executed in favor of exploitation. [sent-138, score-0.136]

60 Figure 3: Activity of the DA system given shaping bonuses (in the same format as figure 2). [sent-140, score-1.101]

61 Upper) The plots show different aspects of the TD error δ as a function of time t within a trial (first three plots) or as a function of number T of trials (last two). [sent-141, score-0.44]

62 Here, the shaping bonus comes from a potential φ(t) = 1 for the first two timesteps a stimulus is presented within a trial (t = 1, 2), and 0 thereafter, irrespective of trial number. [sent-142, score-1.201]

63 Lower) The same plots with the learning rate set to 0. In answer to this concern, Ng et al (1999) invented the idea of non-distorting shaping bonuses. [sent-145, score-0.489]

64 Ng et al's shaping bonuses are guaranteed not to distort optimal policies, although they can still change the exploratory behavior of agents. [sent-146, score-0.726]

65 Shaping bonuses must remain constant for the guarantee about the policies to hold. [sent-148, score-0.351]

66 The upper plots in figure 3 show the effect of shaping bonuses on the TD error. [sent-149, score-0.799]

67 Here, the potential function is set to the value 1 for the first two time steps of a stimulus in a trial, and 0 otherwise. [sent-150, score-0.204]
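
A sketch of a potential-based shaping bonus in the style of Ng et al (1999), plugged into the same kind of toy TD learner as above; the potential values, trial length and learning rate are illustrative choices of ours, intended to show the first-trial transient and its eventual cancellation:

```python
import numpy as np

n_steps, n_trials, lr = 10, 25, 0.3
phi = np.zeros(n_steps + 1)
phi[1:3] = 1.0                              # potential: 1 for the first two stimulus timesteps
v = np.zeros(n_steps + 1)
deltas = np.zeros((n_trials, n_steps))

for trial in range(n_trials):
    for s in range(n_steps):                # transition s -> s + 1; no real reward in this task
        bonus = phi[s + 1] - phi[s]         # potential-based shaping bonus (undiscounted)
        delta = bonus + v[s + 1] - v[s]
        if s > 0:                           # pre-stimulus prediction stays 0, as before
            v[s] += lr * delta
        deltas[trial, s] = delta

print(deltas[0, :4])                        # first trial: +1 then -1 transient, before any learning
print(deltas[-1, :4])                       # late trial: the constant bonus is predicted away (~0)
```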

68 The most significant difference between shaping and novelty bonuses is that the former exhibit a negative transient even in the very first trial, whereas, for the latter, it is a learned effect. [sent-151, score-0.909]

69 If the learning rate is non-zero, then shaping bonuses are exactly predicted away over the course of normal learning. [sent-152, score-0.698]

70 Thus, even though the same bonus is provided on trial 25 as on trial 1, the TD error becomes 0 since the shaping bonus is predicted away. [sent-153, score-1.293]

71 The dynamics of the decay shown in the last two plots is controlled by the learning rate for TD. [sent-154, score-0.091]

72 The lower plots show what happens if learning is switched off at the time the shaping bonus is provided - this would be the case if the system responsible for computing the bonus takes its effect before the inputs associated with the stimulus are plastic. [sent-155, score-1.303]

73 The final category of bonus is an ongoing exploration bonus (Sutton, 1990; Dayan & Sejnowski, 1996) which is used to ensure continued exploration. [sent-157, score-0.672]

74 Dayan & Sejnowski (1996) derived a bonus of this form from a model of environmental change that justifies the bonus. [sent-160, score-0.29]

75 There is no evidence for this sort of continuing exploration bonus in the dopamine data, perhaps not surprisingly, since the tasks undertaken by the monkey offer little possibility for any trade-off between exploration and exploitation. [sent-161, score-0.696]

77 Figure 4: Activity δ(t) of the dopamine system for partial predictability. [sent-164, score-0.388]

78 On each trial, an initial stimulus (presented at t = 3) is ambiguous as to whether a CS+ or a CS- is presented (each occurs equally often), and the ambiguity is perfectly resolved at t = 4. [sent-168, score-0.201]

79 Since the CS± comes at a random interval after the cue, the traces are stimulus locked to the relevant events. [sent-170, score-0.287]

80 5 Generalization Responses and Partial Observability Generalization responses (figure 1D) show a persistent effect of stimuli that merely resemble a rewarded stimulus. [sent-171, score-0.278]

81 One possibility is that this activity comes from a shaping bonus that is not learned away, as in the lower plots of figure 3. [sent-173, score-0.831]

82 This should lead to a partial activation of the DA system. [sent-176, score-0.098]

83 If the expectation is canceled by subsequent information about the stimulus (available, for instance, following a saccade), then the DA system will be inhibited below baseline exactly to nullify the earlier positive prediction. [sent-177, score-0.245]

84 If the expectation is confirmed, then there will be continued activity representing the difference between the value of the reward and the expected value given the ambiguous stimulus. [sent-178, score-0.461]

85 Figure 4 shows an example of this in a simplified case for which the animal receives information about the true stimulus over two timesteps: the first timestep is ambiguous to the tune of 50%; the second perfectly resolves the ambiguity. [sent-179, score-0.242]
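
The pattern just described can be reproduced with already-converged predictions in a few lines; the particular times (cue at t = 3, resolution at t = 4, reward at t = 6) and the unit reward are toy choices for illustration, not the paper's exact settings:

```python
import numpy as np

def trial_deltas(is_cs_plus, rewarded, T=10):
    """TD errors for one figure-4-style trial, using already-converged predictions."""
    v = np.zeros(T)
    v[3] = 0.5                        # ambiguous stimulus at t = 3: reward expected with p = 0.5
    if is_cs_plus:
        v[4:7] = 1.0                  # resolved as CS+ at t = 4: reward at t = 6 fully predicted
    r = np.zeros(T)
    if rewarded:
        r[6] = 1.0
    v_next = np.append(v[1:], 0.0)
    return r + v_next - v             # delta(t) = r(t) + v(t+1) - v(t); the burst for an event
                                      # at time t appears at t - 1 under this indexing

print(np.round(trial_deltas(True, True), 2))    # CS+, rewarded: +0.5 at the cue, +0.5 at
                                                # resolution, nothing at the predicted reward
print(np.round(trial_deltas(False, False), 2))  # CS-, unrewarded: +0.5 then -0.5
print(np.round(trial_deltas(False, True), 2))   # CS-, unexpectedly rewarded: +1 at the reward
```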

86 Figures 4A;B show CS+ trials, with and without the delivery of reward; figures 4C;D show CS- trials, without and with the delivery of reward. [sent-180, score-0.168]

87 Here, a cue light (c±) is provided indicating whether a CS+ or a CS- (d±) is to appear at a random later time, which in turn is followed (or not) after a fixed interval by a reward (r±). [sent-183, score-0.417]

88 DA cells show a generalization response to the cue light; then fire to the CS+ or are unaffected by the CS-; and finally do not respond to the appropriate presence or absence of the reward. [sent-184, score-0.179]

89 The DA response stimulus-locked to the CS+ arises because of the variability in the interval between the cue light and the CS+; if this interval were fixed, then the cells would only respond to the cue (c+), as in Schultz (1993). [sent-186, score-0.501]

90 6 Discussion We have suggested a set of interpretations for the activity of the DA system to add to that of reporting prediction error for reward. [sent-187, score-0.251]

91 The two theoretically most interesting features are novelty and shaping bonuses. [sent-188, score-0.526]

92 The former distort the reward function in such a way as to encourage exploration of new stimuli and new places. [sent-189, score-0.496]

93 The latter are non-distorting, and can be seen as being multiplexed by the DA system together with the prediction error signal. [sent-190, score-0.107]

94 Since shaping bonuses are not distorting, they have no ultimate effect on action choice. [sent-191, score-0.753]

95 However, the signal provided by the activation (and then cancellation) of DA can nevertheless have a significant neural effect. [sent-192, score-0.136]

96 For stimuli that actually predict rewards (and so cause an initial activation of the DA system), these behaviors are often called appetitive; for novel, salient, and potentially important stimuli that are not known to predict rewards, they allow the system to pay appropriate attention. [sent-194, score-0.335]

97 In the case of partial observability, DA release due to the uncertain prediction of reward directly causes further investigation, and therefore resolution of the uncertainty. [sent-196, score-0.444]

98 When unconditional and conditioned behaviors conflict, the former seem to dominate, as in the inability of animals to learn to run away from a stimulus in order to get food from it. [sent-197, score-0.388]

99 There is substantial circumstantial evidence that this might be one role for serotonin (which itself has unconditional effects associated with fear, fight, and flight responses that are opposite to those of DA), but there is not the physiological evidence to support or refute this possibility. [sent-199, score-0.35]

100 Understanding the interaction of dopamine and serotonin in terms of their conditioned and unconditioned effects is a major task for future work. [sent-200, score-0.301]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bonuses', 0.325), ('shaping', 0.307), ('reward', 0.298), ('bonus', 0.29), ('oa', 0.28), ('dopamine', 0.222), ('novelty', 0.219), ('schultz', 0.173), ('da', 0.17), ('stimulus', 0.163), ('trial', 0.153), ('td', 0.131), ('cs', 0.125), ('anomalous', 0.12), ('unconditional', 0.12), ('trials', 0.116), ('activity', 0.093), ('exploration', 0.092), ('plots', 0.091), ('cells', 0.091), ('appetitive', 0.085), ('timesteps', 0.085), ('responses', 0.081), ('stimuli', 0.079), ('dayan', 0.073), ('prediction', 0.068), ('phasic', 0.068), ('striatum', 0.068), ('depression', 0.067), ('sutton', 0.064), ('delivery', 0.059), ('ng', 0.055), ('cue', 0.055), ('delivered', 0.055), ('activation', 0.055), ('door', 0.053), ('canceled', 0.051), ('distort', 0.051), ('ikemoto', 0.051), ('panksepp', 0.051), ('reporting', 0.051), ('comes', 0.05), ('figures', 0.05), ('effect', 0.05), ('al', 0.048), ('signal', 0.047), ('salient', 0.046), ('sejnowski', 0.046), ('effects', 0.045), ('distorting', 0.044), ('locked', 0.044), ('partial', 0.043), ('et', 0.043), ('inhibition', 0.042), ('reinforcement', 0.042), ('time', 0.041), ('actions', 0.04), ('tj', 0.04), ('jacobs', 0.04), ('ventral', 0.04), ('montague', 0.04), ('conditioning', 0.04), ('behaviors', 0.04), ('error', 0.039), ('away', 0.039), ('ambiguous', 0.038), ('associated', 0.037), ('release', 0.035), ('provided', 0.034), ('decayed', 0.034), ('djor', 0.034), ('hyperbolically', 0.034), ('neurosciences', 0.034), ('observability', 0.034), ('persistent', 0.034), ('redgrave', 0.034), ('rewarded', 0.034), ('serotonin', 0.034), ('uncancelled', 0.034), ('role', 0.033), ('response', 0.033), ('difference', 0.032), ('baseline', 0.031), ('interval', 0.03), ('lc', 0.03), ('cancellation', 0.029), ('dorsal', 0.029), ('encourage', 0.029), ('sham', 0.029), ('novel', 0.029), ('rewards', 0.028), ('predict', 0.027), ('predicted', 0.027), ('action', 0.027), ('gat', 0.027), ('bertsekas', 0.027), ('upper', 0.026), ('former', 0.026), ('policies', 0.026), ('extra', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 43 nips-2000-Dopamine Bonuses

Author: Sham Kakade, Peter Dayan

Abstract: Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1

2 0.12856671 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

3 0.12606311 49 nips-2000-Explaining Away in Weight Space

Author: Peter Dayan, Sham Kakade

Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.

4 0.10400096 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

Author: Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis

Abstract: A system has been developed to extract diagnostic information from jet engine carcass vibration data. Support Vector Machines applied to novelty detection provide a measure of how unusual the shape of a vibration signature is, by learning a representation of normality. We describe a novel method for Support Vector Machines of including information from a second class for novelty detection and give results from the application to Jet Engine vibration analysis.

5 0.10179531 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code

Author: Adrienne L. Fairhall, Geoffrey D. Lewen, William Bialek, Robert R. de Ruyter van Steveninck

Abstract: Many neural systems extend their dynamic range by adaptation. We examine the timescales of adaptation in the context of dynamically modulated rapidly-varying stimuli, and demonstrate in the fly visual system that adaptation to the statistical ensemble of the stimulus dynamically maximizes information transmission about the time-dependent stimulus. Further, while the rate response has long transients, the adaptation takes place on timescales consistent with optimal variance estimation.

6 0.09293697 40 nips-2000-Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli

7 0.09221936 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

8 0.086815007 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

9 0.08054357 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

10 0.077608593 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

11 0.076873265 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

12 0.073083065 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

13 0.070916533 4 nips-2000-A Linear Programming Approach to Novelty Detection

14 0.07081981 48 nips-2000-Exact Solutions to Time-Dependent MDPs

15 0.065486513 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning

16 0.061197586 8 nips-2000-A New Model of Spatial Representation in Multimodal Brain Areas

17 0.059521563 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

18 0.059033453 141 nips-2000-Universality and Individuality in a Neural Code

19 0.057683613 42 nips-2000-Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics

20 0.055963185 113 nips-2000-Robust Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.157), (1, -0.115), (2, -0.046), (3, -0.18), (4, -0.176), (5, 0.042), (6, -0.002), (7, 0.019), (8, 0.07), (9, -0.032), (10, -0.036), (11, 0.075), (12, -0.061), (13, 0.109), (14, 0.153), (15, -0.081), (16, -0.109), (17, -0.012), (18, -0.114), (19, 0.003), (20, 0.098), (21, 0.159), (22, 0.156), (23, -0.14), (24, 0.058), (25, 0.099), (26, -0.086), (27, 0.048), (28, 0.164), (29, 0.078), (30, -0.005), (31, -0.003), (32, 0.009), (33, -0.098), (34, 0.031), (35, 0.019), (36, 0.094), (37, 0.105), (38, 0.067), (39, -0.033), (40, 0.067), (41, 0.11), (42, 0.009), (43, 0.03), (44, -0.029), (45, 0.118), (46, -0.011), (47, -0.134), (48, 0.123), (49, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96711898 43 nips-2000-Dopamine Bonuses

Author: Sham Kakade, Peter Dayan

Abstract: Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1

2 0.55053568 49 nips-2000-Explaining Away in Weight Space

Author: Peter Dayan, Sham Kakade

Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.

3 0.40807196 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code

Author: Adrienne L. Fairhall, Geoffrey D. Lewen, William Bialek, Robert R. de Ruyter van Steveninck

Abstract: Many neural systems extend their dynamic range by adaptation. We examine the timescales of adaptation in the context of dynamically modulated rapidly-varying stimuli, and demonstrate in the fly visual system that adaptation to the statistical ensemble of the stimulus dynamically maximizes information transmission about the time-dependent stimulus. Further, while the rate response has long transients, the adaptation takes place on timescales consistent with optimal variance estimation.

4 0.40260121 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

Author: Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis

Abstract: A system has been developed to extract diagnostic information from jet engine carcass vibration data. Support Vector Machines applied to novelty detection provide a measure of how unusual the shape of a vibration signature is, by learning a representation of normality. We describe a novel method for Support Vector Machines of including information from a second class for novelty detection and give results from the application to Jet Engine vibration analysis.

5 0.39574108 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

6 0.38454396 40 nips-2000-Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli

7 0.36905679 113 nips-2000-Robust Reinforcement Learning

8 0.34514216 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

9 0.31744474 4 nips-2000-A Linear Programming Approach to Novelty Detection

10 0.30605239 42 nips-2000-Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics

11 0.30096081 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

12 0.28801939 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

13 0.27842519 105 nips-2000-Programmable Reinforcement Learning Agents

14 0.27615541 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

15 0.27567953 34 nips-2000-Competition and Arbors in Ocular Dominance

16 0.26567346 48 nips-2000-Exact Solutions to Time-Dependent MDPs

17 0.26302797 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

18 0.25818846 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

19 0.25807455 141 nips-2000-Universality and Individuality in a Neural Code

20 0.25216457 127 nips-2000-Structure Learning in Human Causal Induction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.033), (17, 0.08), (33, 0.037), (38, 0.025), (42, 0.045), (44, 0.339), (55, 0.027), (62, 0.089), (65, 0.026), (67, 0.038), (75, 0.015), (76, 0.027), (79, 0.016), (81, 0.057), (90, 0.015), (91, 0.015), (97, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81681955 43 nips-2000-Dopamine Bonuses

Author: Sham Kakade, Peter Dayan

Abstract: Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration. 1

2 0.36997774 80 nips-2000-Learning Switching Linear Models of Human Motion

Author: Vladimir Pavlovic, James M. Rehg, John MacCormick

Abstract: The human figure exhibits complex and rich dynamic behavior that is both nonlinear and time-varying. Effective models of human dynamics can be learned from motion capture data using switching linear dynamic system (SLDS) models. We present results for human motion synthesis, classification, and visual tracking using learned SLDS models. Since exact inference in SLDS is intractable, we present three approximate inference algorithms and compare their performance. In particular, a new variational inference algorithm is obtained by casting the SLDS model as a Dynamic Bayesian Network. Classification experiments show the superiority of SLDS over conventional HMM's for our problem domain.

3 0.36584839 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador

Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their

4 0.3655526 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code

Author: Adrienne L. Fairhall, Geoffrey D. Lewen, William Bialek, Robert R. de Ruyter van Steveninck

Abstract: Many neural systems extend their dynamic range by adaptation. We examine the timescales of adaptation in the context of dynamically modulated rapidly-varying stimuli, and demonstrate in the fly visual system that adaptation to the statistical ensemble of the stimulus dynamically maximizes information transmission about the time-dependent stimulus. Further, while the rate response has long transients, the adaptation takes place on timescales consistent with optimal variance estimation.

5 0.3614082 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

Author: Anders Jonsson, Andrew G. Barto

Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.

6 0.36098269 146 nips-2000-What Can a Single Neuron Compute?

7 0.35858962 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

8 0.35777017 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

9 0.35771781 92 nips-2000-Occam's Razor

10 0.35482877 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

11 0.35439953 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

12 0.35316277 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

13 0.35153705 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

14 0.3493028 103 nips-2000-Probabilistic Semantic Video Indexing

15 0.34896278 55 nips-2000-Finding the Key to a Synapse

16 0.34832186 125 nips-2000-Stability and Noise in Biochemical Switches

17 0.34667775 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script

18 0.3460286 49 nips-2000-Explaining Away in Weight Space

19 0.34425062 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

20 0.34417796 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing