nips nips2008 nips2008-7 knowledge-graph by maker-knowledge-mining

7 nips-2008-A computational model of hippocampal function in trace conditioning

Source: pdf

Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

Abstract: We introduce a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between cue and reward, these long-latency temporal elements are necessary for learning adaptively timed responses. For delay conditioning, the continued presence of the cue supports conditioned responding, and the short-latency elements suppress responding early in the cue. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended cues or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions. The hippocampus is an important structure in many types of learning and memory, with prominent involvement in spatial navigation, episodic and working memories, stimulus conﬁguration, and contextual conditioning. One empirical phenomenon that has eluded many theories of the hippocampus is the dependence of aversive trace conditioning on an intact hippocampus (but see Rodriguez & Levy, 2001; Schmajuk & DiCarlo, 1992; Yamazaki & Tanaka, 2005). For example, trace eyeblink conditioning disappears following hippocampal lesions (Solomon et al., 1986; Moyer, Jr. et al., 1990), induces hippocampal neurogenesis (Gould et al., 1999), and produces unique activity patterns in hippocampal neurons (McEchron & Disterhoft, 1997). In this paper, we present a new abstract computational model of hippocampal function during trace conditioning. We build on a recent extension of the temporal-difference (TD) model of conditioning (Ludvig, Sutton & Kehoe, 2008; Sutton & Barto, 1990) to demonstrate how the details of stimulus representation can qualitatively alter learning during trace and delay conditioning. By gently tweaking this stimulus representation and reducing long-latency temporal elements, trace conditioning is severely impaired, whereas delay conditioning is hardly affected. In the model, the hippocampus is responsible for maintaining these long-latency elements, thus explaining the selective importance of this brain structure in trace conditioning. The difference between trace and delay conditioning is one of the most basic operational distinctions in classical conditioning (e.g., Pavlov, 1927). Figure 1 is a schematic of the two training procedures. In trace conditioning, a conditioned stimulus (CS) is followed some time later by a reward or uncon1 Trace Delay Stimulus Reward Figure 1: Event timelines in trace and delay conditioning. Time ﬂows from left-to-right in the diagram. A vertical bar represents a punctate (short) event, and the extended box is a continuously available stimulus. In delay conditioning, the stimulus and reward overlap, whereas, in trace conditioning, there is a stimulus-free gap between the two punctate events. ditioned stimulus (US); the two stimuli are separated by a stimulus-free gap. In contrast, in delay conditioning, the CS remains on until presentation of the US. Trace conditioning is learned more slowly than delay conditioning, with poorer performance often observed even at asymptote. In both eyeblink conditioning (Moyer, Jr. et al., 1990; Solomon et al., 1986; Tseng et al., 2004) and fear conditioning (e.g., McEchron et al., 1998), hippocampal damage severely impairs the acquisition of conditioned responding during trace conditioning, but not delay conditioning. These selective hippocampal deﬁcits with trace conditioning are modulated by the inter-stimulus interval (ISI) between CS onset and US onset. With very short ISIs (∼300 ms in eyeblink conditioning in rabbits), there is little deﬁcit in the acquisition of responding during trace conditioning (Moyer, Jr. et al., 1990). Furthermore, with very long ISIs (>1000 ms), delay conditioning is also impaired by hippocampal lesions (Beylin et al., 2001). These interactions between ISI and the hippocampaldependency of conditioning are the primary data that motivate the new model. 1 TD Model of Conditioning Our full model of conditioning consists of three separate modules: the stimulus representation, learning algorithm, and response rule. The explanation of hippocampal function relies mostly on the details of the stimulus representation. To illustrate the implications of these representational issues, we have chosen the temporal-difference (TD) learning algorithm from reinforcement learning (Sutton & Barto, 1990, 1998) that has become the sine qua non for modeling reward learning in dopamine neurons (e.g., Ludvig et al., 2008; Schultz, Dayan, & Montague, 1997), and a simple, leaky-integrator response rule described below. We use these for simplicity and consistency with prior work; other learning algorithms and response rules might also yield similar conclusions. 1.1 Stimulus Representation In the model, stimuli are not coherent wholes, but are represented as a series of elements or internal microstimuli. There are two types of elements in the stimulus representation: the ﬁrst is the presence microstimulus, which is exactly equivalent to the external stimulus (Sutton & Barto, 1990). This microstimulus is available whenever the corresponding stimulus is on (see Fig. 3). The second type of elements are the temporal microstimuli or spectral traces, which are a series of successively later and gradually broadening elements (see Grossberg & Schmajuk, 1989; Machado, 1997; Ludvig et al., 2008). Below, we show how the interaction between these two types of representational elements produces different styles of learning in delay and trace conditioning, resulting in differential sensitivity of these procedures to hippocampal manipulation. The temporal microstimuli are created in the model through coarse coding of a decaying memory trace triggered by stimulus onset. Figure 2 illustrates how this memory trace (left panel) is encoded by a series of basis functions evenly spaced across the height of the trace (middle panel). Each basis function effectively acts as a receptive ﬁeld for trace height: As the memory trace fades, different basis functions become more or less active, each with a particular temporal proﬁle (right panel). These activity proﬁles for the temporal microstimuli are then used to generate predictions of the US. For the basis functions, we chose simple Gaussians: 1 (y − µ)2 f (y, µ, σ) = √ exp(− ). 2σ 2 2π 2 (1) 0.4 Microstimulus Level Trace Height 1 0.75 + 0.5 0.25 0 0 20 40 60 Time Step 0.3 0.2 0.1 0 Temporal Basis Functions 0 20 40 60 Time Step Figure 2: Creating Microstimuli. The memory traces for a stimulus (left) are coarsely coded by a series of temporal basis functions (middle). The resultant time courses (right) of the temporal microstimuli are used to predict future occurrence of the US. A single basis function (middle) and approximately corresponding microstimulus (right) have been darkened. The inset in the right panel shows the levels of several microstimuli at the time indicated by the dashed line. Given these basis functions, the microstimulus levels xt (i) at time t are determined by the corresponding memory trace height: xt (i) = f (yt , i/m, σ)yt , (2) where f is the basis function deﬁned above and m is the number of temporal microstimuli per stimulus. The trace level yt was set to 1 at stimulus onset and decreased exponentially, controlled by a single decay parameter, which was allowed to vary to simulate the effects of hippocampal lesions. Every stimulus, including the US, was represented by a single memory trace and resultant microstimuli. 1.2 Hippocampal Damage We propose that hippocampal damage results in the selective loss of the long-latency temporal elements of the stimulus representation. This idea is implemented in the model through a decrease in the memory decay constant from .985 to .97, approximately doubling the decay rate of the memory trace that determines the microstimuli. In effect, we assume that hippocampal damage results in a memory trace that decays more quickly, or, equivalently, is more susceptible to interference. Figure 3 shows the effects of this parameter manipulation on the time course of the elements in the stimulus representation. The presence microstimulus is not affected by this manipulation, but the temporal microstimuli are compressed for both the CS and the US. Each microstimulus has a briefer time course, and, as a group, they cover a shorter time span. Other means for eliminating or reducing the long-latency temporal microstimuli are certainly possible and would likely be compatible with our theory. For example, if one assumes that the stimulus representation contains multiple memory traces with different time constants, each with a separate set of microstimuli, then eliminating the slower memory traces would also remove the long-latency elements, and many of the results below hold (simulations not shown). The key point is that hippocampal damage reduces the number and magnitude of long-latency microstimuli. 1.3 Learning and Responding The model approaches conditioning as a reinforcement-learning prediction problem, wherein the agent tries to predict the upcoming rewards or USs. The model learns through linear TD(λ) (Ludvig et al., 2008; Schultz et al., 1997; Sutton, 1988; Sutton & Barto, 1990, 1998). At each time step, the US prediction (Vt ) is determined by: n T Vt (x) = wt x 0 = wt (i)x(i) i=1 3 , 0 (3) Microstimulus Level Normal Hippocampal 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 500 1000 0 500 1000 Time (ms) Figure 3: Hippocampal effects on the stimulus representation. The left panel presents the stimulus representation in delay conditioning with the normal parameter settings, and the right panel presents the altered stimulus representation following simulated hippocampal damage. In the hippocampal representation, the temporal microstimuli for both CS (red, solid lines) and US (green, dashed lines) are all briefer and shallower. The presence microstimuli (blue square wave and black spike) are not affected by the hippocampal manipulation. where x is a vector of the activation levels x(i) for the various microstimuli, wt is a corresponding vector of adjustable weights wt (i) at time step t, and n is the total number of all microstimuli. The US prediction is constrained to be non-negative, with negative values rectiﬁed to 0. As is standard in TD models, this US prediction is compared to the reward received and the previous US prediction to generate a TD error (δt ): δt = rt + γVt (xt ) − Vt (xt−1 ), (4) where γ is a discount factor that determines the temporal horizon of the US prediction. This TD error is then used to update the weight vector based on the following update rule: wt+1 = wt + αδt et , (5) where α is a step-size parameter and et is a vector of eligibility trace levels (see Sutton & Barto, 1998), which together help determine the speed of learning. Each microstimulus has its own corresponding eligibility trace which continuously decays, but accumulates whenever that microstimulus is present: et+1 = γλet + xt , (6) where γ is the discount factor as above and λ is a decay parameter that determines the plasticity window. These US predictions are translated into responses through a simple, thresholded leakyintegrator response rule: at+1 = υat + Vt+1 (xt ) θ , (7) where υ is a decay constant, and θ is a threshold on the value function V . Our model is deﬁned by Equations 1-7 and 7 additional parameters, which were ﬁxed at the following values for the simulations below: λ = .95, α = .005, γ = .97, n = 50, σ = .08, υ = .93, θ = .25. In the simulated experiments, one time step was interpreted as 10 ms. 4 CR Magnitude ISI250 5 4 3 3 Delay!Normal 2 Delay!HPC Trace!Normal 1 Trace!HPC 0 250 500 50 3 2 1 50 ISI1000 5 4 4 0 ISI500 5 2 1 250 500 0 50 250 500 Trials Figure 4: Learning in the model for trace and delay conditioning with and without hippocampal (HPC) damage. The three panels present training with different interstimulus intervals (ISI). 2 Results We simulated 12 total conditions with the model: trace and delay conditioning, both with and without hippocampal damage, for short (250 ms), medium (500 ms), and long (1000 ms) ISIs. Each simulated experiment was run for 500 trials, with every 5th trial an unreinforced probe trial, during which no US was presented. For delay conditioning, the CS lasted the same duration as the ISI and terminated with US presentation. For trace conditioning, the CS was present for 5 time steps (50 ms). The US always lasted for a single time step, and an inter-trial interval of 5000 ms separated all trials (onset to onset). Conditioned responding (CR magnitude) was measured as the maximum height of the response curve on a given trial. 0.8 CR Magnitude US Prediction Figure 4 summarizes our results. The ﬁgure depicts how the CR magnitude changed across the 500 trials of acquisition training. In general, trace conditioning produced lower levels of responding than delay conditioning, but this effect was most pronounced with the longest ISI. The effects of simulated hippocampal damage varied with the ISI. With the shortest ISI (250 ms; left panel), there was little effect on responding in either trace or delay conditioning. There was a small deﬁcit early in training with trace conditioning, but this difference disappeared quickly with further training. With the longest ISI (1000 ms; right panel), there was a profound effect on responding in both trace and delay conditioning, with trace conditioning completely eliminated. The intermediate ISI (500 ms; middle panel) produced the most complex and interesting results. With this interval, there was only a minor deﬁcit in delay conditioning, but a substantial drop in trace conditioning, especially early in training. This pattern of results roughly matches the empirical data, capturing the selective deﬁcit in trace conditioning caused by hippocampal lesions (Solomon et al., 1986) as well as the modulation of this deﬁcit by ISI (Beylin et al., 2001; Moyer, Jr. et al., 1990). Delay Trace 0.6 0.4 0.2 0 0 250 500 750 Time (ms) 5 4 3 2 1 0 0 250 500 750 Time (ms) Figure 5: Time course of US prediction and CR magnitude for both trace (red, dashed line) and delay conditioning (blue, solid line) with a 500-ms ISI. 5 These differences in sensitivity to simulated hippocampal damage arose despite similar model performance during normal trace and delay conditioning. Figure 5 shows the time course of the US prediction (left panel) and CR magnitude (right panel) after trace and delay conditioning on a probe trial with a 500-ms ISI. In both instances, the US prediction grew throughout the trial as the usual time of the US became imminent. Note the sharp drop off in US prediction for delay conditioning exactly as the CS terminates. This change reﬂects the disappearance of the presence microstimulus, which supports much of the responding in delay conditioning (see Fig. 6). In both procedures, even after the usual time of the US (and CS termination in the case of delay conditioning), there was still some residual US prediction. These US predictions were caused by the long-latency microstimuli, which did not disappear exactly at CS offset, and were ordinarily (on non-probe trials) countered by negative weights on the US microstimuli. The CR magnitude tracked the US prediction curve quite closely, peaking around the time the US would have occurred for both trace and delay conditioning. There was little difference in either curve between trace and delay conditioning, yet altering the stimulus representation (see Fig. 3) had a more pronounced effect on trace conditioning. An examination of the weight distribution for trace and delay conditioning explains why hippocampal damage had a more pronounced effect on trace than delay conditioning. Figure 6 depicts some representative microstimuli (left column) as well as their corresponding weights (right columns) following trace or delay conditioning with or without simulated hippocampal damage. For clarity in the ﬁgure, we have grouped the weights into four categories: positive (+), large positive (+++), negative (-), and large negative (--). The left column also depicts how the model poses the computational problem faced by an animal during conditioning; the goal is to sum together weighted versions of the available microstimuli to produce the ideal US prediction curve in the bottom row. In normal delay conditioning, the model placed a high positive weight on the presence microstimulus, but balanced that with large negative weights on the early CS microstimuli, producing a prediction topography that roughly matched the ideal prediction (see Fig. 5, left panel). In normal trace conditioning, the model only placed a small positive weight on the presence microstimulus, but supplemented that with large positive weights on both the early and late CS microstimuli, also producing a prediction topography that roughly matched the ideal prediction. Weights Normal HPC Lesion Delay CS Presence Stimulus CS Early Microstimuli CS Late Microstimuli US Early Microstimuli Trace Delay Trace +++ + +++ + -- + -- + + +++ N/A N/A - -- - - Ideal Summed Prediction Figure 6: Schematic of the weights (right columns) on various microstimuli following trace and delay conditioning. The left column illustrates four representative microstimuli: the presence microstimulus, an early CS microstimulus, a late CS microstimulus, and a US microstimulus. The ideal prediction is the expectation of the sum of future discounted rewards. 6 Following hippocampal lesions, the late CS microstimuli were no longer available (N/A), and the system could only use the other microstimuli to generate the best possible prediction proﬁle. In delay conditioning, the loss of these long-latency microstimuli had a small effect, notable only with the longest ISI (1000 ms) with these parameter settings. With trace conditioning, the loss of the long-latency microstimuli was catastrophic, as these microstimuli were usually the major basis for the prediction of the upcoming US. As a result, trace conditioning became much more difﬁcult (or impossible in the case of the 1000-ms ISI), even though delay conditioning was less affected. The most notable (and deﬁning) difference between trace and delay conditioning is that the CS and US overlap in delay conditioning, but not trace conditioning. In our model, this overlap is necessary, but not sufﬁcient, for the the unique interaction between the presence microstimulus and temporal microstimuli in delay conditioning. For example, if the CS were extended to stay on beyond the time of US occurrence, this contiguity would be maintained, but negative weights on the early CS microstimuli would not sufﬁce to suppress responding throughout this extended CS. In this case, the best solution to predicting the US for the model might be to put high weights on the long-latency temporal microstimuli (as in trace conditioning; see Fig 6), which would not persist as long as the now extended presence microstimulus. Indeed, with a CS that was three times as long as the ISI, we found that the US prediction, CR magnitude, and underlying weights were completely indistinguishable from trace conditioning (simulations not shown). Thus, the model predicts that this extended delay conditioning should be equally sensitive to hippocampal damage as trace conditioning for the same ISIs. This empirical prediction is a fundamental test of the representational assumptions underlying the model. The particular mechanism that we chose for simulating the loss of the long-latency microstimuli (increasing the decay rate of the memory trace) also leads to a testable model prediction. If one were to pre-train an animal with trace conditioning and then perform hippocampal lesions, there should be some loss of responding, but, more importantly, those CRs that do occur should appear earlier in the interval because the temporal microstimuli now follow a shorter time course (see Fig. 3). There is some evidence for additional short-latency CRs during trace conditioning in lesioned animals (e.g., Port et al., 1986; Solomon et al., 1986), but, to our knowledge, this precise model prediction has not been rigorously evaluated. 3 Discussion and Conclusion We evaluated a novel computational model for the role of the hippocampus in trace conditioning, based on a reinforcement-learning framework. We extended the microstimulus TD model presented by Ludvig et al. (2008) by suggesting a role for the hippocampus in maintaining long-latency elements of the temporal stimulus representation. The current model also introduced an additional element to the stimulus representation (the presence microstimulus) and a simple response rule for translating prediction into actions; we showed how these subtle innovations yield interesting interactions when comparing trace and delay conditioning. In addition, we adduced a pair of testable model predictions about the effects of extended stimuli and post-training lesions. There are several existing theories for the role of the hippocampus in trace conditioning, including the modulation of timing (Solomon et al., 1986), establishment of contiguity (e.g., Wallenstein et al., 1998), and overcoming of task difﬁculty (Beylin et al., 2001). Our new model provides a computational mechanism that links these three proposed explanations. In our model, for similar ISIs, delay conditioning requires learning to suppress responding early in the CS, whereas trace conditioning requires learning to create responding later in the trial, near the time of the US (see Fig. 6). As a result, for the same ISI, delay conditioning requires changing weights associated with earlier microstimuli than trace conditioning, though in opposite directions. These early microstimuli reach higher activation levels (see Fig. 2), producing higher eligibility traces, and are therefore learned about more quickly. This differential speed of learning for short-latency temporal microstimuli corresponds with much behavioural data that shorter ISIs tend to improve both the speed and asymptote of learning in eyeblink conditioning (e.g., Schneiderman & Gormerzano, 1964). Thus, the contiguity between the CS and US in delay conditioning alters the timing problem that the animal faces, effectively making the time interval to be learned shorter, and rendering the task easier for most ISIs. In future work, it will be important to characterize the exact mathematical properties that constrain the temporal microstimuli. Our simple Gaussian basis function approach sufﬁces for the datasets 7 examined here (cf. Ludvig et al., 2008), but other related mathematical functions are certainly possible. For example, replacing the temporal microstimuli in our model with the spectral traces of Grossberg & Schmajuk (1989) produces results that are similar to ours, but using sequences of Gamma-shaped functions tends to fail, with longer intervals learned too slowly relative to shorter intervals. One important characteristic of the microstimulus series seems to be that the heights of individual elements should not decay too quickly. Another key challenge for future modeling is reconciling this abstract account of hippocampal function in trace conditioning with approaches that consider greater physiological detail (e.g., Rodriguez & Levy, 2001; Yamazaki & Tanaka, 2005). The current model also contributes to our understanding of the TD models of dopamine (e.g., Schultz et al., 1997) and classical conditioning (Sutton & Barto, 1990). These models have often given short shrift to issues of stimulus representation, focusing more closely on the properties of the learning algorithm (but see Ludvig et al., 2008). Here, we reveal how the interaction of various stimulus representations in conjunction with the TD learning rule produces a viable model of some of the differences between trace and delay conditioning. References Beylin, A. V., Gandhi, C. C, Wood, G. E., Talk, A. C., Matzel, L. D., & Shors, T. J. (2001). The role of the hippocampus in trace conditioning: Temporal discontinuity or task difﬁculty? Neurobiology of Learning & Memory, 76, 447-61. Gould, E., Beylin, A., Tanapat, P., Reeves, A., & Shors, T. J. (1999). Learning enhances adult neurogenesis in the hippocampal formation. Nature Neuroscience, 2, 260-5. Grossberg, S., & Schmajuk, N. A. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79-102. Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20, 3034-54. Machado, A. (1997). Learning the temporal dynamics of behavior. Psychological Review, 104, 241-265. McEchron, M. D., Bouwmeester, H., Tseng, W., Weiss, C., & Disterhoft, J. F. (1998). Hippocampectomy disrupts auditory trace fear conditioning and contextual fear conditioning in the rat. Hippocampus, 8, 63846. McEchron, M. D., Disterhoft, J. F. (1997). Sequence of single neuron changes in CA1 hippocampus of rabbits during acquisition of trace eyeblink conditioned responses. Journal of Neurophysiology, 78, 1030-44. Moyer, J. R., Jr., Deyo, R. A., & Disterhoft, J. F. (1990). Hippocampectomy disrupts trace eye-blink conditioning in rabbits. Behavioral Neuroscience, 104, 243-52. Pavlov, I. P. (1927). Conditioned Reﬂexes. London: Oxford University Press. Port, R. L., Romano, A. G., Steinmetz, J. E., Mikhail, A. A., & Patterson, M. M. (1986). Retention and acquisition of classical trace conditioned responses by rabbits with hippocampal lesions. Behavioral Neuroscience, 100, 745-752. Rodriguez, P., & Levy, W. B. (2001). A model of hippocampal activity in trace conditioning: Where’s the trace? Behavioral Neuroscience, 115, 1224-1238. Schmajuk, N. A., & DiCarlo, J. J. (1992). Stimulus conﬁguration, classical conditioning, and hippocampal function. Psychological Review, 99, 268-305. Schneiderman, N., & Gormezano, I. (1964). Conditioning of the nictitating membrane of the rabbit as a function of CS-US interval. Journal of Comparative and Physiological Psychology, 57, 188-195. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-9. Solomon, P. R., Vander Schaaf, E. R., Thompson, R. F., & Weisz, D. J. (1986). Hippocampus and trace conditioning of the rabbit’s classically conditioned nictitating membrane response. Behavioral Neuroscience, 100, 729-744. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 497-537). Cambridge, MA: MIT Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press. Tseng, W., Guan, R., Disterhoft, J. F., & Weiss, C. (2004). Trace eyeblink conditioning is hippocampally dependent in mice. Hippocampus, 14, 58-65. Wallenstein, G., Eichenbaum, H., & Hasselmo, M. (1998). The hippocampus as an associator of discontiguous events. Trends in Neuroscience, 21, 317-323. Yamazaki, T., & Tanaka, S. (2005). A neural network model for trace conditioning. International Journal of Neural Systems, 15, 23-30. 8

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 A computational model of hippocampal function in trace conditioning Elliot A. [sent-1, score-1.092]

2 au Abstract We introduce a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. [sent-9, score-0.865]

3 These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. [sent-11, score-0.897]

4 The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. [sent-12, score-0.546]

5 For trace conditioning, with no contiguity between cue and reward, these long-latency temporal elements are necessary for learning adaptively timed responses. [sent-13, score-0.557]

6 For delay conditioning, the continued presence of the cue supports conditioned responding, and the short-latency elements suppress responding early in the cue. [sent-14, score-0.631]

7 In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. [sent-15, score-1.198]

8 These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions. [sent-18, score-0.8]

9 One empirical phenomenon that has eluded many theories of the hippocampus is the dependence of aversive trace conditioning on an intact hippocampus (but see Rodriguez & Levy, 2001; Schmajuk & DiCarlo, 1992; Yamazaki & Tanaka, 2005). [sent-20, score-1.001]

10 For example, trace eyeblink conditioning disappears following hippocampal lesions (Solomon et al. [sent-21, score-1.265]

11 In this paper, we present a new abstract computational model of hippocampal function during trace conditioning. [sent-26, score-0.689]

12 We build on a recent extension of the temporal-difference (TD) model of conditioning (Ludvig, Sutton & Kehoe, 2008; Sutton & Barto, 1990) to demonstrate how the details of stimulus representation can qualitatively alter learning during trace and delay conditioning. [sent-27, score-1.321]

13 By gently tweaking this stimulus representation and reducing long-latency temporal elements, trace conditioning is severely impaired, whereas delay conditioning is hardly affected. [sent-28, score-1.815]

14 In the model, the hippocampus is responsible for maintaining these long-latency elements, thus explaining the selective importance of this brain structure in trace conditioning. [sent-29, score-0.514]

15 The difference between trace and delay conditioning is one of the most basic operational distinctions in classical conditioning (e. [sent-30, score-1.564]

16 In trace conditioning, a conditioned stimulus (CS) is followed some time later by a reward or uncon1 Trace Delay Stimulus Reward Figure 1: Event timelines in trace and delay conditioning. [sent-34, score-1.334]

17 In delay conditioning, the stimulus and reward overlap, whereas, in trace conditioning, there is a stimulus-free gap between the two punctate events. [sent-37, score-0.945]

18 Trace conditioning is learned more slowly than delay conditioning, with poorer performance often observed even at asymptote. [sent-40, score-0.777]

19 , 1998), hippocampal damage severely impairs the acquisition of conditioned responding during trace conditioning, but not delay conditioning. [sent-48, score-1.328]

20 These selective hippocampal deﬁcits with trace conditioning are modulated by the inter-stimulus interval (ISI) between CS onset and US onset. [sent-49, score-1.14]

21 With very short ISIs (∼300 ms in eyeblink conditioning in rabbits), there is little deﬁcit in the acquisition of responding during trace conditioning (Moyer, Jr. [sent-50, score-1.457]

22 Furthermore, with very long ISIs (>1000 ms), delay conditioning is also impaired by hippocampal lesions (Beylin et al. [sent-53, score-1.211]

23 These interactions between ISI and the hippocampaldependency of conditioning are the primary data that motivate the new model. [sent-55, score-0.403]

24 1 TD Model of Conditioning Our full model of conditioning consists of three separate modules: the stimulus representation, learning algorithm, and response rule. [sent-56, score-0.542]

25 The explanation of hippocampal function relies mostly on the details of the stimulus representation. [sent-57, score-0.444]

26 The second type of elements are the temporal microstimuli or spectral traces, which are a series of successively later and gradually broadening elements (see Grossberg & Schmajuk, 1989; Machado, 1997; Ludvig et al. [sent-68, score-0.61]

27 Below, we show how the interaction between these two types of representational elements produces different styles of learning in delay and trace conditioning, resulting in differential sensitivity of these procedures to hippocampal manipulation. [sent-70, score-1.116]

28 The temporal microstimuli are created in the model through coarse coding of a decaying memory trace triggered by stimulus onset. [sent-71, score-1.07]

29 Figure 2 illustrates how this memory trace (left panel) is encoded by a series of basis functions evenly spaced across the height of the trace (middle panel). [sent-72, score-0.879]

30 Each basis function effectively acts as a receptive ﬁeld for trace height: As the memory trace fades, different basis functions become more or less active, each with a particular temporal proﬁle (right panel). [sent-73, score-0.958]

31 These activity proﬁles for the temporal microstimuli are then used to generate predictions of the US. [sent-74, score-0.498]

32 The resultant time courses (right) of the temporal microstimuli are used to predict future occurrence of the US. [sent-85, score-0.498]

33 The inset in the right panel shows the levels of several microstimuli at the time indicated by the dashed line. [sent-87, score-0.455]

34 Given these basis functions, the microstimulus levels xt (i) at time t are determined by the corresponding memory trace height: xt (i) = f (yt , i/m, σ)yt , (2) where f is the basis function deﬁned above and m is the number of temporal microstimuli per stimulus. [sent-88, score-1.249]

35 The trace level yt was set to 1 at stimulus onset and decreased exponentially, controlled by a single decay parameter, which was allowed to vary to simulate the effects of hippocampal lesions. [sent-89, score-0.887]

36 Every stimulus, including the US, was represented by a single memory trace and resultant microstimuli. [sent-90, score-0.433]

37 2 Hippocampal Damage We propose that hippocampal damage results in the selective loss of the long-latency temporal elements of the stimulus representation. [sent-92, score-0.679]

38 97, approximately doubling the decay rate of the memory trace that determines the microstimuli. [sent-95, score-0.467]

39 In effect, we assume that hippocampal damage results in a memory trace that decays more quickly, or, equivalently, is more susceptible to interference. [sent-96, score-0.826]

40 The presence microstimulus is not affected by this manipulation, but the temporal microstimuli are compressed for both the CS and the US. [sent-98, score-0.762]

41 Other means for eliminating or reducing the long-latency temporal microstimuli are certainly possible and would likely be compatible with our theory. [sent-100, score-0.498]

42 The key point is that hippocampal damage reduces the number and magnitude of long-latency microstimuli. [sent-102, score-0.42]

43 3 Learning and Responding The model approaches conditioning as a reinforcement-learning prediction problem, wherein the agent tries to predict the upcoming rewards or USs. [sent-104, score-0.444]

44 The left panel presents the stimulus representation in delay conditioning with the normal parameter settings, and the right panel presents the altered stimulus representation following simulated hippocampal damage. [sent-117, score-1.551]

45 In the hippocampal representation, the temporal microstimuli for both CS (red, solid lines) and US (green, dashed lines) are all briefer and shallower. [sent-118, score-0.826]

46 The presence microstimuli (blue square wave and black spike) are not affected by the hippocampal manipulation. [sent-119, score-0.75]

47 This TD error is then used to update the weight vector based on the following update rule: wt+1 = wt + αδt et , (5) where α is a step-size parameter and et is a vector of eligibility trace levels (see Sutton & Barto, 1998), which together help determine the speed of learning. [sent-123, score-0.527]

48 Each microstimulus has its own corresponding eligibility trace which continuously decays, but accumulates whenever that microstimulus is present: et+1 = γλet + xt , (6) where γ is the discount factor as above and λ is a decay parameter that determines the plasticity window. [sent-124, score-0.915]

49 HPC 0 250 500 50 3 2 1 50 ISI1000 5 4 4 0 ISI500 5 2 1 250 500 0 50 250 500 Trials Figure 4: Learning in the model for trace and delay conditioning with and without hippocampal (HPC) damage. [sent-138, score-1.466]

50 2 Results We simulated 12 total conditions with the model: trace and delay conditioning, both with and without hippocampal damage, for short (250 ms), medium (500 ms), and long (1000 ms) ISIs. [sent-140, score-1.09]

51 For delay conditioning, the CS lasted the same duration as the ISI and terminated with US presentation. [sent-142, score-0.394]

52 For trace conditioning, the CS was present for 5 time steps (50 ms). [sent-143, score-0.384]

53 In general, trace conditioning produced lower levels of responding than delay conditioning, but this effect was most pronounced with the longest ISI. [sent-149, score-1.28]

54 The effects of simulated hippocampal damage varied with the ISI. [sent-150, score-0.42]

55 With the shortest ISI (250 ms; left panel), there was little effect on responding in either trace or delay conditioning. [sent-151, score-0.857]

56 There was a small deﬁcit early in training with trace conditioning, but this difference disappeared quickly with further training. [sent-152, score-0.419]

57 With the longest ISI (1000 ms; right panel), there was a profound effect on responding in both trace and delay conditioning, with trace conditioning completely eliminated. [sent-153, score-1.664]

58 With this interval, there was only a minor deﬁcit in delay conditioning, but a substantial drop in trace conditioning, especially early in training. [sent-155, score-0.793]

59 This pattern of results roughly matches the empirical data, capturing the selective deﬁcit in trace conditioning caused by hippocampal lesions (Solomon et al. [sent-156, score-1.22]

60 2 0 0 250 500 750 Time (ms) 5 4 3 2 1 0 0 250 500 750 Time (ms) Figure 5: Time course of US prediction and CR magnitude for both trace (red, dashed line) and delay conditioning (blue, solid line) with a 500-ms ISI. [sent-164, score-1.229]

61 5 These differences in sensitivity to simulated hippocampal damage arose despite similar model performance during normal trace and delay conditioning. [sent-165, score-1.204]

62 Figure 5 shows the time course of the US prediction (left panel) and CR magnitude (right panel) after trace and delay conditioning on a probe trial with a 500-ms ISI. [sent-166, score-1.25]

63 Note the sharp drop off in US prediction for delay conditioning exactly as the CS terminates. [sent-168, score-0.818]

64 This change reﬂects the disappearance of the presence microstimulus, which supports much of the responding in delay conditioning (see Fig. [sent-169, score-0.914]

65 The CR magnitude tracked the US prediction curve quite closely, peaking around the time the US would have occurred for both trace and delay conditioning. [sent-173, score-0.826]

66 There was little difference in either curve between trace and delay conditioning, yet altering the stimulus representation (see Fig. [sent-174, score-0.918]

67 An examination of the weight distribution for trace and delay conditioning explains why hippocampal damage had a more pronounced effect on trace than delay conditioning. [sent-176, score-2.312]

68 Figure 6 depicts some representative microstimuli (left column) as well as their corresponding weights (right columns) following trace or delay conditioning with or without simulated hippocampal damage. [sent-177, score-1.945]

69 The left column also depicts how the model poses the computational problem faced by an animal during conditioning; the goal is to sum together weighted versions of the available microstimuli to produce the ideal US prediction curve in the bottom row. [sent-179, score-0.492]

70 In normal delay conditioning, the model placed a high positive weight on the presence microstimulus, but balanced that with large negative weights on the early CS microstimuli, producing a prediction topography that roughly matched the ideal prediction (see Fig. [sent-180, score-0.634]

71 In normal trace conditioning, the model only placed a small positive weight on the presence microstimulus, but supplemented that with large positive weights on both the early and late CS microstimuli, also producing a prediction topography that roughly matched the ideal prediction. [sent-182, score-0.631]

72 6 Following hippocampal lesions, the late CS microstimuli were no longer available (N/A), and the system could only use the other microstimuli to generate the best possible prediction proﬁle. [sent-186, score-1.188]

73 In delay conditioning, the loss of these long-latency microstimuli had a small effect, notable only with the longest ISI (1000 ms) with these parameter settings. [sent-187, score-0.801]

74 With trace conditioning, the loss of the long-latency microstimuli was catastrophic, as these microstimuli were usually the major basis for the prediction of the upcoming US. [sent-188, score-1.264]

75 As a result, trace conditioning became much more difﬁcult (or impossible in the case of the 1000-ms ISI), even though delay conditioning was less affected. [sent-189, score-1.564]

76 The most notable (and deﬁning) difference between trace and delay conditioning is that the CS and US overlap in delay conditioning, but not trace conditioning. [sent-190, score-1.919]

77 In our model, this overlap is necessary, but not sufﬁcient, for the the unique interaction between the presence microstimulus and temporal microstimuli in delay conditioning. [sent-191, score-1.136]

78 For example, if the CS were extended to stay on beyond the time of US occurrence, this contiguity would be maintained, but negative weights on the early CS microstimuli would not sufﬁce to suppress responding throughout this extended CS. [sent-192, score-0.639]

79 In this case, the best solution to predicting the US for the model might be to put high weights on the long-latency temporal microstimuli (as in trace conditioning; see Fig 6), which would not persist as long as the now extended presence microstimulus. [sent-193, score-0.945]

80 Indeed, with a CS that was three times as long as the ISI, we found that the US prediction, CR magnitude, and underlying weights were completely indistinguishable from trace conditioning (simulations not shown). [sent-194, score-0.812]

81 Thus, the model predicts that this extended delay conditioning should be equally sensitive to hippocampal damage as trace conditioning for the same ISIs. [sent-195, score-1.957]

82 The particular mechanism that we chose for simulating the loss of the long-latency microstimuli (increasing the decay rate of the memory trace) also leads to a testable model prediction. [sent-197, score-0.49]

83 There is some evidence for additional short-latency CRs during trace conditioning in lesioned animals (e. [sent-200, score-0.787]

84 3 Discussion and Conclusion We evaluated a novel computational model for the role of the hippocampus in trace conditioning, based on a reinforcement-learning framework. [sent-205, score-0.491]

85 The current model also introduced an additional element to the stimulus representation (the presence microstimulus) and a simple response rule for translating prediction into actions; we showed how these subtle innovations yield interesting interactions when comparing trace and delay conditioning. [sent-208, score-0.997]

86 There are several existing theories for the role of the hippocampus in trace conditioning, including the modulation of timing (Solomon et al. [sent-210, score-0.567]

87 In our model, for similar ISIs, delay conditioning requires learning to suppress responding early in the CS, whereas trace conditioning requires learning to create responding later in the trial, near the time of the US (see Fig. [sent-217, score-1.821]

88 As a result, for the same ISI, delay conditioning requires changing weights associated with earlier microstimuli than trace conditioning, though in opposite directions. [sent-219, score-1.593]

89 These early microstimuli reach higher activation levels (see Fig. [sent-220, score-0.442]

90 This differential speed of learning for short-latency temporal microstimuli corresponds with much behavioural data that shorter ISIs tend to improve both the speed and asymptote of learning in eyeblink conditioning (e. [sent-222, score-1.007]

91 Thus, the contiguity between the CS and US in delay conditioning alters the timing problem that the animal faces, effectively making the time interval to be learned shorter, and rendering the task easier for most ISIs. [sent-225, score-0.856]

92 For example, replacing the temporal microstimuli in our model with the spectral traces of Grossberg & Schmajuk (1989) produces results that are similar to ours, but using sequences of Gamma-shaped functions tends to fail, with longer intervals learned too slowly relative to shorter intervals. [sent-230, score-0.597]

93 Another key challenge for future modeling is reconciling this abstract account of hippocampal function in trace conditioning with approaches that consider greater physiological detail (e. [sent-232, score-1.092]

94 Here, we reveal how the interaction of various stimulus representations in conjunction with the TD learning rule produces a viable model of some of the differences between trace and delay conditioning. [sent-241, score-0.897]

95 The role of the hippocampus in trace conditioning: Temporal discontinuity or task difﬁculty? [sent-254, score-0.491]

96 Hippocampectomy disrupts auditory trace fear conditioning and contextual fear conditioning in the rat. [sent-292, score-1.278]

97 Sequence of single neuron changes in CA1 hippocampus of rabbits during acquisition of trace eyeblink conditioned responses. [sent-299, score-0.651]

98 Retention and acquisition of classical trace conditioned responses by rabbits with hippocampal lesions. [sent-327, score-0.781]

99 A model of hippocampal activity in trace conditioning: Where’s the trace? [sent-333, score-0.689]

100 Hippocampus and trace conditioning of the rabbit’s classically conditioned nictitating membrane response. [sent-363, score-0.838]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('microstimuli', 0.407), ('conditioning', 0.403), ('trace', 0.384), ('delay', 0.374), ('hippocampal', 0.305), ('microstimulus', 0.226), ('cs', 0.149), ('stimulus', 0.139), ('isi', 0.109), ('hippocampus', 0.107), ('ludvig', 0.102), ('responding', 0.099), ('temporal', 0.091), ('damage', 0.088), ('ms', 0.07), ('eyeblink', 0.068), ('solomon', 0.068), ('sutton', 0.066), ('td', 0.066), ('lesions', 0.059), ('cr', 0.059), ('beylin', 0.057), ('disterhoft', 0.057), ('moyer', 0.057), ('barto', 0.051), ('cit', 0.049), ('contiguity', 0.049), ('schmajuk', 0.049), ('memory', 0.049), ('panel', 0.048), ('et', 0.046), ('hpc', 0.045), ('mcechron', 0.045), ('prediction', 0.041), ('traces', 0.04), ('shorter', 0.038), ('presence', 0.038), ('height', 0.037), ('isis', 0.036), ('early', 0.035), ('fear', 0.034), ('grossberg', 0.034), ('kehoe', 0.034), ('rabbits', 0.034), ('tseng', 0.034), ('yamazaki', 0.034), ('decay', 0.034), ('elements', 0.033), ('timing', 0.03), ('topography', 0.03), ('acquisition', 0.03), ('schultz', 0.029), ('late', 0.028), ('vt', 0.028), ('conditioned', 0.028), ('simulated', 0.027), ('us', 0.027), ('tanaka', 0.027), ('magnitude', 0.027), ('wt', 0.027), ('normal', 0.026), ('levy', 0.025), ('rodriguez', 0.025), ('onset', 0.025), ('neuroscience', 0.025), ('basis', 0.025), ('weights', 0.025), ('reward', 0.025), ('ideal', 0.024), ('eligibility', 0.024), ('impaired', 0.024), ('suppress', 0.024), ('selective', 0.023), ('dopamine', 0.023), ('briefer', 0.023), ('dicarlo', 0.023), ('hippocampectomy', 0.023), ('machado', 0.023), ('neurogenesis', 0.023), ('nictitating', 0.023), ('pavlov', 0.023), ('port', 0.023), ('punctate', 0.023), ('rabbit', 0.023), ('schneiderman', 0.023), ('shors', 0.023), ('wallenstein', 0.023), ('xt', 0.021), ('intervals', 0.021), ('trial', 0.021), ('representation', 0.021), ('longest', 0.02), ('representational', 0.02), ('crs', 0.02), ('depicts', 0.02), ('disrupts', 0.02), ('impairs', 0.02), ('lasted', 0.02), ('wholes', 0.02), ('behavioral', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 7 nips-2008-A computational model of hippocampal function in trace conditioning

Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

2 0.079874784 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

Author: Ralf M. Haefner, Bruce G. Cumming

Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantiﬁcation of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modiﬁcation of the traditional formula that signiﬁcantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overﬁtting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and ﬁnd that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1

3 0.079300135 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement

Author: Michael T. Todd, Yael Niv, Jonathan D. Cohen

Abstract: Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework [14] is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control [5-6]. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards. 1 I n t ro d u c t i o n Working memory is loosely defined in cognitive neuroscience as information that is (1) internally maintained on a temporary or short term basis, and (2) required for tasks in which immediate observations cannot be mapped to correct actions. It is widely assumed that prefrontal cortex (PFC) plays a role in maintaining and updating working memory. However, relatively little is known about how PFC develops useful working memory representations for a new task. Furthermore, current work focuses on describing the structure and limitations of working memory, but does not ask why, or in what general class of tasks, is it necessary. Borrowing from the theory of optimal control in partially observable Markov decision problems (POMDPs), we frame the psychological concept of working memory as an internal state representation, developed and employed to maximize future reward in partially observable environments. We combine computational insights from POMDPs and neurobiologically plausible models from cognitive neuroscience to suggest a simple reinforcement learning (RL) model of working memory function that can be implemented through dopaminergic training of the basal ganglia and PFC. The Gating framework is a series of cognitive neuroscience models developed to explain how dopaminergic RL signals can shape useful working memory representations [1-4]. Computationally this framework models working memory as a collection of past observations, each of which can occasionally be replaced with the current observation, and addresses the problem of learning when to update each memory element versus maintaining it. In the original Gating model [1-2] the PFC contained a unitary working memory representation that was updated whenever a phasic dopamine (DA) burst occurred (e.g., due to unexpected reward or novelty). That model was the first to connect working memory and RL via the temporal difference (TD) model of DA firing [7-8], and thus to suggest how working memory might serve a normative purpose. However, that model had limited computational flexibility due to the unitary nature of the working memory (i.e., a singleobservation memory controlled by a scalar DA signal). More recent work [3-4] has partially repositioned the Gating framework within the Actor/Critic model of mesostriatal RL [9-10], positing memory updating as but another cortical action controlled by the dorsal striatal

4 0.076981507 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes

Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence

Abstract: Identiﬁcation and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many ﬁelds, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1

5 0.069887578 231 nips-2008-Temporal Dynamics of Cognitive Control

Author: Jeremy Reynolds, Michael C. Mozer

Abstract: Cognitive control refers to the ﬂexible deployment of memory and attention in response to task demands and current goals. Control is often studied experimentally by presenting sequences of stimuli, some demanding a response, and others modulating the stimulus-response mapping. In these tasks, participants must maintain information about the current stimulus-response mapping in working memory. Prominent theories of cognitive control use recurrent neural nets to implement working memory, and optimize memory utilization via reinforcement learning. We present a novel perspective on cognitive control in which working memory representations are intrinsically probabilistic, and control operations that maintain and update working memory are dynamically determined via probabilistic inference. We show that our model provides a parsimonious account of behavioral and neuroimaging data, and suggest that it offers an elegant conceptualization of control in which behavior can be cast as optimal, subject to limitations on learning and the rate of information processing. Moreover, our model provides insight into how task instructions can be directly translated into appropriate behavior and then efﬁciently reﬁned with subsequent task experience. 1

6 0.067455038 206 nips-2008-Sequential effects: Superstition or rational behavior?

7 0.063667916 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

8 0.061682031 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

9 0.058074269 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance

10 0.054776214 60 nips-2008-Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds

11 0.053005073 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

12 0.052204333 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

13 0.049252767 109 nips-2008-Interpreting the neural code with Formal Concept Analysis

14 0.049066901 204 nips-2008-Self-organization using synaptic plasticity

15 0.047755525 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

16 0.045419026 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG

17 0.044260073 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters

18 0.043940566 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

19 0.043662582 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

20 0.042330336 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.096), (1, 0.081), (2, 0.073), (3, 0.028), (4, 0.004), (5, 0.036), (6, -0.017), (7, 0.025), (8, 0.057), (9, 0.03), (10, -0.008), (11, 0.057), (12, -0.089), (13, -0.003), (14, 0.043), (15, 0.06), (16, 0.064), (17, 0.017), (18, 0.007), (19, -0.103), (20, -0.068), (21, 0.007), (22, -0.003), (23, 0.042), (24, -0.04), (25, 0.042), (26, 0.003), (27, -0.034), (28, 0.033), (29, -0.075), (30, -0.041), (31, -0.119), (32, -0.034), (33, -0.118), (34, 0.012), (35, 0.024), (36, -0.037), (37, -0.029), (38, 0.039), (39, 0.09), (40, 0.117), (41, -0.009), (42, 0.046), (43, -0.036), (44, -0.06), (45, -0.011), (46, -0.007), (47, 0.02), (48, 0.076), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97256249 7 nips-2008-A computational model of hippocampal function in trace conditioning

Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

2 0.56579566 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters

Author: Matt Jones, Sachiko Kinoshita, Michael C. Mozer

Abstract: In most cognitive and motor tasks, speed-accuracy tradeoffs are observed: Individuals can respond slowly and accurately, or quickly yet be prone to errors. Control mechanisms governing the initiation of behavioral responses are sensitive not only to task instructions and the stimulus being processed, but also to the recent stimulus history. When stimuli can be characterized on an easy-hard dimension (e.g., word frequency in a naming task), items preceded by easy trials are responded to more quickly, and with more errors, than items preceded by hard trials. We propose a rationally motivated mathematical model of this sequential adaptation of control, based on a diffusion model of the decision process in which difﬁculty corresponds to the drift rate for the correct response. The model assumes that responding is based on the posterior distribution over which response is correct, conditioned on the accumulated evidence. We derive this posterior as a function of the drift rate, and show that higher estimates of the drift rate lead to (normatively) faster responding. Trial-by-trial tracking of difﬁculty thus leads to sequential effects in speed and accuracy. Simulations show the model explains a variety of phenomena in human speeded decision making. We argue this passive statistical mechanism provides a more elegant and parsimonious account than extant theories based on elaborate control structures. 1

3 0.55427045 231 nips-2008-Temporal Dynamics of Cognitive Control

Author: Jeremy Reynolds, Michael C. Mozer

4 0.55223638 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

Author: Carmen Sandi, Wulfram Gerstner, Gediminas Lukšys

Abstract: Suppose we train an animal in a conditioning experiment. Can one predict how a given animal, under given experimental conditions, would perform the task? Since various factors such as stress, motivation, genetic background, and previous errors in task performance can inﬂuence animal behaviour, this appears to be a very challenging aim. Reinforcement learning (RL) models have been successful in modeling animal (and human) behaviour, but their success has been limited because of uncertainty as to how to set meta-parameters (such as learning rate, exploitation-exploration balance and future reward discount factor) that strongly inﬂuence model performance. We show that a simple RL model whose metaparameters are controlled by an artiﬁcial neural network, fed with inputs such as stress, affective phenotype, previous task performance, and even neuromodulatory manipulations, can successfully predict mouse behaviour in the ”hole-box” - a simple conditioning task. Our results also provide important insights on how stress and anxiety affect animal learning, performance accuracy, and discounting of future rewards, and on how noradrenergic systems can interact with these processes. 1

5 0.51558912 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement

Author: Michael T. Todd, Yael Niv, Jonathan D. Cohen

6 0.49251205 109 nips-2008-Interpreting the neural code with Formal Concept Analysis

7 0.49202463 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance

8 0.48564711 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models

9 0.4824321 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

10 0.44162983 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

11 0.42533872 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

12 0.42221075 206 nips-2008-Sequential effects: Superstition or rational behavior?

13 0.40741274 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

14 0.40545556 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

15 0.38005316 124 nips-2008-Load and Attentional Bayes

16 0.37265956 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction

17 0.34992638 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

18 0.33763433 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

19 0.33280456 27 nips-2008-Artificial Olfactory Brain for Mixture Identification

20 0.32601923 8 nips-2008-A general framework for investigating how far the decoding process in the brain can be simplified

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.033), (6, 0.051), (7, 0.028), (12, 0.028), (15, 0.026), (28, 0.122), (57, 0.044), (59, 0.023), (63, 0.017), (69, 0.413), (71, 0.012), (77, 0.044), (78, 0.023), (83, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78178912 7 nips-2008-A computational model of hippocampal function in trace conditioning

Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

2 0.51971567 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

Author: Guangzhi Cao, Charles Bouman

Abstract: Covariance estimation for high dimensional vectors is a classically difﬁcult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More speciﬁcally, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efﬁciently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efﬁciently computed using a cross-validation procedure. The resulting estimator is positive deﬁnite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. 1

3 0.49632198 40 nips-2008-Bounds on marginal probability distributions

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal (“belief”). Thus, apart from providing a practical means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis. 1

4 0.34753373 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework

Author: Matthew Botvinick, James An

Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection. 1 G oal- d irect ed act i on cont rol In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goaldirected behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no ‘off-theshelf’ algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilisitic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection. 2 Co m p u t a t i o n a l m o d el As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode taskspecific mappings from stimuli to responses (e.g., [16]): “task representations,” in the language of psychology, or “policies” in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the enviromental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference. 2.1 A rc h i t e c t u re The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a) representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables ( ), each repre-senting the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further Fig 1. Left: Single-step decision. Right: Sequential decision. below, capture the role of OFC in Each time-slice includes a set of m policy nodes. representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon. The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e, p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u): p (u si ) = 1 R ( si ) +1 , rmax 2 rmax max j R ( s j ) (1) In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable (u G). Like u, this 1 is a binary random variable, but associated with a posterior probability determined as: p (uG ) = 1 N p(u i ) (2) i where N is the number of u nodes. The network as whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u G , as explained next. 2.2 P o l i c y s e l e c t i o n b y p ro b a b i l i s t i c i n f e re n c e : a n i t e r a t i v e a l g o r i t h m Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic 2 priors. (2) Treating the initial state and u G as observed variables (u G = 1), use standard belief 1 Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2. 2 In the single-action situation, where there is only one u node, it is this variable that is treated as observed (u = 1). propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm. 2.2.1 Monotonicity We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define * as follows: ( * p uG , + ) > p (u + , G ), * (3) where + is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that: p ( t* ) > p ( t* 1 ) (4) where t indexes processing iterations. The dynamics of the network entail that p( ) = p( t t 1 uG ) (5) where represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives p t* 1 uG > p ( t* 1 ) (6) ( ) From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields p (uG * p (uG ) p( ) > p * )p ( ) ( ) * (7) Canceling, and bringing the denominator up, this becomes p (uG * )> p (uG ) p( ) (8) Rewriting the left hand side, we obtain p ( uG * ) p( ) > p (uG ) p( ) (9) Subtracting and further rearranging: p (uG p (uG * ) p ( uG * * * ) p (uG ) p( ) + p (uG * * ) * p ( uG ) p( ) > 0 p (uG * ) p (uG ) p( ) > 0 (10) ) p( ) > 0 (11) (12) Note that this last inequality (12) follows from the definition of *. Remark: Of course, the identity of * depends on +. In particular, the policy * will only be part of a globally optimal plan if the set of choices + is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, + is empty. Thus * at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that * will continuously rise. Once it reaches a maximum, * at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction. 2.2.2 Convergence Continuing with the same notation, we show now that pt ( limt uG ) = 1 * (13) Note that, if we apply Bayes’ law recursively, pt ( uG ) = ( ) p ( ) = p (u p uG t ) G pi (uG ) 2 pt pi (uG ) pt 1 ( ( )= ) p uG 1 ( uG ) pt (uG ) pt 3 pt 2 ( ) 1 ( u G ) pt 2 ( u G ) … (14) Thus, p1 ( uG ) = ( p uG ) p ( ), p ( p (u ) 1 uG ) = 2 1 G 2 ( ) p ( ), p uG 1 p2 (uG ) p1 (uG ) p3 ( 3 ( ) p( ) p uG uG ) = 1 p3 (uG ) p2 (uG ) p1 (uG ) , (15) and so forth. Thus, what we wish to prove is ( * p uG ) p ( ) =1 * 1 (16) pt (uG ) t =1 or, rearranging, pt (uG ) ( = p1 ( ) p uG t =1 (17) ). Note that, given the stipulated relationship between p( ) on each processing iteration and p( | uG) on the previous iteration, p (uG pt (uG ) = )p ( ) = p ( uG = pt 1 )p ( p (uG t uG ) = t 1 3 )p ( ) 4 p (uG t 1 = (uG ) pt 2 (uG ) pt 1 ) pt 2 p (uG )p ( ) t 1 pt 1 1 ( ) (uG ) pt 2 (uG ) pt 3 (uG ) ( uG ) (18) … With this in mind, we can rewrite the left hand side product in (17) as follows: p ( uG p1 (uG ) ( p uG ) p (u G 2 )p( ) ) p (u 1 G ) 3 p (uG 1 ( p uG )p( ) ) p (u 1 G 4 p (uG 1 ( ) p2 (uG ) p uG ) p (u 1 ) p( ) 1 G ) p2 (uG ) p3 (uG ) … (19) Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only p(uG| *) in that denominator. The expression can thus be rewritten as 1 ( p uG 1 ) p (u G ) p (u G 4 p (uG 1 ) ) p( ) 1 ( p uG ) … = p (uG ( p uG ) ) p1 ( ). (20) The objective is then to show that the above equals p( *). It proceeds directly from the definition of * that, for all other than *, p ( uG ( p uG ) ) <1 (21) Thus, all but one of the terms in the sum above approach zero, and the remaining term equals p1( *). Thus, p (uG ( p uG ) ) p1 ( ) = p1 ( ) (22) 3 Simulations 3.1 Binary choice We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges. A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B). Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C. Fig 2. Simulation results, binary choice. 3.2 Stochastic outcomes A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A ‘left’ coin yields $1 for heads, $0 for tails; a ‘right’ coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin. Fig 3. Simulation results, two-step sequential choice. 3.3 Sequential decision Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S 2 (R(S 2 ) = -2) by making appropriate changes p(u|s). The resulting behavior is shown in Fig. 3D. A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5 , Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding 3 transition and reward functions in a graphical model of the form shown in Fig. 1 (right), representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4. Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B. Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s12 , random choices occurred at each intersection. However, setting R(s13 ) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5. 4 Fig 5. Latent learning. Rel a t i o n t o p revi o u s work Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10]. As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations. 3 In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1 . Acknowledgments Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments. R e f e re n c e s [1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century. [2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century. [3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78. [4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419. [5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711. [6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129. [7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174. [8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379. [9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain enviroments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006. [10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA. [11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003. [12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in artificial intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann. [13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291. [14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA. [15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202. [16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459. [17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29. [18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226. [19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31. [20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168. [21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697. [22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag. [23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis. [24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381. [25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.

5 0.34662113 195 nips-2008-Regularized Policy Iteration

Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári

Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a ﬂexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efﬁcient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide ﬁnite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1

6 0.34514087 231 nips-2008-Temporal Dynamics of Cognitive Control

7 0.34437737 29 nips-2008-Automatic online tuning for fast Gaussian summation

8 0.34259376 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

9 0.34255978 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

10 0.34201756 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

11 0.34198895 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

12 0.34071288 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

13 0.34042943 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

14 0.33977103 216 nips-2008-Sparse probabilistic projections

15 0.33928171 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

16 0.33875424 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

17 0.33874276 24 nips-2008-An improved estimator of Variance Explained in the presence of noise

18 0.33857819 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

19 0.33839723 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

20 0.33839244 202 nips-2008-Robust Regression and Lasso