
199 nips-2002: Timing and Partial Observability in the Dopamine System


Source: pdf

Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky

Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
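The abstract's central idea, replacing a tapped delay line with inference over hidden states that a TD rule then maps to reward predictions, can be illustrated with a small sketch. The code below is not the authors' implementation: it is a simplified per-time-step (fully Markov) approximation that omits the paper's semi-Markov treatment of dwell times, and all names (N_STATES, T, O, filter_step, td_step) and parameter values are hypothetical choices for illustration.

```python
# Hedged sketch: TD(0) value learning over an inferred belief state,
# rather than over a tapped-delay-line stimulus representation.
import numpy as np

N_STATES = 3   # e.g. inter-trial interval, cue, reward (illustrative)
ALPHA = 0.1    # TD learning rate (assumed)
GAMMA = 0.98   # per-time-step discount factor (assumed)

# Assumed generative model: hidden-state transition matrix T and
# observation matrix O (rows index hidden states, columns index
# discrete observations 0=blank, 1=cue, 2=reward).
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
O = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9]])

V = np.zeros(N_STATES)                 # learned value of each hidden state
belief = np.ones(N_STATES) / N_STATES  # uniform initial belief

def filter_step(belief, obs):
    """Bayesian belief update: predict with T, correct with O[:, obs]."""
    predicted = T.T @ belief
    posterior = predicted * O[:, obs]
    return posterior / posterior.sum()

def td_step(V, belief, obs, reward):
    """One TD(0) update on hidden-state values, weighted by belief."""
    new_belief = filter_step(belief, obs)
    # Prediction error: the dopamine-like signal in this model class.
    delta = reward + GAMMA * V @ new_belief - V @ belief
    V += ALPHA * delta * belief        # credit states in proportion to belief
    return new_belief, delta

# Example: a "cue" observation followed by a rewarded "reward" observation.
belief, _ = td_step(V, belief, obs=1, reward=0.0)
belief, delta = td_step(V, belief, obs=2, reward=1.0)
```

In this sketch the TD error delta plays the role the abstract assigns to phasic dopamine; the paper's semi-Markov formalism additionally infers how long the process has dwelt in a state, which is what lets it handle temporal variability.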


Reference text

[1] JC Houk, JL Adams, and AG Barto. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In JC Houk, JL Davis, and DG Beiser, editors, Models of Information Processing in the Basal Ganglia, pages 249–270. MIT Press, 1995.

[2] PR Montague, P Dayan, and TJ Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16:1936–1947, 1996.

[3] W Schultz, P Dayan, and PR Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.

[4] RE Suri and W Schultz. A neural network with dopamine-like reinforcement signal that learns a spatial delayed response task. Neurosci, 91:871–890, 1999.

[5] ND Daw and DS Touretzky. Long-term reward prediction in TD models of the dopamine system. Neural Comp, 14:2567–2583, 2002.

[6] RS Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.

[7] W Schultz. Predictive reward signal of dopamine neurons. J Neurophys, 80:1–27, 1998.

[8] RS Sutton and AG Barto. Time-derivative models of Pavlovian reinforcement. In M Gabriel and J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 497–537. MIT Press, 1990.

[9] JR Hollerman and W Schultz. Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci, 1:304–309, 1998.

[10] DS Touretzky, ND Daw, and EJ Tira-Thompson. Combining configural and TD learning on a robot. In ICDL 2, pages 47–52. IEEE Computer Society, 2002.

[11] CD Fiorillo and W Schultz. The reward responses of dopamine neurons persist when prediction of reward is probabilistic with respect to time or occurrence. In Soc. Neurosci. Abstracts, volume 27: 827.5, 2001.

[12] LP Kaelbling, ML Littman, and AR Cassandra. Planning and acting in partially observable stochastic domains. Artif Intell, 101:99–134, 1998.

[13] SJ Bradtke and MO Duff. Reinforcement learning methods for continuous-time Markov Decision Problems. In NIPS 7, pages 393–400. MIT Press, 1995.

[14] L Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI 10, pages 183–188, 1992.

[15] AC Courville and DS Touretzky. Modeling temporal structure in classical conditioning. In NIPS 14, pages 3–10. MIT Press, 2001.

[16] S Kakade and P Dayan. Acquisition in autoshaping. In NIPS 12, pages 24–30. MIT Press, 2000.

[17] Y Guédon and C Cocozza-Thivent. Explicit state occupancy modeling by hidden semi-Markov models: Application of Derin's scheme. Comp Speech and Lang, 4:167–192, 1990.

[18] CR Gallistel and J Gibbon. Time, rate and conditioning. Psych Rev, 107(2):289–344, 2000.

[19] RE Suri. Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model. Exp Brain Research, 140:234–240, 2001.

[20] P Dayan. Motivated reinforcement learning. In NIPS 14, pages 11–18. MIT Press, 2001.