
215 nips-2011-Policy Gradient Coagent Networks


Source: pdf

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods.
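To ground the update rule the abstract describes, here is a minimal sketch of one module ("coagent"): it samples its output from a local parameterized policy, then moves its parameters along the likelihood-ratio (REINFORCE-style, cf. [27]) gradient of that policy, scaled by the TD error broadcast by the critic. The Bernoulli-logistic policy, the class and variable names, and the fixed step size below are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    class Coagent:
        # One module: a stochastic Bernoulli-logistic policy over its own
        # input. Illustrative sketch only; structure and names are assumptions.
        def __init__(self, n_inputs, alpha=0.01, seed=None):
            self.theta = np.zeros(n_inputs)        # local policy parameters
            self.alpha = alpha                     # local step size
            self.rng = np.random.default_rng(seed)

        def act(self, x):
            # Sample output a ~ pi(. | x; theta) and cache the score function
            # d/dtheta log pi(a | x; theta) -- built only from the module's
            # own input and output.
            p = 1.0 / (1.0 + np.exp(-self.theta @ x))
            a = float(self.rng.random() < p)
            self.score = (a - p) * x
            return a

        def update(self, delta):
            # Local policy-gradient step: uses only the cached score and the
            # scalar TD error delta broadcast by the critic.
            self.theta += self.alpha * delta * self.score

In a full network, each module's output would feed the inputs of downstream modules, while the critic maintains a value estimate (e.g., via TD learning [21, 23]) and broadcasts the single scalar delta = r + gamma * V(s') - V(s) to every module after each step; no backpropagated error signal is required.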


reference text

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] S. Amari and S. Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’98), volume 2, pages 1213–1216, 1998.

[3] A. G. Barto. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229–256, 1985.

[4] A. G. Barto. Adaptive critics and the basal ganglia. Models of Information Processing in the Basal Ganglia, pages 215–232, 1995.

[5] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

[6] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Technical Report TR09-10, University of Alberta Department of Computing Science, June 2009.

[7] A. Claridge-Chang, R. Roorda, E. Vrontou, L. Sjulson, H. Li, J. Hirsh, and G. Miesenböck. Writing memories with light-addressable reinforcement circuitry. Cell, 139(2):405–415, 2009.

[8] F. H. C. Crick. The recent excitement about neural networks. Nature, 337:129–132, 1989.

[9] N. Daw and K. Doya. The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16:199–204, 2006.

[10] K. Doya. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12:961–974, 1999.

[11] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

[12] M. J. Frank and E. D. Claus. Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113(2):300–326, 2006.

[13] E. Ludvig, R. Sutton, and E. Kehoe. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20:3034–3054, 2008.

[14] R. C. O’Reilly. The LEABRA model of neural interactions and learning in the neocortex. PhD thesis, Carnegie Mellon University, 1996.

[15] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71:1180–1190, 2008.

[16] F. Rivest, Y. Bengio, and J. Kalaska. Brain inspired reinforcement learning. In Advances in Neural Information Processing Systems, pages 1129–1136, 2005.

[17] D. E. Rumelhart and J. L. McClelland. Parallel distributed processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.

[18] T. W. Sandholm and R. H. Crites. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems, 37:147–166, 1996.

[19] W. Schultz, P. Dayan, and P. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.

[20] A. Stocco, C. Lebiere, and J. Anderson. Conditional routing of information to the cortex: A model of the basal ganglia’s role in cognitive coordination. Psychological Review, 117(2):541–574, 2010.

[21] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[22] R. Sutton and A. Barto. Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88:135–170, 1981.

[23] R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, MA, 1998.

[24] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.

[25] P. Thomas and A. Barto. Conjugate Markov decision processes. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.

[26] R. J. Williams. A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, 1987.

[27] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[28] D. Zipser and R. A. Andersen. A back propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331:679–684, 1988.