
NIPS 2008, Paper 223: Structure Learning in Human Sequential Decision-Making


Source: pdf

Author: Daniel Acuna, Paul R. Schrater

Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision-making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one- and two-armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure.
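The abstract's approach can be illustrated with a minimal sketch (not the authors' code; the class names and the two candidate structures below are hypothetical). In a two-armed Bernoulli bandit, a learner keeps Beta posteriors over arm reward probabilities under two candidate reward structures, an "independent" structure (each arm has its own parameter) and a "coupled" structure (one arm's success probability is one minus the other's), weighs the structures by their marginal likelihoods, and acts on the model-averaged expected reward. The greedy, epsilon-exploring action rule here is a myopic stand-in for the full Bayesian Reinforcement Learning solution the paper describes.

    # Minimal sketch: Bayesian structure learning in a two-armed
    # Bernoulli bandit, with model averaging over two hypothetical
    # candidate reward structures.
    import math
    import random

    def log_beta(a, b):
        """Log of the Beta function B(a, b)."""
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    class StructureLearner:
        def __init__(self):
            # Successes/failures observed per arm; uniform Beta(1,1) priors.
            self.s = [0, 0]
            self.f = [0, 0]

        def log_marginals(self):
            # Independent structure: product of per-arm Beta marginals.
            ind = sum(log_beta(1 + self.s[i], 1 + self.f[i]) for i in (0, 1))
            # Coupled structure (p2 = 1 - p1): a success on arm 0 and a
            # failure on arm 1 are both draws with probability p1.
            a = self.s[0] + self.f[1]
            b = self.f[0] + self.s[1]
            return {"independent": ind, "coupled": log_beta(1 + a, 1 + b)}

        def structure_posterior(self):
            # Equal structure priors; normalize in a numerically stable way.
            lm = self.log_marginals()
            mx = max(lm.values())
            w = {m: math.exp(v - mx) for m, v in lm.items()}
            z = sum(w.values())
            return {m: v / z for m, v in w.items()}

        def expected_reward(self, arm):
            # Model-averaged posterior-mean reward probability for the arm.
            post = self.structure_posterior()
            p_ind = (1 + self.s[arm]) / (2 + self.s[arm] + self.f[arm])
            a = self.s[0] + self.f[1]
            b = self.f[0] + self.s[1]
            p1 = (1 + a) / (2 + a + b)
            p_cpl = p1 if arm == 0 else 1 - p1
            return post["independent"] * p_ind + post["coupled"] * p_cpl

        def choose(self, eps=0.1):
            # Epsilon-greedy action selection (myopic; the paper solves
            # the full Bayesian RL action-selection problem instead).
            if random.random() < eps:
                return random.randrange(2)
            return max((0, 1), key=self.expected_reward)

        def update(self, arm, reward):
            (self.s if reward else self.f)[arm] += 1

    # Example run in a coupled environment (p1 = 0.7, p2 = 0.3).
    learner = StructureLearner()
    for t in range(200):
        arm = learner.choose()
        reward = random.random() < (0.7 if arm == 0 else 0.3)
        learner.update(arm, reward)
    print(learner.structure_posterior())

Run in a coupled environment as above, the structure posterior shifts toward "coupled" as complementary evidence accumulates across the two arms, mirroring the abstract's claim that reward structure can be learned and exploited quickly.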


References

[1] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.

[2] Richard Ernest Bellman. Dynamic programming. Princeton University Press, Princeton, 1957.

[3] Noah Gans, George Knox, and Rachel Croson. Simple models of discrete choice and their performance in bandit experiments. Manufacturing and Service Operations Management, 9(4):383–408, 2007.

[4] C. M. Anderson. Behavioral Models of Strategies in Multi-Armed Bandit Problems. PhD thesis, California Institute of Technology, Pasadena, CA, 2001.

[5] Jeffrey Banks, David Porter, and Mark Olson. An experimental analysis of the bandit problem. Economic Theory, 10(1):55–77, 1997.

[6] R. J. Meyer and Y. Shi. Sequential choice under ambiguity: Intuitive solutions to the armed-bandit problem. Management Science, 41:817–834, 1995.

[7] N. Vulkan. An economist's perspective on probability matching. Journal of Economic Surveys, 14:101–118, 2000.

[8] Yvonne Brackbill and Anthony Bravos. Supplementary report: The utility of correctly predicting infrequent events. Journal of Experimental Psychology, 64(6):648–649, 1962.

[9] W. Edwards. Probability learning in 1000 trials. Journal of Experimental Psychology, 62:385–394, 1961.

[10] W. Edwards. Reward probability, amount, and information as determiners of sequential two-alternative decisions. Journal of Experimental Psychology, 52(3):177–188, 1956.

[11] E. Fantino and A. Esfandiari. Probability matching: Encouraging optimal responding in humans. Canadian Journal of Experimental Psychology, 56:58–63, 2002.

[12] Timothy E. J. Behrens, Mark W. Woolrich, Mark E. Walton, and Matthew F. S. Rushworth. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9):1214–1221, 2007.

[13] N. D. Daw, J. P. O’Doherty, P. Dayan, B. Seymour, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879, 2006.

[14] J. S. Banks and R. K. Sundaram. A class of bandit problems yielding myopic optimal strategies. Journal of Applied Probability, 29(3):625–632, 1992.

[15] John Gittins and You-Gan Wang. The learning component of dynamic allocation indices. The Annals of Statistics, 20(2):1626–1636, 1992.

[16] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241–266, 1974.

[17] Joshua B. Tenenbaum, Thomas L. Griffiths, and Charles Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309–318, 2006.

[18] Joshua B. Tenenbaum and Thomas L. Griffiths. Structure learning in human causal induction. In Advances in Neural Information Processing Systems 13, pages 59–65, 2000.

[19] A. C. Courville, N. D. Daw, G. J. Gordon, and D. S. Touretzky. Model uncertainty in classical conditioning. In Advances in Neural Information Processing Systems 16, pages 977–986, 2004.

[20] Daniel Acuna and Paul Schrater. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In Proceedings of the Annual Conference of the Cognitive Science Society, 2008.

[21] Michael D. Lee. A hierarchical Bayesian model of human decision-making on an optimal stopping problem. Cognitive Science: A Multidisciplinary Journal, 30:1–26, 2006.

[22] Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88(4):848–881, 1998.