nips nips2008 nips2008-187 knowledge-graph by maker-knowledge-mining
Source: pdf
Title: Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has led to major advances in its treatment and in its pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure, avoiding any verbal reports.

There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective 'talking' therapies for MDD, such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005), explicitly concentrate on altering patients' maladaptive behaviors and decision making processes.

Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly to those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008).

* qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/~qhuys/pub.html

Figure 1 (panels: Master, Yoked, Control): The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times.
The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e. its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable: there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, whether determined by reward or, as here, by punishment contingencies.

This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH; Maier and Seligman, 1976; Maier and Watkins, 2005). We formulate helplessness parametrically as prior beliefs about aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects' behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects' behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification.

1 Core concepts: helplessness and anhedonia

The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivating framework for the development of the main talking therapies for depression (cognitive behavioural therapy; Williams, 1992); it has motivated the development of further, yet more specific animal models (Willner, 1997); and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993).

Behavioral control is the central concept in LH: the yoked and the master rat do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. Such control is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple: if all available actions deterministically led to the same outcome, the agent would have very little control, even though every outcome distribution would have zero entropy. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats' case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency.
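To make this simplest notion concrete, here is a minimal numpy sketch (our own illustration with invented counts; none of it comes from the paper) that scores each action by the entropy of its empirical outcome distribution, so that lower entropy corresponds to more "control" in this deliberately over-simple sense:

    import numpy as np

    def outcome_entropy(counts):
        """Entropy (in nats) of the empirical outcome distribution of one action."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]  # ignore outcomes that were never observed
        return float(-(p * np.log(p)).sum())

    # Invented counts of outcomes following each action (one list per action).
    # A "master"-like action reliably produces one outcome; a "yoked"-like one does not.
    counts = {
        "controllable_action":   [9, 1, 0, 0],   # almost always the same outcome
        "uncontrollable_action": [3, 2, 3, 2],   # outcomes spread widely
    }

    for action, c in counts.items():
        print(action, round(outcome_entropy(c), 3))
    # Low entropy would count as high control under this (admittedly too simple) definition.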
[Figure 2: two panels. Left, "Predictive Distributions": P(reward | a_known) over rewards 1-5. Right, "Exploration vs Exploitation": Q(a_known) − Q(a_unknown) against tree depth 1-5, for high and low control.]

Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice and reward 4 three times (inset in left panel). Left: predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all its mass on the observations. Right: assume now that the agent gets up to 5 more pulls (tree depth 1-5) to allocate between the blue slot machine and a new, orange slot machine. The orange slot machine's predictive distribution is flat, as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First, consider the case in which the agent has only one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note, though, that the difference is more pronounced if the agent has a high-control prior. But things change if the agent has two or more choices left. Now it is worth trying out the new machine if the agent has a high-control prior: for in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again on the second and subsequent tries. Thus, the prior about control determines the exploration bonus.
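As a rough numerical illustration of the scenario in this figure (our own sketch, not the paper's computation: a symmetric pseudo-count alpha stands in for the control prior, with small alpha playing the role of high control, whereas the paper's γ is parametrised differently; the tree is approximated by one step of lookahead followed by greedy exploitation):

    import numpy as np

    rewards = np.arange(1, 6)                   # possible payouts 1..5
    blue_counts = np.array([0, 2, 0, 3, 0])     # reward 2 seen twice, reward 4 three times

    def predictive(counts, alpha):
        """Posterior predictive over payouts with a symmetric pseudo-count alpha."""
        p = counts + alpha
        return p / p.sum()

    def mean_reward(counts, alpha):
        return float(predictive(counts, alpha) @ rewards)

    def value_difference(depth, alpha):
        """Value of first trying the untried (orange) machine minus the value of
        sticking with blue for `depth` pulls, using one step of lookahead and
        greedy exploitation thereafter."""
        m_blue = mean_reward(blue_counts, alpha)
        stick = depth * m_blue
        p_orange = predictive(np.zeros(5), alpha)   # never tried: flat predictive
        explore = 0.0
        for r, pr in zip(rewards, p_orange):
            post = np.zeros(5)
            post[r - 1] = 1.0                       # suppose the first orange pull pays r
            m_orange = mean_reward(post, alpha)
            explore += pr * (r + (depth - 1) * max(m_blue, m_orange))
        return explore - stick

    for alpha in (0.1, 10.0):                       # small alpha ~ high control
        print(alpha, [round(value_difference(d, alpha), 2) for d in range(1, 6)])

With one pull left the difference is negative for both settings (the known machine wins), but with two or more pulls it becomes clearly positive only under the high-control setting, consistent with the exploration bonus described in the caption.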
The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of the two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, the results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994) and Henriques and Davidson (2000) show decreased responsiveness of MDD subjects to rewards, but equally show decreased responsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and showed that the rate of change of the response bias is anticorrelated with subjects' anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at present not clear.

Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy).

2 A Bayesian formulation of control

We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy and, if used as a prior, allows for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep a count N_t(r, a) = Σ_{k: k<t} [...]

[...] < 0. Here, we included a regressor for AGE, as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ε (from equation 5),
ε(BDIa, AGE) = θε BDIa + cε AGE + ζε,
but find that θε is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa scores were just less attentive to the task, one might also expect an effect of BDIa on learning rate.

3.2 Control

Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to the main features of the task. We thus fit both a RW learning rule (as in the previous section, but adjusted for the varying number of available actions) and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy and evaluates a tree. The chance probability of subjects' actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW learning rule was better, at 0.48, and that of the full control model better still, at 0.54. These differences are highly significant, as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree.

Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer the linear parameters:
γ1(BDIa, BHS, AGE) = χγ1 BHS + θγ1 BDIa + cγ1 AGE + ζγ1
γ2(BDIa, BHS, AGE) = χγ2 BHS + θγ2 BDIa + cγ2 AGE + ζγ2
β(BDIa, BHS, AGE) = χβ BHS + θβ BDIa + cβ AGE + ζβ
Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θγ1, θγ2, χβ), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ.
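The linear coefficients above are inferred by maximum likelihood inside the behavioural model; purely to show the structure of these regressions, the following stand-in sketch (invented numbers, ordinary least squares instead of the paper's procedure) fits the same linear form:

    import numpy as np

    # Hypothetical per-subject quantities: parameters "inferred" from behaviour and
    # questionnaire scores (BDIa = anhedonia score, BHS = hopelessness, AGE in years).
    # All numbers below are made up purely to illustrate the regression structure.
    rng = np.random.default_rng(0)
    n = 40
    BDIa = rng.uniform(0, 20, n)
    BHS = rng.uniform(0, 15, n)
    AGE = rng.uniform(20, 60, n)
    gamma2 = -0.10 * BHS + 0.01 * AGE + rng.normal(0, 0.3, n)      # fake prior-on-control values
    beta = 8.0 - 0.30 * BDIa + 0.02 * AGE + rng.normal(0, 1.0, n)  # fake reward-sensitivity values

    def fit_linear(y, regressors):
        """Least-squares fit of y = X w (+ intercept); returns the weight vector."""
        X = np.column_stack(regressors + [np.ones_like(y)])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    # gamma2(BDIa, BHS, AGE) = chi*BHS + theta*BDIa + c*AGE + zeta
    print("gamma2:", fit_linear(gamma2, [BHS, BDIa, AGE]).round(2))
    # beta(BDIa, BHS, AGE)   = chi*BHS + theta*BDIa + c*AGE + zeta
    print("beta:  ", fit_linear(beta, [BHS, BDIa, AGE]).round(2))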
Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses; the axes are reward sensitivity β (abscissa) and control γ (ordinate). The blue line is a linear classifier (83% correct, 69% sensitivity, 94% specificity). Thus, the patients and controls can be approximately classified purely on the basis of behaviour.

We here infer and display two separate values γ1 and γ2. These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control, just like the yoked rats in the original experiment. In the second half of the experiment, on the other hand, the slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimates are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half of the experiment than in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals.

3.3 Classification

Finally, we combine the two tasks. We integrate out the learning rate ε, which we had found not to be related to the questionnaire measures (cf. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them.
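As an illustration of the kind of computation behind figure 6 (not the paper's classifier or data: the subject values below are invented, and Fisher's linear discriminant is used simply because it is an easy linear classifier to write down):

    import numpy as np

    # Hypothetical per-subject posterior values (reward sensitivity beta, control gamma);
    # label 0 = healthy control, 1 = MDD. All numbers are invented for illustration.
    rng = np.random.default_rng(1)
    healthy = np.column_stack([rng.normal(10.0, 2.0, 25), rng.normal(1.5, 0.6, 25)])
    mdd = np.column_stack([rng.normal(7.0, 2.0, 25), rng.normal(0.5, 0.6, 25)])
    X = np.vstack([healthy, mdd])
    y = np.array([0] * 25 + [1] * 25)

    # Fisher's linear discriminant: project onto w = Sw^{-1} (mu1 - mu0), threshold midway.
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = w @ (mu0 + mu1) / 2
    pred = (X @ w > threshold).astype(int)

    accuracy = (pred == y).mean()
    sensitivity = pred[y == 1].mean()        # fraction of MDD subjects detected
    specificity = 1 - pred[y == 0].mean()    # fraction of controls correctly rejected
    print(f"accuracy {accuracy:.2f}, sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")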
4 Discussion

In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have effects in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and have shown that people's verbal expressions of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards, and those expressing hopelessness behave as if they had prior beliefs that the outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of subjects' psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance.

Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior; c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks; d) combine tasks to tease helplessness and anhedonia apart; and e) use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to, say, DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies.

Several important caveats do remain. First, the populations are not fully matched for age. We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments: we have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging.

Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw, Hanneke E. M. den Ouden and René Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH), and a Columbia University startup grant to Kenneth Miller.

References

American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press.
Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302.
Cohen, J. D., Braver, T. S., and O'Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527.
Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879.
Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4.
Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335–350.
DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013.
First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute.
First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition (SCID-I/P). Biometrics Research, New York State Psychiatric Institute.
Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press.
Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24.
Henriques, J. B., Glowacki, J. M., and Davidson, R. J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6.
Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London.
Huys, Q. J. M. and Dayan, P. (2007). A Bayesian formulation of behavioral control. Under review.
Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76.
Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am J Psychiatry, 156:837–41.
Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46.
Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci Biobehav Rev, 29(4-5):829–41.
Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16(5):1936–47.
Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160.
Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A Theory for the Age of Personal Control. OUP, Oxford, UK.
Pizzagalli, D. A., Jahn, A. L., and O'Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327.
Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition.
Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharmacology, 29(6):1040–9.
Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br J Psychiatry, 174:525–9.
Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J Clin Exp Neuropsychol, 19:587–603.
Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attention-deficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9.
Williams, J. M. G. (1992). The Psychological Treatment of Depression. Routledge.
Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharmacology, 134:319–29.
Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692.
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports. There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective ’talking’ therapies for MDD such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005) explicitly concentrate on altering patients’ maladaptive behaviors and decision making processes. Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008). ∗ qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/∼qhuys/pub.html 1 Master Yoked Control Figure 1: The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times. 
The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable, there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, either determined by reward or, as here, by punishment contingencies. This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH, Maier and Seligman 1976; Maier and Watkins 2005). We formulate helplessness parametrically as prior beliefs on aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects’ behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects’ behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification. 1 Core concepts: helplessness and anhedonia The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivation framework for the development of the main talking therapies for depression (cognitive behavioural therapy, Williams (1992), it has motivated the development of further, yet more specific animal models (Willner, 1997), and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993). Behavioral control is the central concept in LH: yoked and master rat do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. It is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple. If all available actions deterministically led to the same outcome, the agent has very little control. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats’ case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency. 
2 0.6 0.5 Exploration vs Exploitation Predictive Distributions Q(aknown)−Q(aunknown) P(reward a known ) 0.7 2 0 1 2 3 4 5 0.4 0.3 0.2 Choose blue slot machine 0.5 0 −0.5 0.1 0 1 1 2 3 4 5 Reward −1 Choose orange slot machine 1 High control Low control 2 3 4 5 Tree depth Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice, and reward 4 three times (inset in left panel). Left: Predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all mass on the observations. Right: Assume now the agent gets up to 5 more pulls (tree depth 1-5) between the blue slot machine and a new, orange slot machine. The orange slot machine’s predictive distribution is flat as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First consider the agent only has one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note though that the difference is more pronounced if the agent has a high control prior. But things change if the agent has two or more choices. Now, it is worth trying out the new machine if the agent has a high-control prior. For in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again for the second and subsequent times. Thus, the prior about control determines the exploration bonus. The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, these results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994); Henriques and Davidson (2000) show decreased resonsiveness of MDD subjects to rewards, but equally show decreased resonsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and show that the rate of change of the response bias is anticorrelated with subjects’ anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at pressent not clear. Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy). 2 A Bayesian formulation of control We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. 
This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy, and, if used as a prior, allow for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep count Nt (r, a) = 3 k:k < 0. Here, we included a regressor for the AGE as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ǫ (from equation 5) ǫ(BDIa, AGE) = θǫ BDIa + cǫ AGE + ζǫ but find that θǫ is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa score were just less attentive to the task, one might also expect an effect of BDIa on learning rate. 3.2 Control Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to main features of the task. We thus fit both a RW-learning rule (as in the previous section, but adjusted for the varying number of available actions), and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy, and evaluations a tree. The chance probability of subjects’ actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW-learning rule was better at 0.48, and that of the full control model 0.54. These differences are highly significant as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree. Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer linear parameters: γ1 (BDIa, BHS, age) = χγ1 BHS + θγ1 BDIa + cγ1 AGE + ζγ1 γ2 (BDIa, BHS, age) = χγ2 BHS + θγ2 BDIa + cγ2 AGE + ζγ2 β(BDIa, BHS, age) = χβ BHS + θβ BDIa + cβ AGE + ζβ Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θγ1 , θγ2 , χγ ), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ. 6 3 control γ 2 Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses. The blue line is a linear classifier. Thus, the patients and controls can be approximately classified purely on the basis of behaviour. 1 0 83% correct 69% sensitivity 94% specificity −1 −2 2 4 6 8 10 12 14 16 reward sensitivity β We here infer and display two separate values γ1 and γ2 . 
These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control just like the yoked rats in the original experiment. In the second half of the experiment on the other hand, slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimate are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half than of the experiment in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals. 3.3 Classification Finally we combine the two tasks. We integrate out the learning rate ǫ, which we had found not be related to the questionnaire measures (c.f. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them. 4 Discussion In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have effect in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and shown that people’s verbal expression of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards and those expressing hopelessness behave as if they had prior beliefs that outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of their psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance. Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks d) combine tasks to tease helplessness and anhedonia apart and e) to use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to say DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies. 7 Several important caveats do remain. First, the populations are not fully matched for age. 
We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments. We have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging.

Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw, Hanneke E. M. den Ouden and René Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH) and a Columbia University startup grant to Kenneth Miller.

References

American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press.
Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302.
Cohen, J. D., Braver, T. S., and O'Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527.
Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879.
Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4.
Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335–350.
DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013.
First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute.
First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition (SCID-I/P). Biometrics Research, New York State Psychiatric Institute.
Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press.
Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24.
Henriques, J. B., Glowacki, J. M., and Davidson, R. J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6.
Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London.
Huys, Q. J. M. and Dayan, P. (2007). A Bayesian formulation of behavioral control. Under Review, 0:00.
Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76.
Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am. J. Psychiatry, 156:837–41.
Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46.
Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci. Biobehav. Rev., 29(4-5):829–41.
Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16(5):1936–47.
Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160.
Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A theory for the age of personal control. OUP, Oxford, UK.
Pizzagalli, D. A., Jahn, A. L., and O'Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327.
Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition.
Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharm., 29(6):1040–9.
Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br. J. Psychiatry, 174:525–9.
Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J. Clin. Exp. Neuropsychol., 19:587–603.
Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attention-deficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9.
Williams, J. M. G. (1992). The psychological treatment of depression. Routledge.
Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharm, 134:319–29.
Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692.
2 0.13711327 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
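The structure-learning idea can be made concrete with a toy calculation (not the authors' model; a hypothetical two-armed Bernoulli bandit in which the learner compares the marginal likelihood of two candidate reward structures):

    import numpy as np

    # Compare "independent arms" against "anticorrelated arms" (p2 = 1 - p1),
    # integrating the unknown reward probability over a uniform grid.
    p = np.linspace(0.01, 0.99, 99)

    def marginal_indep(wins, losses):
        m = 1.0
        for w, l in zip(wins, losses):          # each arm integrated separately
            m *= np.mean(p**w * (1 - p)**l)
        return m

    def marginal_coupled(wins, losses):
        q = p                                    # one shared parameter: arm 1 -> q, arm 2 -> 1-q
        lik = q**wins[0] * (1 - q)**losses[0] * (1 - q)**wins[1] * q**losses[1]
        return np.mean(lik)

    wins, losses = np.array([8, 1]), np.array([2, 9])   # made-up counts from 20 pulls
    m_i, m_c = marginal_indep(wins, losses), marginal_coupled(wins, losses)
    print("P(coupled | data) =", m_c / (m_i + m_c))      # assuming equal structure priors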
3 0.088803716 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
Author: Carmen Sandi, Wulfram Gerstner, Gediminas Lukšys
Abstract: Suppose we train an animal in a conditioning experiment. Can one predict how a given animal, under given experimental conditions, would perform the task? Since various factors such as stress, motivation, genetic background, and previous errors in task performance can influence animal behaviour, this appears to be a very challenging aim. Reinforcement learning (RL) models have been successful in modeling animal (and human) behaviour, but their success has been limited because of uncertainty as to how to set meta-parameters (such as learning rate, exploitation-exploration balance and future reward discount factor) that strongly influence model performance. We show that a simple RL model whose metaparameters are controlled by an artificial neural network, fed with inputs such as stress, affective phenotype, previous task performance, and even neuromodulatory manipulations, can successfully predict mouse behaviour in the ”hole-box” - a simple conditioning task. Our results also provide important insights on how stress and anxiety affect animal learning, performance accuracy, and discounting of future rewards, and on how noradrenergic systems can interact with these processes. 1
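As a toy illustration of meta-parameters being set by state variables rather than fixed by hand (the mapping below is an invented linear-sigmoid, not the fitted network from the paper):

    import numpy as np

    def meta_params(stress, recent_errors):
        sig = lambda x: 1.0 / (1.0 + np.exp(-x))
        learning_rate = 0.1 + 0.4 * sig(1.5 * stress - recent_errors)
        exploration = 0.05 + 0.3 * sig(2.0 * stress)     # epsilon-greedy rate
        return learning_rate, exploration

    q = np.zeros(2)                                       # two-action conditioning task
    rng = np.random.default_rng(0)
    for t in range(200):
        lr, eps = meta_params(stress=0.8, recent_errors=0.2)
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(q))
        r = float(a == 0)                                 # action 0 is rewarded
        q[a] += lr * (r - q[a])                           # simple delta-rule update
    print(q, meta_params(0.8, 0.2))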
4 0.083459765 206 nips-2008-Sequential effects: Superstition or rational behavior?
Author: Angela J. Yu, Jonathan D. Cohen
Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1
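The leaky-integration approximation mentioned in the abstract is easy to state in a few lines (a generic exponential filter with an arbitrary leak rate, not the paper's fitted parameters):

    import numpy as np

    # Leaky integration of a binary event stream (1 = repetition, 0 = alternation):
    # an exponential filter of past observations standing in for the full Bayesian update.
    def leaky_prediction(events, leak=0.1):
        belief, preds = 0.5, []
        for e in events:
            preds.append(belief)              # predicted P(repetition) before seeing e
            belief += leak * (e - belief)     # leaky update toward the new observation
        return np.array(preds)

    rng = np.random.default_rng(0)
    events = (rng.random(200) < 0.5).astype(float)   # truly random sequence
    print(leaky_prediction(events)[-5:])             # predictions still drift with local runs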
5 0.082884923 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
Author: Michael T. Todd, Yael Niv, Jonathan D. Cohen
Abstract: Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework [14] is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control [5-6]. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards. 1 I n t ro d u c t i o n Working memory is loosely defined in cognitive neuroscience as information that is (1) internally maintained on a temporary or short term basis, and (2) required for tasks in which immediate observations cannot be mapped to correct actions. It is widely assumed that prefrontal cortex (PFC) plays a role in maintaining and updating working memory. However, relatively little is known about how PFC develops useful working memory representations for a new task. Furthermore, current work focuses on describing the structure and limitations of working memory, but does not ask why, or in what general class of tasks, is it necessary. Borrowing from the theory of optimal control in partially observable Markov decision problems (POMDPs), we frame the psychological concept of working memory as an internal state representation, developed and employed to maximize future reward in partially observable environments. We combine computational insights from POMDPs and neurobiologically plausible models from cognitive neuroscience to suggest a simple reinforcement learning (RL) model of working memory function that can be implemented through dopaminergic training of the basal ganglia and PFC. The Gating framework is a series of cognitive neuroscience models developed to explain how dopaminergic RL signals can shape useful working memory representations [1-4]. Computationally this framework models working memory as a collection of past observations, each of which can occasionally be replaced with the current observation, and addresses the problem of learning when to update each memory element versus maintaining it. In the original Gating model [1-2] the PFC contained a unitary working memory representation that was updated whenever a phasic dopamine (DA) burst occurred (e.g., due to unexpected reward or novelty). That model was the first to connect working memory and RL via the temporal difference (TD) model of DA firing [7-8], and thus to suggest how working memory might serve a normative purpose. However, that model had limited computational flexibility due to the unitary nature of the working memory (i.e., a singleobservation memory controlled by a scalar DA signal). 
More recent work [3-4] has partially repositioned the Gating framework within the Actor/Critic model of mesostriatal RL [9-10], positing memory updating as but another cortical action controlled by the dorsal striatal
6 0.082690075 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
7 0.079100363 231 nips-2008-Temporal Dynamics of Cognitive Control
8 0.068162553 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
10 0.057825692 131 nips-2008-MDPs with Non-Deterministic Policies
11 0.055720363 180 nips-2008-Playing Pinball with non-invasive BCI
12 0.052590784 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
13 0.049020376 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
14 0.048960097 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
15 0.045877382 212 nips-2008-Skill Characterization Based on Betweenness
16 0.044993248 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
17 0.043397114 33 nips-2008-Bayesian Model of Behaviour in Economic Games
18 0.041362185 124 nips-2008-Load and Attentional Bayes
19 0.040740568 101 nips-2008-Human Active Learning
20 0.039790958 10 nips-2008-A rational model of preference learning and choice prediction by children
topicId topicWeight
[(0, -0.116), (1, 0.149), (2, 0.054), (3, -0.062), (4, 0.015), (5, -0.036), (6, -0.008), (7, 0.012), (8, 0.036), (9, 0.05), (10, 0.03), (11, 0.056), (12, -0.096), (13, -0.039), (14, -0.01), (15, 0.087), (16, 0.044), (17, -0.045), (18, 0.021), (19, 0.02), (20, 0.024), (21, -0.066), (22, -0.039), (23, 0.036), (24, -0.005), (25, -0.073), (26, 0.013), (27, 0.007), (28, -0.037), (29, -0.092), (30, 0.015), (31, -0.02), (32, 0.059), (33, -0.011), (34, -0.023), (35, -0.024), (36, 0.072), (37, -0.082), (38, 0.071), (39, 0.141), (40, 0.088), (41, 0.022), (42, 0.079), (43, -0.055), (44, 0.031), (45, 0.051), (46, 0.03), (47, 0.046), (48, 0.05), (49, -0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.94910449 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
2 0.75570202 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
3 0.68426359 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
4 0.61782241 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
Author: Matthew Botvinick, James An
Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection. 1 G oal- d irect ed act i on cont rol In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goaldirected behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. 
Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no ‘off-theshelf’ algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilisitic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection. 2 Co m p u t a t i o n a l m o d el As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode taskspecific mappings from stimuli to responses (e.g., [16]): “task representations,” in the language of psychology, or “policies” in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. 
However, some evidence suggests that the enviromental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference. 2.1 A rc h i t e c t u re The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a) representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables ( ), each repre-senting the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further Fig 1. Left: Single-step decision. Right: Sequential decision. below, capture the role of OFC in Each time-slice includes a set of m policy nodes. representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon. The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e, p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u): p (u si ) = 1 R ( si ) +1 , rmax 2 rmax max j R ( s j ) (1) In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable (u G). Like u, this 1 is a binary random variable, but associated with a posterior probability determined as: p (uG ) = 1 N p(u i ) (2) i where N is the number of u nodes. The network as whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). 
Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u G , as explained next. 2.2 P o l i c y s e l e c t i o n b y p ro b a b i l i s t i c i n f e re n c e : a n i t e r a t i v e a l g o r i t h m Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic 2 priors. (2) Treating the initial state and u G as observed variables (u G = 1), use standard belief 1 Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2. 2 In the single-action situation, where there is only one u node, it is this variable that is treated as observed (u = 1). propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm. 2.2.1 Monotonicity We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define * as follows: ( * p uG , + ) > p (u + , G ), * (3) where + is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that: p ( t* ) > p ( t* 1 ) (4) where t indexes processing iterations. The dynamics of the network entail that p( ) = p( t t 1 uG ) (5) where represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives p t* 1 uG > p ( t* 1 ) (6) ( ) From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields p (uG * p (uG ) p( ) > p * )p ( ) ( ) * (7) Canceling, and bringing the denominator up, this becomes p (uG * )> p (uG ) p( ) (8) Rewriting the left hand side, we obtain p ( uG * ) p( ) > p (uG ) p( ) (9) Subtracting and further rearranging: p (uG p (uG * ) p ( uG * * * ) p (uG ) p( ) + p (uG * * ) * p ( uG ) p( ) > 0 p (uG * ) p (uG ) p( ) > 0 (10) ) p( ) > 0 (11) (12) Note that this last inequality (12) follows from the definition of *. Remark: Of course, the identity of * depends on +. In particular, the policy * will only be part of a globally optimal plan if the set of choices + is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, + is empty. Thus * at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that * will continuously rise. 
2.2.2 Convergence

Continuing with the same notation, we show now that

\lim_{t \to \infty} p_t(\pi^* \mid u_G) = 1   (13)

Note that, if we apply Bayes' law recursively,

p_t(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_t(\pi)}{p_t(u_G)} = \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)} = \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)\, p_{t-2}(u_G)} = \cdots   (14)

Thus,

p_1(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_1(\pi)}{p_1(u_G)}, \quad p_2(\pi \mid u_G) = \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_2(u_G)\, p_1(u_G)}, \quad p_3(\pi \mid u_G) = \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_3(u_G)\, p_2(u_G)\, p_1(u_G)},   (15)

and so forth. Thus, what we wish to prove is

\lim_{t \to \infty} \frac{p(u_G \mid \pi^*)^t\, p_1(\pi^*)}{\prod_{\tau=1}^{t} p_\tau(u_G)} = 1   (16)

or, rearranging,

\lim_{t \to \infty} \frac{\prod_{\tau=1}^{t} p_\tau(u_G)}{p(u_G \mid \pi^*)^t} = p_1(\pi^*).   (17)

Note that, given the stipulated relationship between p(π) on each processing iteration and p(π | u_G) on the previous iteration,

p_t(u_G) = \sum_{\pi} p(u_G \mid \pi)\, p_t(\pi) = \sum_{\pi} \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_{t-1}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^4\, p_{t-3}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)\, p_{t-3}(u_G)} = \cdots   (18)

With this in mind, we can rewrite the left hand side product in (17) as follows:

\prod_{\tau=1}^{t} p_\tau(u_G) = \left[\sum_{\pi} p(u_G \mid \pi)\, p_1(\pi)\right] \left[\sum_{\pi} \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_1(u_G)}\right] \left[\sum_{\pi} \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_1(u_G)\, p_2(u_G)}\right] \cdots \left[\sum_{\pi} \frac{p(u_G \mid \pi)^t\, p_1(\pi)}{p_1(u_G) \cdots p_{t-1}(u_G)}\right]   (19)

Note that, given (18), the denominator in each factor of (19) cancels against the product of the preceding factors, so that the product telescopes to \prod_{\tau=1}^{t} p_\tau(u_G) = \sum_{\pi} p(u_G \mid \pi)^t\, p_1(\pi). Dividing by p(u_G \mid \pi^*)^t, the left hand side of (17) can thus be rewritten as

\sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{t} p_1(\pi)   (20)

The objective is then to show that, in the limit, the above equals p_1(π*). It proceeds directly from the definition of π* that, for all π other than π*,

\frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} < 1   (21)

Thus, all but one of the terms in the sum above approach zero, and the remaining term equals p_1(π*). Thus,

\lim_{t \to \infty} \sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{t} p_1(\pi) = p_1(\pi^*)   (22)

which establishes (17), and with it (13).

3 Simulations

3.1 Binary choice

We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges. A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B). Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C.
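Before turning to the simulation figures, note that the same binary-utility encoding handles stochastic outcomes simply by marginalising p(u = 1 | s) over an action's outcome distribution; because the mapping of Equation 1 is linear, this preserves the expected-value ordering. A quick numerical check using the two-coin gamble of section 3.2 below (variable names are illustrative):

```python
# Two fair coins: each entry is a list of (probability, reward) outcome pairs.
coins = {
    "left":  [(0.5, 1.0), (0.5, 0.0)],    # $1 for heads, $0 for tails
    "right": [(0.5, 2.0), (0.5, -3.0)],   # $2 for heads, $3 loss for tails
}
r_max = max(abs(r) for outs in coins.values() for _, r in outs)

def p_u_given_choice(outcomes):
    # Marginalise p(u=1 | s) of Eq. (1) over the chosen coin's stochastic outcomes.
    return sum(p * 0.5 * (r / r_max + 1.0) for p, r in outcomes)

for name, outs in coins.items():
    ev = sum(p * r for p, r in outs)
    print(name, "p(u=1) =", round(p_u_given_choice(outs), 3), " EV =", ev)
# left  p(u=1) = 0.583  EV =  0.5   <- preferred
# right p(u=1) = 0.417  EV = -0.5
```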
Fig 2. Simulation results, binary choice.

3.2 Stochastic outcomes

A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A 'left' coin yields $1 for heads, $0 for tails; a 'right' coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.

Fig 3. Simulation results, two-step sequential choice.

3.3 Sequential decision

Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decisions. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D. A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions (see footnote 3) in a graphical model of the form shown in Fig. 1 (right), representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.

Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.

Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s12, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5.

Fig 5. Latent learning.

4 Relation to previous work

Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward maximizing) plans.
More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10]. As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations. (Footnote 3: In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.)

Acknowledgments

Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.

References

[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century. [2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century. [3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78. [4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419. [5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711. [6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129. [7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174. [8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379. [9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain environments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006. [10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA. [11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003. [12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in artificial intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann. [13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291. [14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA. [15] Miller, E.K. and J.D.
Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202. [16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459. [17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29. [18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226. [19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31. [20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168. [21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697. [22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag. [23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis. [24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381. [25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.
5 0.61442906 223 nips-2008-Structure Learning in Human Sequential Decision-Making
Author: Daniel Acuna, Paul R. Schrater
Abstract: We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure. 1
6 0.60337865 33 nips-2008-Bayesian Model of Behaviour in Economic Games
7 0.53570753 231 nips-2008-Temporal Dynamics of Cognitive Control
8 0.53156799 172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters
9 0.50806707 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
10 0.5047515 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
11 0.49483305 7 nips-2008-A computational model of hippocampal function in trace conditioning
12 0.48740393 180 nips-2008-Playing Pinball with non-invasive BCI
13 0.48339868 212 nips-2008-Skill Characterization Based on Betweenness
14 0.39997452 10 nips-2008-A rational model of preference learning and choice prediction by children
15 0.39792171 124 nips-2008-Load and Attentional Bayes
16 0.39528582 206 nips-2008-Sequential effects: Superstition or rational behavior?
17 0.38626209 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
18 0.35698217 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance
19 0.34591961 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
20 0.33766854 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
topicId topicWeight
[(4, 0.018), (6, 0.03), (7, 0.039), (12, 0.024), (15, 0.496), (18, 0.014), (28, 0.133), (57, 0.034), (59, 0.021), (63, 0.024), (71, 0.025), (77, 0.039), (83, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.83023167 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports. There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective ’talking’ therapies for MDD such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005) explicitly concentrate on altering patients’ maladaptive behaviors and decision making processes. Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008). ∗ qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/∼qhuys/pub.html 1 Master Yoked Control Figure 1: The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times. 
The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e. its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable: there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, either determined by reward or, as here, by punishment contingencies. This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH, Maier and Seligman 1976; Maier and Watkins 2005). We formulate helplessness parametrically as prior beliefs on aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects' behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects' behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification.

1 Core concepts: helplessness and anhedonia

The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivating framework for the development of the main talking therapies for depression (cognitive behavioural therapy; Williams, 1992); it has motivated the development of further, yet more specific animal models (Willner, 1997); and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993). Behavioral control is the central concept in LH: yoked and master rats do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. It is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple. If all available actions deterministically led to the same outcome, the agent has very little control. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats' case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency.
Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. (Left panel: the predictive distribution P(reward | a_known) over rewards 1-5; right panel: the difference Q(a_known) - Q(a_unknown) as a function of tree depth 1-5, for high-control versus low-control priors, with positive values favouring the familiar blue slot machine and negative values the novel orange one.) Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice, and reward 4 three times (inset in left panel). Left: Predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all mass on the observations. Right: Assume now the agent gets up to 5 more pulls (tree depth 1-5) between the blue slot machine and a new, orange slot machine. The orange slot machine's predictive distribution is flat as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First consider the case where the agent has only one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note though that the difference is more pronounced if the agent has a high control prior. But things change if the agent has two or more choices. Now, it is worth trying out the new machine if the agent has a high-control prior. For in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again for the second and subsequent times. Thus, the prior about control determines the exploration bonus.

The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, these results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994); Henriques and Davidson (2000) show decreased responsiveness of MDD subjects to rewards, but equally show decreased responsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and show that the rate of change of the response bias is anticorrelated with subjects' anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at present not clear. Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy).
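A toy calculation in the spirit of Figure 2 shows how a prior belief about control reshapes an action's predictive distribution. The sketch below is only a stand-in for the model developed in the next section: it uses a plain symmetric Dirichlet prior whose concentration α plays the role of the outcome-entropy prior (small α mimicking a high-control belief, large α a low-control one); the parametrisation of γ actually used in the experiments is not reproduced here, and all names are illustrative.

```python
import numpy as np

def predictive(counts, alpha):
    """Posterior predictive over K possible rewards after observing `counts`,
    under a symmetric Dirichlet prior with per-outcome concentration alpha."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Blue slot machine from Figure 2: rewards 1..5, reward 2 seen twice, reward 4 three times.
counts = np.array([0, 2, 0, 3, 0])

high_control = predictive(counts, alpha=0.1)   # low-entropy prior: mass stays on observations
low_control  = predictive(counts, alpha=10.0)  # high-entropy prior: predictions stay spread out

print(np.round(high_control, 2))  # ~[0.02 0.38 0.02 0.56 0.02]
print(np.round(low_control, 2))   # ~[0.18 0.22 0.18 0.24 0.18]
```

Under the low-entropy (high-control) prior the predictions concentrate on the observed payoffs, which is what makes a single good outcome from a new machine so informative and hence generates the exploration bonus described above.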
2 A Bayesian formulation of control

We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy, and, if used as a prior, allow for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep count N_t(r, a) = \sum_{k: k < t} \mathbb{1}[a_k = a, r_k = r] of how often each action has yielded each reward up to trial t. For the reward sensitivity β, we write β(BDIa, AGE) = θ_β BDIa + c_β AGE + ζ_β, with the prediction θ_β < 0. Here, we included a regressor for the AGE as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ǫ (from equation 5), ǫ(BDIa, AGE) = θ_ǫ BDIa + c_ǫ AGE + ζ_ǫ, but find that θ_ǫ is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa scores were just less attentive to the task, one might also expect an effect of BDIa on learning rate.

3.2 Control

Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to the main features of the task. We thus fit both a RW-learning rule (as in the previous section, but adjusted for the varying number of available actions), and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy, and evaluates a tree. The chance probability of subjects' actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW-learning rule was better, at 0.48, and that of the full control model 0.54. These differences are highly significant as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree.

Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer linear parameters:

γ_1(BDIa, BHS, age) = χ_γ1 BHS + θ_γ1 BDIa + c_γ1 AGE + ζ_γ1
γ_2(BDIa, BHS, age) = χ_γ2 BHS + θ_γ2 BDIa + c_γ2 AGE + ζ_γ2
β(BDIa, BHS, age) = χ_β BHS + θ_β BDIa + c_β AGE + ζ_β

Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θ_γ1, θ_γ2, χ_β), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ.

(Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses. The blue line is a linear classifier; axes are reward sensitivity β against control γ. Classification was 83% correct, with 69% sensitivity and 94% specificity. Thus, the patients and controls can be approximately classified purely on the basis of behaviour.)

We here infer and display two separate values γ_1 and γ_2.
These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control, just like the yoked rats in the original experiment. In the second half of the experiment, on the other hand, slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimates are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half of the experiment than in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals.

3.3 Classification

Finally we combine the two tasks. We integrate out the learning rate ǫ, which we had found not to be related to the questionnaire measures (cf. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them.

4 Discussion

In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have an effect in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and shown that people's verbal expressions of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards and those expressing hopelessness behave as if they had prior beliefs that outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of their psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance. Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior; c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks; d) combine tasks to tease helplessness and anhedonia apart; and e) use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to, say, DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies.
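As an illustration of this final classification step, the sketch below fits a linear decision boundary to per-subject (β, γ) values. The numbers are synthetic stand-ins rather than the study's data, and a simple least-squares discriminant is used in place of whatever linear classifier was actually employed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in values for per-subject posterior estimates (beta, gamma);
# these are NOT the study's data, just two plausible clusters for illustration.
controls = rng.normal(loc=[10.0, 1.5], scale=[2.0, 0.6], size=(20, 2))
patients = rng.normal(loc=[6.0, 0.2], scale=[2.0, 0.6], size=(20, 2))

X = np.vstack([controls, patients])
y = np.concatenate([np.ones(len(controls)), -np.ones(len(patients))])  # +1 control, -1 MDD

# Least-squares linear discriminant: fit w on [beta, gamma, 1] and classify by sign.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
pred = np.sign(X_aug @ w)

print("decision boundary weights:", np.round(w, 3))
print("training accuracy:", np.mean(pred == y))
```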
Several important caveats do remain. First, the populations are not fully matched for age. We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments. We have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging.

Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw, Hanneke EM Den Ouden and René Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH) and a Columbia University startup grant to Kenneth Miller.

References

American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press. Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302. Cohen, J. D., Braver, T. S., and O'Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527. Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879. Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4. Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335–350. DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition. (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition. (SCID-I/P). Biometrics Research, New York State Psychiatric Institute. Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press. Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24. Henriques, J. B., Glowacki, J. M., and Davidson, R.
J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6. Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London. Huys, Q. J. M. and Dayan, P. (2007). A bayesian formulation of behavioral control. Under Review, 0:00. Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76. Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am. J. Psychiatry, 156:837–41. Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46. Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci. Biobehav. Rev., 29(4-5):829–41. Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16(5):1936–47. Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160. Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A theory for the age of personal control. OUP, Oxford, UK. Pizzagalli, D. A., Jahn, A. L., and O’Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327. Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition. Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharm., 29(6):1040–9. Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br. J. Psychiatry, 174:525–9. Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J. Clin. Exp. Neuropsychol., 19:587–603. Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attentiondeficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9. Williams, J. M. G. (1992). The psychological treatment of depression. Routledge. Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharm, 134:319–29. Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692. 8
2 0.61828136 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity
Author: Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani
Abstract: We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from tens to hundreds of neurons on individual experimental trials. Current methods for extracting neural trajectories involve a two-stage process: the data are first “denoised” by smoothing over time, then a static dimensionality reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way, and account for spiking variability that may vary both across neurons and across time. We then present a novel method for extracting neural trajectories, Gaussian-process factor analysis (GPFA), which unifies the smoothing and dimensionality reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that GPFA provided a better characterization of the population activity than the two-stage methods. 1
3 0.54453403 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach
Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1
4 0.52512538 224 nips-2008-Structured ranking learning using cumulative distribution networks
Author: Jim C. Huang, Brendan J. Frey
Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1
5 0.34661648 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
Author: Matthew Botvinick, James An
Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection. 1 G oal- d irect ed act i on cont rol In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goaldirected behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. 
Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no 'off-the-shelf' algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.

2 Computational model

As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): "task representations," in the language of psychology, or "policies" in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear.
However, some evidence suggests that the enviromental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference. 2.1 A rc h i t e c t u re The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a) representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables ( ), each repre-senting the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further Fig 1. Left: Single-step decision. Right: Sequential decision. below, capture the role of OFC in Each time-slice includes a set of m policy nodes. representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon. The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e, p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u): p (u si ) = 1 R ( si ) +1 , rmax 2 rmax max j R ( s j ) (1) In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable (u G). Like u, this 1 is a binary random variable, but associated with a posterior probability determined as: p (uG ) = 1 N p(u i ) (2) i where N is the number of u nodes. The network as whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). 
Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u G , as explained next. 2.2 P o l i c y s e l e c t i o n b y p ro b a b i l i s t i c i n f e re n c e : a n i t e r a t i v e a l g o r i t h m Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic 2 priors. (2) Treating the initial state and u G as observed variables (u G = 1), use standard belief 1 Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2. 2 In the single-action situation, where there is only one u node, it is this variable that is treated as observed (u = 1). propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm. 2.2.1 Monotonicity We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define * as follows: ( * p uG , + ) > p (u + , G ), * (3) where + is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that: p ( t* ) > p ( t* 1 ) (4) where t indexes processing iterations. The dynamics of the network entail that p( ) = p( t t 1 uG ) (5) where represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives p t* 1 uG > p ( t* 1 ) (6) ( ) From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields p (uG * p (uG ) p( ) > p * )p ( ) ( ) * (7) Canceling, and bringing the denominator up, this becomes p (uG * )> p (uG ) p( ) (8) Rewriting the left hand side, we obtain p ( uG * ) p( ) > p (uG ) p( ) (9) Subtracting and further rearranging: p (uG p (uG * ) p ( uG * * * ) p (uG ) p( ) + p (uG * * ) * p ( uG ) p( ) > 0 p (uG * ) p (uG ) p( ) > 0 (10) ) p( ) > 0 (11) (12) Note that this last inequality (12) follows from the definition of *. Remark: Of course, the identity of * depends on +. In particular, the policy * will only be part of a globally optimal plan if the set of choices + is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, + is empty. Thus * at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that * will continuously rise. 
Once it reaches a maximum, * at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction. 2.2.2 Convergence Continuing with the same notation, we show now that pt ( limt uG ) = 1 * (13) Note that, if we apply Bayes’ law recursively, pt ( uG ) = ( ) p ( ) = p (u p uG t ) G pi (uG ) 2 pt pi (uG ) pt 1 ( ( )= ) p uG 1 ( uG ) pt (uG ) pt 3 pt 2 ( ) 1 ( u G ) pt 2 ( u G ) … (14) Thus, p1 ( uG ) = ( p uG ) p ( ), p ( p (u ) 1 uG ) = 2 1 G 2 ( ) p ( ), p uG 1 p2 (uG ) p1 (uG ) p3 ( 3 ( ) p( ) p uG uG ) = 1 p3 (uG ) p2 (uG ) p1 (uG ) , (15) and so forth. Thus, what we wish to prove is ( * p uG ) p ( ) =1 * 1 (16) pt (uG ) t =1 or, rearranging, pt (uG ) ( = p1 ( ) p uG t =1 (17) ). Note that, given the stipulated relationship between p( ) on each processing iteration and p( | uG) on the previous iteration, p (uG pt (uG ) = )p ( ) = p ( uG = pt 1 )p ( p (uG t uG ) = t 1 3 )p ( ) 4 p (uG t 1 = (uG ) pt 2 (uG ) pt 1 ) pt 2 p (uG )p ( ) t 1 pt 1 1 ( ) (uG ) pt 2 (uG ) pt 3 (uG ) ( uG ) (18) … With this in mind, we can rewrite the left hand side product in (17) as follows: p ( uG p1 (uG ) ( p uG ) p (u G 2 )p( ) ) p (u 1 G ) 3 p (uG 1 ( p uG )p( ) ) p (u 1 G 4 p (uG 1 ( ) p2 (uG ) p uG ) p (u 1 ) p( ) 1 G ) p2 (uG ) p3 (uG ) … (19) Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only p(uG| *) in that denominator. The expression can thus be rewritten as 1 ( p uG 1 ) p (u G ) p (u G 4 p (uG 1 ) ) p( ) 1 ( p uG ) … = p (uG ( p uG ) ) p1 ( ). (20) The objective is then to show that the above equals p( *). It proceeds directly from the definition of * that, for all other than *, p ( uG ( p uG ) ) <1 (21) Thus, all but one of the terms in the sum above approach zero, and the remaining term equals p1( *). Thus, p (uG ( p uG ) ) p1 ( ) = p1 ( ) (22) 3 Simulations 3.1 Binary choice We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges. A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B). Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C. Fig 2. 
3.2 Stochastic outcomes

A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches that use graphical models for planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A 'left' coin yields $1 for heads and $0 for tails; a 'right' coin yields $2 for heads but a $3 loss for tails. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin (a code sketch of this example appears at the end of this section).

3.3 Sequential decision

Here we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig. 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B.

Fig 3. Simulation results, two-step sequential choice.

Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C.

The model can also be used to simulate effort-based decision making. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making the appropriate changes to p(u | s). The resulting behavior is shown in Fig. 3D.

A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions in a graphical model of the form shown in Fig. 1 (right),³ representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.

Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.

Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s13, choices at each intersection were random. Setting R(s13) = 1, however, resulted in the set of choices indicated by the heavier arrows in Fig. 5.

Fig 5. Latent learning.
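Returning to the stochastic-outcome example of section 3.2, the following sketch computes the choice the model settles on. Because the right coin's tails outcome carries a loss, the rewards are first mapped into [0, 1] by an affine rescaling; this particular mapping is an assumption made for illustration (any positive affine map preserves the ranking of expected values, and hence the identity of the optimal choice).

```python
import numpy as np

# Two fair coins (section 3.2): 'left' pays $1/$0, 'right' pays $2/-$3.
rewards = {"left_heads": 1.0, "left_tails": 0.0,
           "right_heads": 2.0, "right_tails": -3.0}

# Assumed mapping from rewards to p(u = 1 | outcome): affine rescaling into [0, 1].
r_min, r_max = min(rewards.values()), max(rewards.values())
p_u_given_outcome = {o: (r - r_min) / (r_max - r_min) for o, r in rewards.items()}

# Outcome probabilities given the chosen coin (both coins are fair).
p_outcome_given_action = {
    "left":  {"left_heads": 0.5, "left_tails": 0.5},
    "right": {"right_heads": 0.5, "right_tails": 0.5},
}

actions = list(p_outcome_given_action)
lik = np.array([
    sum(p * p_u_given_outcome[o] for o, p in p_outcome_given_action[a].items())
    for a in actions
])                                        # p(u_G = 1 | action)

prior = np.array([0.5, 0.5])
for _ in range(50):                       # iterative scheme of section 2.2
    prior = lik * prior
    prior /= prior.sum()

expected_value = {a: sum(p * rewards[o] for o, p in p_outcome_given_action[a].items())
                  for a in actions}
print(expected_value)                     # {'left': 0.5, 'right': -0.5}
print(dict(zip(actions, prior)))          # probability mass concentrates on 'left'
```

As in Fig. 2D, the model settles on the left coin, whose expected value ($0.50) exceeds that of the right coin (-$0.50), even though the right coin offers the larger single payoff.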
4 Relation to previous work

Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward-maximizing) plans.

More closely related to the present research is work by Toussaint and Storkey [10], which employs the EM algorithm. The iterative approach introduced here bears a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there is a formal equivalence between the algorithm we have proposed and the one reported in [10].

As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6] addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present effort is tied more closely to normative principles of decision making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to consider their interrelations further.

³ In this simulation and the next, the set of states associated with each state node was limited to the set of states reachable at the relevant time-step, assuming an initial state of s1.

Acknowledgments

Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.

References

[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century.
[2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century.
[3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78.
[4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419.
[5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711.
[6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129.
[7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping: a neural network model. Biological Cybernetics, 1992. 67: p. 165-174.
[8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379.
[9] Verma, D. and R.P.N. Rao, Planning and acting in uncertain environments using probabilistic inference. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[10] Toussaint, M. and A. Storkey, Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, 2006, Pittsburgh, PA.
[11] Attias, H., Planning by probabilistic inference. In Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics, 2003.
[12] Shachter, R.D. and M.A. Peot, Decision making using probabilistic inference methods. In Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, 1992. Stanford University: M. Kaufmann.
[13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291.
[14] Doya, K., et al., eds., The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA.
[15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202.
[16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459.
[17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29.
[18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226.
[19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31.
[20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168.
[21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697.
[22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag.
[23] Cooper, G.F., A method for using belief networks as influence diagrams. In Fourth Workshop on Uncertainty in Artificial Intelligence, 1988. University of Minnesota, Minneapolis.
[24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381.
[25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.