nips nips2004 nips2004-84 knowledge-graph by maker-knowledge-mining

84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture


Source: pdf

Author: Angela J. Yu, Peter Dayan

Abstract: We study the synthesis of neural coding, selective attention and perceptual decision making. A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and top-down attentional priors, leading to sound perceptual discrimination. The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). The network’s intermediate levels of representation instantiate known physiological properties of visual cortical neurons. The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We study the synthesis of neural coding, selective attention and perceptual decision making. [sent-9, score-0.284]

2 A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and top-down attentional priors, leading to sound perceptual discrimination. [sent-10, score-0.472]

3 The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). [sent-11, score-0.289]

4 The network’s intermediate levels of representation instantiate known physiological properties of visual cortical neurons. [sent-12, score-0.255]

5 The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty. [sent-13, score-0.203]

6 1 Introduction A constant stream of noisy and ambiguous sensory inputs bombards our brains, informing on-going inferential processes and directing perceptual decision-making. [sent-14, score-0.223]

7 Neurophysiologists and psychologists have long studied inference and decision-making in isolation, as well as the careful attentional filtering that is necessary to optimize them. [sent-15, score-0.229]

8 In this paper, we study an attentional task which involves all three components, and thereby directly confront their interaction. [sent-17, score-0.196]

9 One is microscopic, for which individual cortical neurons and populations either implicitly or explicitly represent the uncertainty. [sent-21, score-0.2]

10 The other family is macroscopic, with cholinergic (ACh) and noradrenergic (NE) neuromodulatory systems reporting computationally distinct forms of uncertainty to influence the way that information in differentially reliable cortical areas is integrated and learned [6, 7]. [sent-23, score-0.311]

11 The second element is selective attention and top-down influences over sensory processing. [sent-25, score-0.262]

12 Here, the key challenge is to couple the many ideas about the way that attention should, from a sound statistical viewpoint, modify sensory processing, to the measurable effects of attention on the neural substrate. [sent-26, score-0.416]

13 For instance, one typical consequence of (visual) featural and spatial attention is an increase in the activities of neurons in cortical populations representing those features, which is equivalent to multiplying their tuning functions by a factor [8]. [sent-27, score-0.806]

14 The third element is the coupling between sensory processing and perceptual decisions. [sent-29, score-0.185]

15 In order to explore the interaction of these elements, we model an extensively studied attentional task (due to Posner [15]), in which probabilistic spatial cueing is used to manipulate attentional modulation of visual discrimination. [sent-31, score-0.767]

16 We employ a hierarchical neural architecture in which top-down attentional priors are integrated with sequentially sampled sensory input in a sound Bayesian manner, using a logarithmic mapping between cortical neural activities and uncertainty [4]. [sent-32, score-0.799]

17 In the model, the information provided by the cue is realized as a change in the prior distribution over the cued dimension (space). [sent-33, score-0.394]

18 The effect of the prior is to eliminate inputs from spatial locations considered irrelevant for the task, thus improving discrimination in another dimension (orientation). [sent-34, score-0.276]

19 In section 4, we compare the perceptual performance of the network to psychophysics data, and the intermediate layers’ activities to the relevant physiological data. [sent-37, score-0.375]

20 2 Spatial Attention as Prior Information In the classic version of Posner’s task [15], a subject is presented with a cue that predicts the location of a subsequent target with a certain probability termed its validity. [sent-38, score-0.189]

21 The cue is valid if it makes a correct prediction, and invalid otherwise. [sent-39, score-0.672]

22 Subjects typically perform detection or discrimination on the target more rapidly and accurately on a valid-cue trial than an invalid one, reflecting cue-induced attentional modulation of visual processing and/or decision making [15]. [sent-40, score-0.833]

23 This difference in reaction time or accuracy is often termed the validity effect [16], and depends on the cue validity [17]. [sent-41, score-0.47]
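As a small illustration of the validity effect defined above, the following sketch computes it for reaction times. The function name and the sample reaction times are hypothetical and are not taken from the paper; the only assumption is that the effect is measured as the mean invalid-cue reaction time minus the mean valid-cue reaction time.

```python
from statistics import mean

def validity_effect(rt_valid, rt_invalid):
    """Validity effect in reaction time (hypothetical helper).

    Assumed definition: mean RT on invalid-cue trials minus mean RT on
    valid-cue trials, so a positive value means valid cues speed responses.
    """
    return mean(rt_invalid) - mean(rt_valid)

# Made-up reaction times in arbitrary time steps, for illustration only.
ve = validity_effect(rt_valid=[10, 12, 14], rt_invalid=[15, 17, 19])
```

The same subtraction applies to error rates if accuracy, rather than speed, is the measure of interest.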

24 The cue induces a top-down spatial prior, which we model as a mixture of a component sharply peaked at the cued location and a broader component capturing contextual and bottom-up saliency factors (including the possibility of invalidity). [sent-43, score-0.451]
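A minimal sketch of such a mixture prior over discretized locations is given below. It assumes that the peaked component is a narrow Gaussian bump, that the broad component is uniform, and that the mixing weight is the perceived cue validity γ; all three choices, and every parameter name, are illustrative assumptions rather than the paper's exact parameterization.

```python
import math

def spatial_prior(n_locations, cued_index, gamma, narrow_sd=0.5):
    """Mixture prior P(mu_i) over discretized locations (hypothetical names).

    gamma     : assumed weight on the component sharply peaked at the cue
    1 - gamma : weight on a broad component, taken here to be uniform
    narrow_sd : assumed width of the peaked component, in location units
    """
    peaked = [math.exp(-0.5 * ((i - cued_index) / narrow_sd) ** 2)
              for i in range(n_locations)]
    z = sum(peaked)
    peaked = [p / z for p in peaked]      # normalize the peaked component
    broad = 1.0 / n_locations             # uniform broad component
    return [gamma * p + (1.0 - gamma) * broad for p in peaked]

prior = spatial_prior(n_locations=10, cued_index=3, gamma=0.9)
```

Because the broad component gives every location nonzero mass, an invalidly cued target is down-weighted but never excluded outright.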

25 [Figure 1 residue; layer equations recoverable from the panel labels: Layer IV: $r^4_j(t) = r^4_j(t-1) + r^3_j(t) + c_t$; Layer III: $r^3_j(t) = \log \sum_i \exp(r^2_{ij}(t)) + b_t$; Layer II: $r^2_{ij}(t) = r^1_{ij}(t) + \log P(\mu_i) + a_t$; tuning function $f_{ij}(\mu^*, \phi^*)$.] [sent-58, score-1.815]

26 [Figure 1 residue, continued: Layer I: $r^1_{ij}(t) = \log p(x_t \mid \mu_i, \phi_j)$.] [sent-62, score-0.245]

27 Layer I activities represent the log likelihood of the data given each possible setting of µi and φj . [sent-71, score-0.29]

28 In layer II, the log likelihood of each $\mu_i$ and $\phi_j$ is modulated by the prior information $\log P(\mu_i)$, shown on the upper left. [sent-73, score-0.489]

29 The prior in µ strongly suppresses the noisy input in the irrelevant part of the µ dimension, thus enabling improved inference based on the underlying tuning response fij . [sent-74, score-0.257]

30 The layer III neurons represent the log marginal posterior of φ by integrating out the µ dimension of layer II activities. [sent-75, score-0.93]

31 Layer IV neurons combine recurrent information and feedforward input from layer III to compute the log marginal posterior given all data so far observed. [sent-76, score-0.6]

32 Due to the strong nonlinearity of softmax, layer V activity is much more peaked than in layers III and IV. [sent-78, score-0.398]

33 Blue circles illustrate how the activities of one row of inputs in Layer I travel through the hierarchy to affect the final decision layer. [sent-80, score-0.312]

34 Brown circles illustrate how one unit in the spatial prior layer comes into the integration process. [sent-81, score-0.487]

35 integration of more “signal” and less “noise” into the marginal posterior, whereas the opposite results from an invalid cue. [sent-82, score-0.415]

36 In layer I, the activity of neuron $ij$, $r^1_{ij}(t)$, reports the log likelihood, $\log p(x_t \mid \mu_i, \phi_j)$ (throughout, we discretize space and orientation). [sent-88, score-0.5]

37 Layer II combines this log likelihood information with the prior, $r^2_{ij}(t) = r^1_{ij}(t) + \log P(\mu_i) + a_t$, to yield the joint log posterior up to an additive constant $a_t$ that makes $\min r^2_{ij} = 0$. [sent-89, score-0.81]
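The layer II update can be sketched as follows, assuming the layer I activities are held in a nested list indexed as `r1[i][j]` (location index i, orientation index j). The function name and the toy numbers are hypothetical.

```python
import math

def layer2(r1, prior_mu):
    """Add the log prior over location to the layer I log likelihoods.

    r1[i][j]    : log p(x_t | mu_i, phi_j)  (layer I activity)
    prior_mu[i] : P(mu_i)                   (top-down spatial prior)
    Returns r2 with an additive constant a_t chosen so that min(r2) == 0.
    """
    raw = [[r1[i][j] + math.log(prior_mu[i]) for j in range(len(r1[i]))]
           for i in range(len(r1))]
    a_t = -min(min(row) for row in raw)   # shift so the smallest entry is 0
    return [[v + a_t for v in row] for row in raw]

# Toy example: 2 locations x 2 orientations, prior favouring location 0.
r1 = [[-1.0, -2.0], [-3.0, -0.5]]
r2 = layer2(r1, prior_mu=[0.8, 0.2])
```

The shift by `a_t` changes no differences between entries, so it leaves the normalized posterior untouched while keeping all activities non-negative.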

38 Layer III performs the marginalization $r^3_j(t) = \log \sum_i \exp(r^2_{ij}) + b_t$, to give the marginal posterior in $\phi$ (up to a constant $b_t$ that makes $\min r^3_j(t) = 0$). [sent-90, score-0.903]
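A minimal sketch of this exact log-of-sums marginalization is shown below. It uses the standard max-subtraction trick for numerical stability, which is an implementation choice and not something the text specifies; the function name is hypothetical.

```python
import math

def layer3(r2):
    """Marginalize the location dimension out of layer II activities.

    r2[i][j] : joint log posterior over (mu_i, phi_j), up to a constant
    Returns r3[j] = log sum_i exp(r2[i][j]) + b_t, with min(r3) == 0.
    """
    n_mu, n_phi = len(r2), len(r2[0])
    raw = []
    for j in range(n_phi):
        col = [r2[i][j] for i in range(n_mu)]
        m = max(col)                      # subtract the max for stability
        raw.append(m + math.log(sum(math.exp(v - m) for v in col)))
    b_t = -min(raw)                       # shift so the smallest entry is 0
    return [v + b_t for v in raw]

# Column 0 sums to 1 + 1 = 2, column 1 to 3 + 1 = 4 (before taking logs).
r3 = layer3([[0.0, math.log(3.0)], [0.0, 0.0]])
```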

39 While this step (‘log-of-sums’) looks computationally formidable for neural hardware, it has been shown [4] that under certain conditions it can be well approximated by a (weighted) ‘sum-of-logs’ $r^3_j(t) \approx \sum_i c_i r^2_{ij} + b_t$, where $c_i$ are weights optimized to minimize approximation error. [sent-91, score-0.483]

40 Layer IV neurons combine recurrent information and feedforward input from layer III to compute the log marginal [Figure 2 panel titles: (a) Model Valid & Invalid RT; (b) Valid & Invalid $\hat{\phi}$; (c) Reaction Time vs.] [sent-92, score-0.661]

41 (a) The distribution of reaction times for the invalid condition (γ = 0. [sent-105, score-0.442]

42 (b) The distribution of inferred $\hat{\phi}$ is more tightly clustered around the true $\phi^*$ (red dashed line) in the valid case (blue) than in the invalid case (red). [sent-108, score-0.483]

43 posterior given all data so far observed, $r^4_j(t) = r^4_j(t-1) + r^3_j(t) + c_t$, up to a constant $c_t$. [sent-129, score-0.981]
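The recurrent accumulation in layer IV can be sketched as a single update step. The text only says the result holds up to a constant $c_t$; choosing $c_t$ so that the minimum activity is 0, by analogy with layers II and III, is an assumption here, as is the function name.

```python
def layer4_step(r4_prev, r3_now):
    """One time step of evidence accumulation over orientation.

    r4_prev[j] : accumulated log posterior at time t - 1
    r3_now[j]  : layer III log marginal posterior for the current sample
    c_t is assumed (not stated in the text) to shift the minimum to 0.
    """
    raw = [prev + now for prev, now in zip(r4_prev, r3_now)]
    c_t = -min(raw)
    return [v + c_t for v in raw]

# Accumulate two identical observations starting from a flat state.
r4 = [0.0, 0.0, 0.0]
for r3_sample in ([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]):
    r4 = layer4_step(r4, r3_sample)
```

Summing log posteriors over time corresponds to multiplying the evidence from independent samples.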

44 Finally, layer V neurons perform a softmax operation to retrieve the exact marginal posterior, $r^5_j(t) = \exp(r^4_j) / \sum_k \exp(r^4_k) = P(\phi_j \mid X_t)$, with the additive constants dropping out. [sent-130, score-0.801]
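The softmax readout, and the way additive constants drop out of it, can be checked with a short sketch. The function name is hypothetical, and the max subtraction is a numerical-stability choice rather than part of the model description.

```python
import math

def layer5(r4):
    """Softmax over layer IV activities, giving P(phi_j | X_t)."""
    m = max(r4)                            # subtracting a constant is harmless
    e = [math.exp(v - m) for v in r4]
    z = sum(e)
    return [v / z for v in e]

p = layer5([0.0, math.log(3.0)])
# Adding the same constant to every input leaves the output unchanged.
p_shifted = layer5([5.0, 5.0 + math.log(3.0)])
```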

45 An example of activities at each layer of the network, along with the choice of prior p(µ) and tuning function fij , is shown in Fig 1. [sent-133, score-0.726]

46 4 Results We first verify that the model indeed exhibits the cue-induced validity effect, i.e., shorter RT and greater accuracy for valid-cue trials than invalid ones. [sent-134, score-0.503]

47 Figure 2 shows simulation results for 300 trials each of the valid and invalid cue conditions, for different values of γ, reflecting the model’s belief as to cue validity. [sent-136, score-0.949]

48 Consistent with data from a human Posner task [17], (c) shows that the VE increases with increasing perceived cue validity, as parameterized by γ, in both reaction times and error rates (precluding a simple speed-error trade-off). [sent-138, score-0.355]

49 Since we have an explicit model of not only the “behavioral output” but also the whole neural hierarchy, we can relate activities at various levels of representation to existing physiological data. [sent-139, score-0.259]

50 Ample evidence indicates that spatial attention to one side of the visual field increases stimulus-induced activities in the corresponding part of the visual cortex [19, 20]. [sent-140, score-0.69]

51 Fig 3(a) shows that our model qualitatively reproduces this effect; indeed it increases with γ, the perceived cue validity. [sent-141, score-0.259]

52 Electrophysiological data also shows that spatial attention has a multiplicative effect on orientation tuning responses in visual cortical neurons [8] (Fig 3(b)). [sent-142, score-0.806]

53 We see a similar phenomenon in the layer IV neurons (Fig 3(c); layer III similar, data not shown). [sent-143, score-0.692]

54 Fig 3(d) is a scatter-plot of $\langle \log p(x_t, \phi_j) + c_1 \rangle_t$ for the valid condition versus the invalid condition, for various values of γ, along with the slope fit to the experiment of Fig 3(b) (Layer III similar, data not shown). [sent-144, score-0.586]

55 The linear least square error fits are good, and the slope increases with increasing confidence in the cued location (larger γ). [sent-145, score-0.188]

56 [Figure 3 plot residue: panel (b) Attention & V4 Activities; legend cued/uncued; axis labels $r^2_{ij}$ and $r^4_j$.] [sent-148, score-0.635]

57 Figure 3: Multiplicative gain modulation by spatial attention. [sent-156, score-0.507]

58 (a) $r^2_{ij}$ activities, averaged over the half of layer II where the prior peaks, are greater for valid (blue, left) than invalid (red, right) conditions. [sent-157, score-1.025]

59 (b) Experimentally observed multiplicative modulation of V4 orientation tunings by spatial attention [8]. [sent-158, score-0.539]

60 (c) Similar multiplicative effect in layer IV in the model. [sent-159, score-0.432]

61 (d) Linear fits to scatter-plot of layer III activities for valid cue condition vs. [sent-160, score-0.852]

62 invalid cue condition show that the slope is greatest for large γ and smallest for small γ (magenta: γ = 0. [sent-161, score-0.574]

63 In valid cases, the effect of attention is to increase the certainty in the posterior marginal over φ, since the correct prior allows the relative suppression of noisy input from the irrelevant part of space. [sent-168, score-0.579]

64 Were the posterior marginal exactly Gaussian, the increased certainty would translate into a decreased variance. [sent-169, score-0.18]

65 Decreasing the variance increases the curvature, and therefore has a multiplicative effect on the activities (as in figure 3). [sent-171, score-0.393]
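The step from reduced variance to a multiplicative effect can be written out explicitly. The block below is a sketch under the stated Gaussian assumption, with the additional simplifying assumption that the valid and invalid marginal posteriors share the same mean $\phi^*$; the symbols $\sigma_{\mathrm{val}}$, $\sigma_{\mathrm{inv}}$, $k_{\mathrm{val}}$ and $k_{\mathrm{inv}}$ are introduced here for illustration.

```latex
% Log of a Gaussian marginal posterior with mean \phi^* and variance \sigma^2:
\log P(\phi \mid X) = -\frac{(\phi - \phi^*)^2}{2\sigma^2} + k
% Valid and invalid conditions, with \sigma_{\mathrm{val}}^2 < \sigma_{\mathrm{inv}}^2:
\log P_{\mathrm{val}}(\phi \mid X) - k_{\mathrm{val}}
  = \frac{\sigma_{\mathrm{inv}}^2}{\sigma_{\mathrm{val}}^2}
    \left( \log P_{\mathrm{inv}}(\phi \mid X) - k_{\mathrm{inv}} \right)
```

Up to additive constants, the valid-condition log posterior is therefore the invalid-condition one scaled by the factor $\sigma_{\mathrm{inv}}^2 / \sigma_{\mathrm{val}}^2 > 1$, which is the sense in which activities that code log probabilities are multiplicatively modulated.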

66 The approximate gaussianity of the marginal posterior comes from the accumulation of many independent samples over time and space, and something like the central limit theorem. [sent-172, score-0.219]

67 While it is difficult to show this multiplicative modulation rigorously, we can at least demonstrate it mathematically for the case where the spatial prior is very sharply peaked at its Gaussian mean y . [sent-173, score-0.409]

68 We can expand $\log p(x(t) \mid \hat{\mu}, \phi)$ and compute its average over time: $\langle \log p(x(t) \mid \hat{\mu}, \phi) \rangle_t = C - \frac{N}{2\sigma_n^2} \sum_{ij} (f_{ij}(\mu^*, \phi^*) - f_{ij}(\hat{\mu}, \phi))^2$. [sent-176, score-0.207]

69 $\langle \log p_{\mathrm{inv}}(x_t, \phi) \rangle_t + c_2$ (3) The derivation for a multiplicative effect on layer IV activities is very similar. [sent-178, score-0.76]

70 Another aspect of intermediate representation of interest is the way attention modifies the evidence accumulation process over time. [sent-179, score-0.268]

71 Fig 4 shows the effect of cueing on the activities of neuron $r^5_{j^*}(t)$, or $P(\phi^* \mid X_t)$, for all trials with correct responses. [sent-180, score-0.79]

72 The mean activity trajectory is higher for the valid cue case than the invalid one: in this case, spatial attention mainly acts through increasing the rate of evidence accumulation after stimulus onset [Figure 4 plot residue: panels (a), (b); axis label $r^5_{j^*}$.] [sent-181, score-1.509]

73 [Figure 4 plot residue: panels (c), (f); axis labels Time and $r^5_{j^*}$.] [sent-188, score-0.302]

74 Figure 4: Accumulation of iid samples in orientation discrimination, and dependence on prior belief about stimulus location. [sent-198, score-0.299]

75 (a-c) Average activity of neuron $r^5_{j^*}$, which represents $P(\phi^* \mid X_t)$, saturates to 100% certainty much faster for valid cue trials (blue) than invalid cue trials (red). [sent-199, score-1.447]

76 (d) First 15 time steps (from stimulus onset) of the invalid cue traces from (a-c) are aligned to stimulus onset; cyan line denotes stimulus onset. [sent-206, score-0.97]

77 (e) Last 8 time steps of the invalid traces from (a-c) are aligned to decision threshold-crossing; there is no clear separation as a function of γ. [sent-208, score-0.495]

78 (f) Multiplicative gain modulation of attention on V4 orientation tuning curves. [sent-209, score-0.424]

79 This attentional effect is more pronounced when the system is more confident about its prior information ((a) γ = 0. [sent-212, score-0.304]

80 Figure 4 (d) shows the average traces for invalid-cueing trials aligned to the stimulus onset and (e) to the decision threshold crossing. [sent-217, score-0.413]

81 These results bear remarkable similarities to the LIP neuronal activities recorded during monkey perceptual decision-making [13] (shown in (f)). [sent-218, score-0.303]

82 In the stimulus-aligned case, the traces rise linearly at first and then tail off somewhat, and the rate of rise increases for lower (effective) noise. [sent-219, score-0.183]

83 We consider a class of attentional tasks for which top-down modulation of sensory processing can be conceptualized as changes in the prior distribution over implicit stimulus dimensions. [sent-223, score-0.593]

84 We use the specific example of the Posner spatial cueing task to relate the characteristics of this neural architecture to experimental literature. [sent-224, score-0.24]

85 The way these measures depend on valid versus invalid cueing, and on the exact perceived validity of the cue, are similar to those observed in attentional experiments. [sent-226, score-0.783]

86 In this case, the spatial prior affects the marginal posterior over φ by altering the relative importance of joint posterior terms in the marginalization process. [sent-230, score-0.423]

87 This leads to the difference in performance between valid and invalid trials, a difference that increases with γ. [sent-231, score-0.518]

88 This model elaborates on an earlier phenomenological model [9], by showing explicitly how marginalizing (in layer III) over activities biased by the prior (in layer II) produces the effect. [sent-232, score-0.887]

89 The model presents one possible reconciliation of cortical and neuromodulatory representations of uncertainty. [sent-234, score-0.203]

90 The sensory-driven activities (layer I in this model) themselves encode bottom-up uncertainty, including sensory receptor noise and any processing noise that have occurred up until then. [sent-235, score-0.334]

91 One determines the locus and spatial extent of visual attention, the other specifies the relative importance of this top-down bias compared to the bottom-up stimulus-driven input. [sent-237, score-0.198]

92 The first is highly specific in modality and featural dimension, presumably originating from higher visual cortical areas (eg parietal cortex for spatial attention, inferotemporal cortex for complex featural attention). [sent-238, score-0.495]

93 The perceptual decision strategy employed in this model is a natural multi-dimensional extension of SPRT [10], by monitoring the first-time passage of any one of the posterior values crossing a fixed decision threshold. [sent-241, score-0.258]
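A first-passage decision rule of this kind can be sketched as below. The function name, the threshold value and the toy posterior trace are hypothetical; the sketch only assumes that a decision is made at the first time step at which any posterior value reaches the threshold.

```python
def first_passage(posteriors_over_time, threshold):
    """Return (reaction_time, chosen_index) at the first threshold crossing.

    posteriors_over_time : list of posterior vectors, one per time step
    threshold            : fixed decision threshold on posterior probability
    Returns None if no posterior value ever reaches the threshold.
    """
    for t, post in enumerate(posteriors_over_time, start=1):
        best = max(range(len(post)), key=lambda j: post[j])
        if post[best] >= threshold:
            return t, best
    return None

# Toy posterior trace over two orientations (made-up numbers).
trace = [[0.4, 0.6], [0.2, 0.8], [0.05, 0.95]]
decision = first_passage(trace, threshold=0.9)
```

Raising the threshold trades speed for accuracy, since more samples must accumulate before any posterior value crosses it.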

94 Note that the distribution of reaction times is skewed to the right (Fig 2(a)), as is commonly observed in visual discrimination tasks [11]. [sent-243, score-0.219]

95 One is that of noise: our network performs exact Bayesian inference when activities are deterministic. [sent-249, score-0.226]

96 Based on a slightly different task involving sustained attention or vigilance [23], Brown et al [24] have made the interesting suggestion that one role for noradrenergic neuromodulation is to implement a change in the integration strategy when a stimulus is detected. [sent-253, score-0.341]

97 Effects of attention on orientation-tuning functions of single neurons in Macaque cortical area V4. [sent-279, score-0.354]

98 Cholinergic neurotransmission influences overt orientation of visuospatial attention in the rat. [sent-304, score-0.249]

99 Expected and unexpected uncertainties control allocation of attention in a novel attentional learning task. [sent-307, score-0.39]

100 Acetylcholine suppresses the spread of excitation in the visual cortex revealed by optical recording: possible differential effect depending on the source of input. [sent-324, score-0.214]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('invalid', 0.346), ('rj', 0.302), ('layer', 0.3), ('activities', 0.226), ('attentional', 0.196), ('cue', 0.189), ('rij', 0.181), ('fig', 0.168), ('attention', 0.154), ('xt', 0.139), ('valid', 0.137), ('posner', 0.133), ('modulation', 0.115), ('cued', 0.114), ('stimulus', 0.113), ('cortical', 0.108), ('sensory', 0.108), ('iii', 0.101), ('reaction', 0.096), ('orientation', 0.095), ('cueing', 0.095), ('sprt', 0.095), ('neurons', 0.092), ('spatial', 0.09), ('trials', 0.088), ('multiplicative', 0.085), ('fij', 0.079), ('neurosci', 0.079), ('dayan', 0.077), ('perceptual', 0.077), ('ach', 0.076), ('featural', 0.076), ('inv', 0.076), ('posterior', 0.075), ('accumulation', 0.075), ('visual', 0.075), ('uncertainty', 0.07), ('marginal', 0.069), ('iv', 0.069), ('validity', 0.069), ('log', 0.064), ('onset', 0.063), ('prior', 0.061), ('brown', 0.061), ('comput', 0.06), ('val', 0.06), ('tuning', 0.06), ('peaked', 0.058), ('traces', 0.058), ('acetylcholine', 0.057), ('neuromodulatory', 0.057), ('suppresses', 0.057), ('yu', 0.056), ('architecture', 0.055), ('decision', 0.053), ('marginalization', 0.053), ('layers', 0.052), ('discrimination', 0.048), ('effect', 0.047), ('rise', 0.045), ('blue', 0.044), ('activity', 0.04), ('unexpected', 0.04), ('intermediate', 0.039), ('slope', 0.039), ('bt', 0.038), ('aligned', 0.038), ('annu', 0.038), ('cholinergic', 0.038), ('inferential', 0.038), ('lip', 0.038), ('noradrenergic', 0.038), ('orienting', 0.038), ('pinv', 0.038), ('pval', 0.038), ('reconciliation', 0.038), ('uncued', 0.038), ('softmax', 0.038), ('decisions', 0.036), ('certainty', 0.036), ('integration', 0.036), ('priors', 0.036), ('increases', 0.035), ('red', 0.035), ('bayesian', 0.035), ('cortex', 0.035), ('perceived', 0.035), ('codes', 0.034), ('hierarchy', 0.033), ('physiological', 0.033), ('rt', 0.033), ('adv', 0.033), ('info', 0.033), ('locus', 0.033), ('phasic', 0.033), ('psychologists', 0.033), ('neuron', 0.032), ('ii', 0.032), ('dimension', 0.03), ('iid', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture


2 0.20224725 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity

Author: Marcelo A. Montemurro, Stefano Panzeri

Abstract: A typical neuron in visual cortex receives most inputs from other cortical neurons with a roughly similar stimulus preference. Does this arrangement of inputs allow efficient readout of sensory information by the target cortical neuron? We address this issue by using simple modelling of neuronal population activity and information theoretic tools. We find that efficient synaptic information transmission requires that the tuning curve of the afferent neurons is approximately as wide as the spread of stimulus preferences of the afferent neurons reaching the target neuron. By meta analysis of neurophysiological data we found that this is the case for cortico-cortical inputs to neurons in visual cortex. We suggest that the organization of V1 cortico-cortical synaptic inputs allows optimal information transmission.

3 0.17269127 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

Author: Rajesh P. Rao

Abstract: There is growing evidence from psychophysical and neurophysiological studies that the brain utilizes Bayesian principles for inference and decision making. An important open question is how Bayesian inference for arbitrary graphical models can be implemented in networks of spiking neurons. In this paper, we show that recurrent networks of noisy integrate-and-fire neurons can perform approximate Bayesian inference for dynamic and hierarchical graphical models. The membrane potential dynamics of neurons is used to implement belief propagation in the log domain. The spiking probability of a neuron is shown to approximate the posterior probability of the preferred state encoded by the neuron, given past inputs. We illustrate the model using two examples: (1) a motion detection network in which the spiking probability of a direction-selective neuron becomes proportional to the posterior probability of motion in a preferred direction, and (2) a two-level hierarchical network that produces attentional effects similar to those observed in visual cortical areas V2 and V4. The hierarchical model offers a new Bayesian interpretation of attentional modulation in V2 and V4.

4 0.15729688 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

Author: Peter M. Todd, Anja Dieckmann

Abstract: Simple lexicographic decision heuristics that consider cues one at a time in a particular order and stop searching for cues as soon as a decision can be made have been shown to be both accurate and frugal in their use of information. But much of the simplicity and success of these heuristics comes from using an appropriate cue order. For instance, the Take The Best heuristic uses validity order for cues, which requires considerable computation, potentially undermining the computational advantages of the simple decision mechanism. But many cue orders can achieve good decision performance, and studies of sequential search for data records have proposed a number of simple ordering rules that may be of use in constructing appropriate decision cue orders as well. Here we consider a range of simple cue ordering mechanisms, including tallying, swapping, and move-to-front rules, and show that they can find cue orders that lead to reasonable accuracy and considerable frugality when used with lexicographic decision heuristics.

1 One-Reason Decision Making and Ordered Search

How do we know what information to consider when making a decision? Imagine the problem of deciding which of two objects or options is greater along some criterion, such as which of two cities is larger. We may know various facts about each city, such as whether they have a major sports team or a university or airport. To decide between them, we could weight and sum all the cues we know, or we could use a simpler lexicographic rule to look at one cue at a time in a particular order until we find a cue that discriminates between the options and indicates a choice [1]. Such lexicographic rules are used by people in a variety of decision tasks [2]-[4], and have been shown to be both accurate in their inferences and frugal in the amount of information they consider before making a decision.
For instance, Gigerenzer and colleagues [5] demonstrated the surprising performance of several decision heuristics that stop information search as soon as one discriminating cue is found; because only that cue is used to make the decision, and no integration of information is involved, they called these heuristics “one-reason” decision mechanisms. Given some set of cues that can be looked up to make the decision, these heuristics differ mainly in the search rule that determines the order in which the information is searched. But then the question of what information to consider becomes, how are these search orders determined? Particular cue orders make a difference, as has been shown in research on the Take The Best heuristic (TTB) [6], [7]. TTB consists of three building blocks. (1) Search rule: Search through cues in the order of their validity, a measure of accuracy equal to the proportion of correct decisions made by a cue out of all the times that cue discriminates between pairs of options. (2) Stopping rule: Stop search as soon as one cue is found that discriminates between the two options. (3) Decision rule: Select the option to which the discriminating cue points, that is, the option that has the cue value associated with higher criterion values. The performance of TTB has been tested on several real-world data sets, ranging from professors’ salaries to fish fertility [8], in cross-validation comparisons with other more complex strategies. Across 20 data sets, TTB used on average only a third of the available cues (2.4 out of 7.7), yet still outperformed multiple linear regression in generalization accuracy (71% vs. 68%). The even simpler Minimalist heuristic, which searches through available cues in a random order, was more frugal (using 2.2 cues on average), yet still achieved 65% accuracy. But the fact that the accuracy of Minimalist lagged behind TTB by 6 percentage points indicates that part of the secret of TTB’s success lies in its ordered search. 
Moreover, in laboratory experiments [3], [4], [9], people using lexicographic decision strategies have been shown to employ cue orders based on the cues’ validities or a combination of validity and discrimination rate (proportion of decision pairs on which a cue discriminates between the two options). Thus, the cue order used by a lexicographic decision mechanism can make a considerable difference in accuracy; the same holds true for frugality, as we will see. But constructing an exact validity order, as used by Take The Best, takes considerable information and computation [10]. If there are N known objects to make decisions over, and C cues known for each object, then each of the C cues must be evaluated for whether it discriminates correctly (counting up R right decisions), incorrectly (W wrong decisions), or does not discriminate between each of the N·(N-1)/2 possible object pairs, yielding C·N·(N-1)/2 checks to perform to gather the information needed to compute cue validities (v = R/(R+W)) in this domain. But a decision maker typically does not know all of the objects to be decided upon, nor even all the cue values for those objects, ahead of time—is there any simpler way to find an accurate and frugal cue order? In this paper, we address this question through simulation-based comparison of a variety of simple cue-order-learning rules. Hope comes from two directions: first, there are many cue orders besides the exact validity ordering that can yield good performance; and second, research in computer science has demonstrated the efficacy of a range of simple ordering rules for a closely related search problem. Consequently, we find that simple mechanisms at the cue-order-learning stage can enable simple mechanisms at the decision stage, such as lexicographic one-reason decision heuristics, to perform well. 
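The quantities described above (cue validity v = R/(R+W), the number of pairwise checks C·N·(N-1)/2, and the Take The Best search, stopping and decision rules) can be sketched in a few lines. The function names, the binary cue coding and the toy data are hypothetical, and the sketch assumes that a cue value of 1 points to the object with the higher criterion value.

```python
def cue_validity(cue_values, criterion):
    """Validity v = R / (R + W) of one binary cue over all object pairs.

    cue_values[k] : 0/1 value of the cue for object k
    criterion[k]  : criterion value (e.g. city population) for object k
    Returns None if the cue never discriminates.
    """
    right = wrong = 0
    n = len(criterion)
    for a in range(n):
        for b in range(a + 1, n):
            if cue_values[a] == cue_values[b] or criterion[a] == criterion[b]:
                continue                  # cue does not discriminate this pair
            predicted = a if cue_values[a] > cue_values[b] else b
            actual = a if criterion[a] > criterion[b] else b
            if predicted == actual:
                right += 1
            else:
                wrong += 1
    return right / (right + wrong) if right + wrong else None

def n_validity_checks(n_objects, n_cues):
    """Number of cue-by-pair checks: C * N * (N - 1) / 2."""
    return n_cues * n_objects * (n_objects - 1) // 2

def take_the_best(obj_a, obj_b, cue_order):
    """One-reason decision: stop at the first cue that discriminates.

    obj_a, obj_b : dicts mapping cue name -> 0/1
    Returns (choice, cues_looked_up), with choice 'a', 'b' or None.
    """
    for k, cue in enumerate(cue_order, start=1):
        if obj_a[cue] != obj_b[cue]:
            return ('a' if obj_a[cue] > obj_b[cue] else 'b'), k
    return None, len(cue_order)

v = cue_validity([1, 1, 0, 0], [40, 10, 30, 20])
choice = take_the_best({'university': 1, 'team': 0},
                       {'university': 1, 'team': 1},
                       ['university', 'team'])
```

The second element of the result from `take_the_best` is the number of cues looked up, which is the frugality measure discussed in the text.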
2 Simple approaches to constructing cue search orders

To compare different cue ordering rules, we evaluate the performance of different cue orders when used by a one-reason decision heuristic within a particular well-studied sample domain: large German cities, compared on the criterion of population size using 9 cues ranging from having a university to the presence of an intercity train line [6], [7]. Examining this domain makes it clear that there are many good possible cue orders. When used with one-reason stopping and decision building blocks, the mean accuracy of the 362,880 (9!) cue orders is 70%, equivalent to the performance expected from Minimalist. The accuracy of the validity order, 74.2%, falls toward the upper end of the accuracy range (62-75.8%), but there are still 7421 cue orders that do better than the validity order. The frugality of the search orders ranges from 2.53 cues per decision to 4.67, with a mean of 3.34 corresponding to using Minimalist; TTB has a frugality of 4.23, implying that most orders are more frugal. Thus, there are many accurate and frugal cue orders that could be found; a satisficing decision maker not requiring optimal performance need only land on one. An ordering problem of this kind has been studied in computer science for nearly four decades, and can provide us with a set of potential heuristics to test. Consider the case of a set of data records arranged in a list, each of which will be required during a set of retrievals with a particular probability $p_i$. On each retrieval, a key is given (e.g. a record’s title) and the list is searched from the front to the end until the desired record, matching that key, is found. The goal is to minimize the mean search time for accessing the records in this list, for which the optimal ordering is in decreasing order of $p_i$. But if these retrieval probabilities are not known ahead of time, how can the list be ordered after each successive retrieval to achieve fast access?
This is the problem of self-organizing sequential search [11], [12]. A variety of simple sequential search heuristics have been proposed for this problem, centering on three main approaches: (1) transpose, in which a retrieved record is moved one position closer to the front of the list (i.e., swapping with the record in front of it); (2) move-to-front (MTF), in which a retrieved record is put at the front of the list, and all other records remain in the same relative order; and (3) count, in which a tally is kept of the number of times each record is retrieved, and the list is reordered in decreasing order of this tally after each retrieval. Because count rules require storing additional information, more attention has focused on the memory-free transposition and MTF rules. Analytic and simulation results (reviewed in [12]) have shown that while transposition rules can come closer to the optimal order asymptotically, in the short run MTF rules converge more quickly (as can count rules). This may make MTF (and count) rules more appealing as models of cue order learning by humans facing small numbers of decision trials. Furthermore, MTF rules are more responsive to local structure in the environment (e.g., clumped retrievals over time of a few records), and transposition can result in very poor performance under some circumstances (e.g., when neighboring pairs of “popular” records get trapped at the end of the list by repeatedly swapping places). Note, however, that there are important differences between the self-organizing sequential search problem and the cue-ordering problem we address here. In particular, when a record is sought that matches a particular key, search proceeds until the correct record is found. In contrast, when a decision is made lexicographically and the list of cues is searched through, there is no one “correct” cue to find: each cue may or may not discriminate (allow a decision to be made).
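The three list-reordering heuristics named above (transpose, move-to-front, count) can be sketched in a few lines. This is a minimal illustration only: the function names, the use of a plain Python list as the record list, and the tie handling in the count rule (ties keep their current relative order) are assumptions made for this sketch, not details taken from [11] or [12].

```python
from collections import Counter

def transpose(records, key):
    """Retrieve `key`, then swap it with the record directly in front of it."""
    i = records.index(key)
    if i > 0:
        records[i - 1], records[i] = records[i], records[i - 1]
    return i + 1  # search cost: number of records inspected

def move_to_front(records, key):
    """Retrieve `key`, then put it first; the other records keep their relative order."""
    i = records.index(key)
    records.insert(0, records.pop(i))
    return i + 1

def count_rule(records, key, tally):
    """Retrieve `key`, bump its retrieval tally, then reorder by decreasing tally."""
    i = records.index(key)
    tally[key] += 1
    # list.sort is stable, so records with equal tallies keep their current order
    # (an assumption of this sketch).
    records.sort(key=lambda r: -tally[r])
    return i + 1

records = ["a", "b", "c", "d"]
transpose(records, "c")      # records is now ["a", "c", "b", "d"]
move_to_front(records, "d")  # records is now ["d", "a", "c", "b"]
```

Only the count rule needs state beyond the list itself (the `tally` counter), which is the extra memory cost the text refers to.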
Furthermore, once a discriminating cue is found, it may not even make the right decision. Thus, given feedback about whether a decision was right or wrong, a discriminating cue could potentially be moved up or down in the ordered list. This dissociation between making a decision or not (based on the cue discrimination rates), and making a right or wrong decision (based on the cue validities), means that there are two ordering criteria in this problem—frugality and accuracy—as opposed to the single order—search time—for records based on their retrieval probability pi . Because record search time corresponds to cue frugality, the heuristics that work well for the self-organizing sequential search task are likely to produce orders that emphasize frugality (reflecting cue discrimination rates) over accuracy in the cue-ordering task. Nonetheless, these heuristics offer a useful starting point for exploring cue-ordering rules. 2.1 The cue-ordering rules We focus on search order construction processes that are psychologically plausible by being frugal both in terms of information storage and in terms of computation. The decision situation we explore is different from the one assumed by Juslin and Persson [10] who strongly differentiate learning about objects from later making decisions about them. Instead we assume a learning-while-doing situation, consisting of tasks that have to be done repeatedly with feedback after each trial about the adequacy of one’s decision. For instance, we can observe on multiple occasions which of two supermarket checkout lines, the one we have chosen or (more likely) another one, is faster, and associate this outcome with cues including the lines’ lengths and the ages of their respective cashiers. In such situations, decision makers can learn about the differential usefulness of cues for solving the task via the feedback received over time. 
We compare several explicitly defined ordering rules that construct cue orders for use by lexicographic decision mechanisms applied to a particular probabilistic inference task: forced choice paired comparison, in which a decision maker has to infer which of two objects, each described by a set of binary cues, is “bigger” on a criterion, which is just the task for which TTB was formulated. After an inference has been made, feedback is given about whether a decision was right or wrong. Therefore, the order-learning algorithm has information about which cues were looked up, whether a cue discriminated, and whether a discriminating cue led to the right or wrong decision. The rules we propose differ in which pieces of information they use and how they use them. We classify the learning rules based on their memory requirement (high versus low) and their computational requirements in terms of full or partial reordering (see Table 1).

Table 1: Learning rules classified by memory and computational requirements

High memory load, complete reordering:
  Validity: reorders cues based on their current validity
  Tally: reorders cues by number of correct minus incorrect decisions made so far
  Associative/delta rule: reorders cues by learned association strength

High memory load, local reordering:
  Tally swap: moves cue up (down) one position if it has made a correct (incorrect) decision and its tally of correct minus incorrect decisions is ≥ (≤) that of the next higher (lower) cue

Low memory load, local reordering:
  Simple swap: moves cue up one position after correct decision, and down after an incorrect decision
  Move-to-front (2 forms):
    Take The Last (TTL): moves discriminating cue to front
    TTL-correct: moves cue to front only if it correctly discriminates

The validity rule, a type of count rule, is the most demanding of the rules we consider in terms of both memory requirements and computational complexity.
It keeps a count of all discriminations made by a cue so far (in all the times that the cue was looked up) and a separate count of all the correct discriminations. Therefore, memory load is comparatively high. The validity of each cue is determined by dividing its current correct discrimination count by its total discrimination count. Based on these values computed after each decision, the rule reorders the whole set of cues from highest to lowest validity. The tally rule only keeps one count per cue, storing the number of correct decisions made by that cue so far minus the number of incorrect decisions. If a cue discriminates correctly on a given trial, one point is added to its tally, if it leads to an incorrect decision, one point is subtracted. The tally rule is less demanding in terms of memory and computation: Only one count is kept, no division is required. The simple swap rule uses the transposition rather than count approach. This rule has no memory of cue performance other than an ordered list of all cues, and just moves a cue up one position in this list whenever it leads to a correct decision, and down if it leads to an incorrect decision. In other words, a correctly deciding cue swaps positions with its nearest neighbor upwards in the cue order, and an incorrectly deciding cue swaps positions with its nearest neighbor downwards. The tally swap rule is a hybrid of the simple swap rule and the tally rule. It keeps a tally of correct minus incorrect discriminations per cue so far (so memory load is high) but only locally swaps cues: When a cue makes a correct decision and its tally is greater than or equal to that of its upward neighbor, the two cues swap positions. When a cue makes an incorrect decision and its tally is smaller than or equal to that of its downward neighbor, the two cues also swap positions. We also evaluate two types of move-to-front rules. 
First, the Take The Last (TTL) rule moves the last discriminating cue (that is, whichever cue was found to discriminate for the current decision) to the front of the order. This is equivalent to the Take The Last heuristic [6], [7], which uses a memory of cues that discriminated in the past to determine cue search order for subsequent decisions. Second, TTL-correct moves the last discriminating cue to the front of the order only if it correctly discriminated; otherwise, the cue order remains unchanged. This rule thus takes accuracy as well as frugality into account. Finally, we include an associative learning rule that uses the delta rule to update cue weights according to whether they make correct or incorrect discriminations, and then reorders all cues in decreasing order of this weight after each decision. This corresponds to a simple network with nine input units encoding the difference in cue value between the two objects (A and B) being decided on (i.e., $in_i = 1$ if $cue_i(A) > cue_i(B)$, $in_i = -1$ if $cue_i(A) < cue_i(B)$, and $in_i = 0$ if $cue_i(A) = cue_i(B)$ or $cue_i$ was not checked) and with one output unit whose target value encodes the correct decision ($t = 1$ if criterion(A) > criterion(B), otherwise $t = -1$), and with the weights between inputs and output updated according to $\Delta w_i = lr \cdot (t - in_i \cdot w_i) \cdot in_i$ with learning rate $lr = 0.1$. We expect this rule to behave similarly to Oliver’s rule initially (moving a cue to the front of the list by giving it the largest weight when weights are small) and to swap later on (moving cues only a short distance once weights are larger).

3 Simulation Study of Simple Ordering Rules

To test the performance of these order learning rules, we use the German cities data set [6], [7], consisting of the 83 largest-population German cities (those with more than 100,000 inhabitants), described on 9 cues that give some information about population size. Discrimination rate and validity of the cues are negatively correlated (r = -.47).
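The local reordering rules and the delta-rule update described in Section 2.1 can be sketched as follows. This is an illustrative reading of the verbal descriptions, not the authors' simulation code: all function names are hypothetical, the tally swap sketch assumes the tally is updated before it is compared with the neighbor's, and the delta update assumes the weight change is added to the current weight.

```python
from collections import Counter

def simple_swap(order, cue, correct):
    """Move `cue` up one position after a correct decision, down after an incorrect one."""
    i = order.index(cue)
    j = i - 1 if correct else i + 1
    if 0 <= j < len(order):  # a cue at either end of the list cannot move further
        order[i], order[j] = order[j], order[i]

def tally_swap(order, tally, cue, correct):
    """Update the correct-minus-incorrect tally, then swap locally if the tally allows it."""
    tally[cue] += 1 if correct else -1  # assumption: tally is updated before comparing
    i = order.index(cue)
    if correct and i > 0 and tally[cue] >= tally[order[i - 1]]:
        order[i - 1], order[i] = order[i], order[i - 1]
    elif not correct and i < len(order) - 1 and tally[cue] <= tally[order[i + 1]]:
        order[i], order[i + 1] = order[i + 1], order[i]

def take_the_last(order, cue, correct, only_if_correct=False):
    """TTL: move the discriminating cue to the front; TTL-correct: only if it was right."""
    if only_if_correct and not correct:
        return
    order.insert(0, order.pop(order.index(cue)))

def delta_update(weights, inputs, target, lr=0.1):
    """One delta-rule step per cue, w_i += lr * (t - in_i * w_i) * in_i; returns the new order."""
    for cue, x in inputs.items():  # x is in_i: +1, -1, or 0 (equal or not checked)
        weights[cue] += lr * (target - x * weights[cue]) * x
    return sorted(weights, key=lambda c: -weights[c])

order = ["x", "y", "z"]
tally = Counter({"x": 2})
tally_swap(order, tally, "y", True)  # tally of y is 1 < 2, so no swap
tally_swap(order, tally, "y", True)  # tally of y is 2 >= 2, so y moves above x
```

Because an input of 0 leaves a weight unchanged, cues that were never looked up on a trial are not updated by `delta_update`, which matches the coding of unchecked cues in the text.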
We present results averaged over 10,000 learning trials for each rule, starting from random initial cue orders. Each trial consisted of 100 decisions between randomly selected decision pairs. For each decision, the current cue order was used to look up cues until a discriminating cue was found, which was used to make the decision (employing a one-reason or lexicographic decision strategy). After each decision, the cue order was updated using the particular order-learning rule. We start by considering the cumulative accuracies (i.e., online or amortized performance; see [12]) of the rules, defined as the total percentage of correct decisions made so far at any point in the learning process. The contrasting measure of offline accuracy (how well the current learned cue order would do if it were applied to the entire test set) will be subsequently reported (see Figure 1). For all but the move-to-front rules, cumulative accuracies soon rise above that of the Minimalist heuristic (proportion correct = .70), which looks up cues in random order and thus serves as a lower benchmark. However, at least throughout the first 100 decisions, cumulative accuracies stay well below the (offline) accuracy that would be achieved by using TTB for all decisions (proportion correct = .74), looking up cues in the true order of their ecological validities. Except for the move-to-front rules, whose cumulative accuracies are very close to Minimalist (mean proportion correct in 100 decisions: TTL: .701; TTL-correct: .704), all learning rules perform on a surprisingly similar level, with less than one percentage point difference in favor of the most demanding rule (i.e., delta rule: .719) compared to the least (i.e., simple swap: .711; for comparison: tally swap: .715; tally: .716; validity learning rule: .719).
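The decision step used in these simulations, and the offline accuracy and frugality measures, can be sketched as below. The data layout (objects as dictionaries of binary cue values), the function names, the toy objects, and the choice to credit non-discriminated pairs with the expected value of a guess (0.5) are assumptions of this sketch; it does not reproduce the German cities data or the reported figures.

```python
def one_reason_decide(order, obj_a, obj_b):
    """Look up cues in `order` until one discriminates.

    Returns (choice, cue, lookups), where choice is "A", "B", or None
    if no cue in the order discriminates between the two objects.
    """
    for n, cue in enumerate(order, start=1):
        va, vb = obj_a[cue], obj_b[cue]
        if va != vb:
            return ("A" if va > vb else "B"), cue, n
    return None, None, len(order)

def offline_performance(order, pairs):
    """Mean accuracy and mean number of cues looked up for a fixed cue order.

    `pairs` holds (obj_a, obj_b, a_is_bigger) triples.
    """
    score = 0.0
    lookups = 0
    for obj_a, obj_b, a_is_bigger in pairs:
        choice, _, n = one_reason_decide(order, obj_a, obj_b)
        lookups += n
        if choice is None:
            score += 0.5  # assumption: a guess is right half the time
        elif (choice == "A") == a_is_bigger:
            score += 1.0
    return score / len(pairs), lookups / len(pairs)

# Hypothetical toy objects with two binary cues.
city_a = {"university": 1, "train": 1}
city_b = {"university": 1, "train": 0}
print(one_reason_decide(["university", "train"], city_a, city_b))  # ('A', 'train', 2)
```

An online learning run would call `one_reason_decide` once per decision and then pass the discriminating cue and the feedback to one of the order-update rules.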
Offline accuracies are slightly higher, again with the exception of the move-to-front rules (TTL: .699; TTL-correct: .702; simple swap: .714; tally swap: .719; tally: .721; validity learning rule: .724; delta rule: .725; see Figure 1). In longer runs (10,000 decisions) the validity learning rule is able to converge on TTB’s accuracy, but the tally rule’s performance changes little (to .73).

Figure 1: Mean offline accuracy of order learning rules

Figure 2: Mean offline frugality of order learning rules

All learning rules are, however, more frugal than TTB, and even more frugal than Minimalist, both in terms of online as well as offline frugality. Let us focus on their offline frugality (see Figure 2): On average, the rules look up fewer cues than Minimalist before reaching a decision. There is little difference between the associative rule, the tallying rules and the swapping rules (mean number of cues looked up in 100 decisions: delta rule: 3.20; validity learning rule: 3.21; tally: 3.01; tally swap: 3.04; simple swap: 3.13). Most frugal are the two move-to-front rules (TTL-correct: 2.87; TTL: 2.83). Consistent with this finding, all of the learning rules lead to cue orders that show positive correlations with the discrimination rate cue order (reaching the following values after 100 decisions: validity learning rule: r = .18; tally: r = .29; tally swap: r = .24; simple swap: r = .18; TTL-correct: r = .48; TTL: r = .56). This means that cues that often lead to discriminations are more likely to end up in the first positions of the order. This is especially true for the move-to-front rules. In contrast, the cue orders resulting from all learning rules but the validity learning rule do not correlate or correlate negatively with the validity cue order, and even the correlations of the cue orders resulting from the validity learning rule after 100 decisions only reach an average r = .12.
But why would the discrimination rates of cues exert more of a pull on cue order than validity, even when the validity learning rule is applied? As mentioned earlier, this is what we would expect for the move-to-front rules, but it was unexpected for the other rules. Part of the explanation comes from the fact that in the city data set we used for the simulations, validity and discrimination rate of cues are negatively correlated. Having a low discrimination rate means that a cue has little chance to be used and hence to demonstrate its high validity. Whatever learning rule is used, if such a cue is displaced downward to the lower end of the order by other cues, it may have few chances to escape to the higher ranks where it belongs. The problem is that when a decision pair is finally encountered for which that cue would lead to a correct decision, it is unlikely to be checked because other, more discriminating although less valid, cues are looked up before and already bring about a decision. Thus, because one-reason decision making is intertwined with the learning mechanism and so influences which cues can be learned about, what mainly makes a cue come early in the order is producing a high number of correct decisions and not so much a high ratio of correct discriminations to total discriminations regardless of base rates. This argument indicates that performance may differ in environments where cue validities and discrimination rates correlate positively. We tested the learning rules on one such data set (r=.52) of mammal species life expectancies, predicted from 9 cues. It also differs from the cities environment with a greater difference between TTB’s and Minimalist’s performance (6.5 vs. 4 percentage points). In terms of offline accuracy, the validity learning rule now indeed more closely approaches TTB’s accuracy after 100 decisions (.773 vs. 
.782). The tally rule, in contrast, behaves very much as in the cities environment, reaching an accuracy of .752, halfway between TTB and Minimalist (accuracy = .716). Thus only some learning rules can profit from the positive correlation.

4 Discussion

Most of the simpler cue order learning rules we have proposed do not fall far behind a validity learning rule in accuracy, and although the move-to-front rules cannot beat the accuracy achieved if cues were selected randomly, they compensate for this failure by being highly frugal. Interestingly, the rules that do achieve higher accuracy than Minimalist also beat random cue selection in terms of frugality. On the other hand, all rules, even the delta rule and the validity learning rule, stay below TTB’s accuracy across a relatively high number of decisions. But often it is necessary to make good decisions without much experience. Therefore, learning rules should be preferred that quickly lead to orders with good performance. The relatively complex rules with relatively high memory requirements, i.e., the delta and the validity learning rules, but also the tally learning rule, rise in accuracy more quickly than the rules with lower requirements. The tally rule in particular thus represents a good compromise between cost, correctness and psychological plausibility considerations. Remember that the rules based on tallies assume full memory of all correct minus incorrect decisions made by a cue so far. But this does not make the rule implausible, at least from a psychological perspective, even though computer scientists were reluctant to adopt such counting approaches because of their extra memory requirements. There is considerable evidence that people are actually very good at remembering the frequencies of events.
Hasher and Zacks [13] conclude from a wide range of studies that frequencies are encoded in an automatic way, implying that people are sensitive to this information without intention or special effort. Estes [14] pointed out the role frequencies play in decision making as a shortcut for probabilities. Further, the tally rule and the tally swap rule are comparatively simple, not having to keep track of base rates or perform divisions as does the validity rule. Conversely, the simple swap and move-to-front rules may not be much simpler, because storing a cue order may be about as demanding as storing a set of tallies. We have run experiments (reported elsewhere) in which indeed the tally swap rule best accounts for people’s actual processes of ordering cues. Our goal in this paper was to explore how well simple cue-ordering rules could work in conjunction with lexicographic decision strategies. This is important because it is necessary to take into account the set-up costs of a heuristic in addition to its application costs when considering the mechanism’s overall simplicity. As the example of the validity search order of TTB shows, what is easy to apply may not necessarily be so easy to set up. But simple rules can also be at work in the construction of a heuristic’s building blocks. We have proposed such rules for the construction of one building block, the search order. Simple learning rules inspired by research in computer science can enable a one-reason decision heuristic to perform only slightly worse than if it had full knowledge of cue validities from the very beginning.
Giving up the assumption of full a priori knowledge for the slight decrease in accuracy seems like a reasonable bargain: Through the addition of learning rules, one-reason decision heuristics might lose some of their appeal to decision theorists who were surprised by the performance of such simple mechanisms compared to more complex algorithms, but they gain psychological plausibility and so become more attractive as explanations for human decision behavior. References [1] Fishburn, P.C. (1974). Lexicographic orders, utilities and decision rules: A survey. Management Science, 20, 1442-1471. [2] Payne, J.W., Bettman, J.R., & Johnson, E.J. (1993). The adaptive decision maker. New York: Cambridge University Press. [3] Bröder, A. (2000). Assessing the empirical validity of the “Take-The-Best” heuristic as a model of human probabilistic inference. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26 (5), 1332-1346. [4] Bröder, A. (2003). Decision making with the “adaptive toolbox”: Influence of environmental structure, intelligence, and working memory load. Journal of Experimental Psychology: Learning, Memory, & Cognition, 29, 611-625. [5] Gigerenzer, G., Todd, P.M., & The ABC Research Group (1999). Simple heuristics that make us smart. New York: Oxford University Press. [6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103 (4), 650-669. [7] Gigerenzer, G., & Goldstein, D.G. (1999). Betting on one good reason: The Take The Best Heuristic. In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press. [8] Czerlinski, J., Gigerenzer, G., & Goldstein, D.G. (1999). How good are simple heuristics? In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press. [9] Newell, B.R., & Shanks, D.R. (2003). Take the best or look at the rest? 
Factors influencing ‘one-reason’ decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 53-65. [10] Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): a “lazy” algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607. [11] Rivest, R. (1976). On self-organizing sequential search heuristics. Communications of the ACM, 19(2), 63-67. [12] Bentley, J.L., & McGeoch, C.C. (1985). Amortized analyses of self-organizing sequential search heuristics. Communications of the ACM, 28(4), 404-411. [13] Hasher, L., & Zacks, R.T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372-1388. [14] Estes, W.K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.

5 0.1402999 33 nips-2004-Brain Inspired Reinforcement Learning

Author: François Rivest, Yoshua Bengio, John Kalaska

Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task.

6 0.11983059 28 nips-2004-Bayesian inference in spiking neurons

7 0.11663673 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

8 0.10490578 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

9 0.099715181 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

10 0.096157245 148 nips-2004-Probabilistic Computation in Spiking Populations

11 0.09104725 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception

12 0.084146626 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

13 0.071250424 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces

14 0.06956549 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

15 0.064373896 193 nips-2004-Theories of Access Consciousness

16 0.060818747 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

17 0.060384598 99 nips-2004-Learning Hyper-Features for Visual Identification

18 0.059393164 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

19 0.058615956 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

20 0.058192104 175 nips-2004-Stable adaptive control with online learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.194), (1, -0.197), (2, 0.031), (3, -0.011), (4, 0.058), (5, -0.004), (6, 0.099), (7, -0.009), (8, 0.05), (9, -0.085), (10, -0.013), (11, 0.015), (12, -0.026), (13, -0.184), (14, 0.16), (15, -0.037), (16, -0.055), (17, -0.006), (18, 0.033), (19, 0.064), (20, 0.082), (21, 0.024), (22, 0.126), (23, -0.296), (24, 0.059), (25, -0.052), (26, 0.043), (27, -0.144), (28, 0.193), (29, 0.13), (30, 0.087), (31, -0.073), (32, 0.005), (33, -0.026), (34, 0.025), (35, 0.111), (36, 0.125), (37, -0.031), (38, 0.005), (39, -0.074), (40, -0.127), (41, 0.028), (42, 0.001), (43, 0.056), (44, 0.042), (45, -0.115), (46, 0.081), (47, -0.077), (48, 0.026), (49, -0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96613133 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

Author: Angela J. Yu, Peter Dayan

Abstract: We study the synthesis of neural coding, selective attention and perceptual decision making. A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and top-down attentional priors, leading to sound perceptual discrimination. The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). The network’s intermediate levels of representation instantiate known physiological properties of visual cortical neurons. The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty.

2 0.70664084 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity

Author: Marcelo A. Montemurro, Stefano Panzeri

Abstract: A typical neuron in visual cortex receives most inputs from other cortical neurons with a roughly similar stimulus preference. Does this arrangement of inputs allow efficient readout of sensory information by the target cortical neuron? We address this issue by using simple modelling of neuronal population activity and information theoretic tools. We find that efficient synaptic information transmission requires that the tuning curve of the afferent neurons is approximately as wide as the spread of stimulus preferences of the afferent neurons reaching the target neuron. By meta analysis of neurophysiological data we found that this is the case for cortico-cortical inputs to neurons in visual cortex. We suggest that the organization of V1 cortico-cortical synaptic inputs allows optimal information transmission.

3 0.67213845 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

Author: Peter M. Todd, Anja Dieckmann

Abstract: Simple lexicographic decision heuristics that consider cues one at a time in a particular order and stop searching for cues as soon as a decision can be made have been shown to be both accurate and frugal in their use of information. But much of the simplicity and success of these heuristics comes from using an appropriate cue order. For instance, the Take The Best heuristic uses validity order for cues, which requires considerable computation, potentially undermining the computational advantages of the simple decision mechanism. But many cue orders can achieve good decision performance, and studies of sequential search for data records have proposed a number of simple ordering rules that may be of use in constructing appropriate decision cue orders as well. Here we consider a range of simple cue ordering mechanisms, including tallying, swapping, and move-to-front rules, and show that they can find cue orders that lead to reasonable accuracy and considerable frugality when used with lexicographic decision heuristics.

1 One-Reason Decision Making and Ordered Search

How do we know what information to consider when making a decision? Imagine the problem of deciding which of two objects or options is greater along some criterion, such as which of two cities is larger. We may know various facts about each city, such as whether they have a major sports team or a university or airport. To decide between them, we could weight and sum all the cues we know, or we could use a simpler lexicographic rule to look at one cue at a time in a particular order until we find a cue that discriminates between the options and indicates a choice [1]. Such lexicographic rules are used by people in a variety of decision tasks [2]-[4], and have been shown to be both accurate in their inferences and frugal in the amount of information they consider before making a decision.
For instance, Gigerenzer and colleagues [5] demonstrated the surprising performance of several decision heuristics that stop information search as soon as one discriminating cue is found; because only that cue is used to make the decision, and no integration of information is involved, they called these heuristics “one-reason” decision mechanisms. Given some set of cues that can be looked up to make the decision, these heuristics differ mainly in the search rule that determines the order in which the information is searched. But then the question of what information to consider becomes, how are these search orders determined? Particular cue orders make a difference, as has been shown in research on the Take The Best heuristic (TTB) [6], [7]. TTB consists of three building blocks. (1) Search rule: Search through cues in the order of their validity, a measure of accuracy equal to the proportion of correct decisions made by a cue out of all the times that cue discriminates between pairs of options. (2) Stopping rule: Stop search as soon as one cue is found that discriminates between the two options. (3) Decision rule: Select the option to which the discriminating cue points, that is, the option that has the cue value associated with higher criterion values. The performance of TTB has been tested on several real-world data sets, ranging from professors’ salaries to fish fertility [8], in cross-validation comparisons with other more complex strategies. Across 20 data sets, TTB used on average only a third of the available cues (2.4 out of 7.7), yet still outperformed multiple linear regression in generalization accuracy (71% vs. 68%). The even simpler Minimalist heuristic, which searches through available cues in a random order, was more frugal (using 2.2 cues on average), yet still achieved 65% accuracy. But the fact that the accuracy of Minimalist lagged behind TTB by 6 percentage points indicates that part of the secret of TTB’s success lies in its ordered search. 
Moreover, in laboratory experiments [3], [4], [9], people using lexicographic decision strategies have been shown to employ cue orders based on the cues’ validities or a combination of validity and discrimination rate (proportion of decision pairs on which a cue discriminates between the two options). Thus, the cue order used by a lexicographic decision mechanism can make a considerable difference in accuracy; the same holds true for frugality, as we will see. But constructing an exact validity order, as used by Take The Best, takes considerable information and computation [10]. If there are N known objects to make decisions over, and C cues known for each object, then each of the C cues must be evaluated for whether it discriminates correctly (counting up R right decisions), incorrectly (W wrong decisions), or does not discriminate between each of the N·(N-1)/2 possible object pairs, yielding C·N·(N-1)/2 checks to perform to gather the information needed to compute cue validities (v = R/(R+W)) in this domain. But a decision maker typically does not know all of the objects to be decided upon, nor even all the cue values for those objects, ahead of time—is there any simpler way to find an accurate and frugal cue order? In this paper, we address this question through simulation-based comparison of a variety of simple cue-order-learning rules. Hope comes from two directions: first, there are many cue orders besides the exact validity ordering that can yield good performance; and second, research in computer science has demonstrated the efficacy of a range of simple ordering rules for a closely related search problem. Consequently, we find that simple mechanisms at the cue-order-learning stage can enable simple mechanisms at the decision stage, such as lexicographic one-reason decision heuristics, to perform well. 
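The cost of building an exact validity order can be made concrete with a short sketch that performs the C·N·(N-1)/2 checks described above and computes v = R/(R+W) for each cue. The function name, the data layout, and the toy objects are hypothetical, and the sketch assumes that all criterion values are distinct.

```python
from itertools import combinations

def cue_validities(objects, criterion, cues):
    """Compute v = R / (R + W) per cue over all object pairs.

    objects:   list of dicts mapping cue name -> binary cue value
    criterion: list of criterion values (assumed distinct), aligned with `objects`
    Returns (validity per cue, number of pair-by-cue checks performed).
    """
    right = {c: 0 for c in cues}
    wrong = {c: 0 for c in cues}
    checks = 0
    for a, b in combinations(range(len(objects)), 2):  # N*(N-1)/2 pairs
        for c in cues:                                 # C cues per pair
            checks += 1
            va, vb = objects[a][c], objects[b][c]
            if va == vb:
                continue  # cue does not discriminate on this pair
            if (va > vb) == (criterion[a] > criterion[b]):
                right[c] += 1
            else:
                wrong[c] += 1
    validity = {
        c: right[c] / (right[c] + wrong[c]) if right[c] + wrong[c] else None
        for c in cues
    }
    return validity, checks

# Hypothetical toy data: three objects, two binary cues.
objects = [{"u": 1, "t": 1}, {"u": 1, "t": 0}, {"u": 0, "t": 1}]
validity, checks = cue_validities(objects, [3, 2, 1], ["u", "t"])
print(validity, checks)  # {'u': 1.0, 't': 0.5} 6
```

Sorting the cues by decreasing validity then gives the search order that Take The Best assumes is available in advance.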
2 Simple approaches to constructing cue search orders

To compare different cue ordering rules, we evaluate the performance of different cue orders when used by a one-reason decision heuristic within a particular well-studied sample domain: large German cities, compared on the criterion of population size using 9 cues ranging from having a university to the presence of an intercity train line [6], [7]. Examining this domain makes it clear that there are many good possible cue orders. When used with one-reason stopping and decision building blocks, the mean accuracy of the 362,880 (9!) cue orders is 70%, equivalent to the performance expected from Minimalist. The accuracy of the validity order, 74.2%, falls toward the upper end of the accuracy range (62-75.8%), but there are still 7421 cue orders that do better than the validity order. The frugality of the search orders ranges from 2.53 cues per decision to 4.67, with a mean of 3.34 corresponding to using Minimalist; TTB has a frugality of 4.23, implying that most orders are more frugal. Thus, there are many accurate and frugal cue orders that could be found; a satisficing decision maker not requiring optimal performance need only land on one.

An ordering problem of this kind has been studied in computer science for nearly four decades, and can provide us with a set of potential heuristics to test. Consider the case of a set of data records arranged in a list, each of which will be required during a set of retrievals with a particular probability p_i. On each retrieval, a key is given (e.g. a record’s title) and the list is searched from the front to the end until the desired record, matching that key, is found. The goal is to minimize the mean search time for accessing the records in this list, for which the optimal ordering is in decreasing order of p_i. But if these retrieval probabilities are not known ahead of time, how can the list be ordered after each successive retrieval to achieve fast access?
This is the problem of self-organizing sequential search [11], [12]. A variety of simple sequential search heuristics have been proposed for this problem, centering on three main approaches: (1) transpose, in which a retrieved record is moved one position closer to the front of the list (i.e., swapping with the record in front of it); (2) move-to-front (MTF), in which a retrieved record is put at the front of the list, and all other records remain in the same relative order; and (3) count, in which a tally is kept of the number of times each record is retrieved, and the list is reordered in decreasing order of this tally after each retrieval. Because count rules require storing additional information, more attention has focused on the memory-free transposition and MTF rules. Analytic and simulation results (reviewed in [12]) have shown that while transposition rules can come closer to the optimal order asymptotically, in the short run MTF rules converge more quickly (as can count rules). This may make MTF (and count) rules more appealing as models of cue order learning by humans facing small numbers of decision trials. Furthermore, MTF rules are more responsive to local structure in the environment (e.g., clumped retrievals over time of a few records), and transposition can result in very poor performance under some circumstances (e.g., when neighboring pairs of “popular” records get trapped at the end of the list by repeatedly swapping places).

There are, however, important differences between the self-organizing sequential search problem and the cue-ordering problem we address here. In particular, when a record is sought that matches a particular key, search proceeds until the correct record is found. In contrast, when a decision is made lexicographically and the list of cues is searched through, there is no one “correct” cue to find: each cue may or may not discriminate (allow a decision to be made).
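As an aside, the three list-reordering approaches described above (transpose, move-to-front and count) can be sketched on a plain Python list. The function names are illustrative, and breaking ties in the count rule by keeping the current relative order is an assumption that the description above does not specify.

```python
def transpose(records, key):
    """Move the retrieved record one position closer to the front."""
    i = records.index(key)
    if i > 0:
        records[i - 1], records[i] = records[i], records[i - 1]

def move_to_front(records, key):
    """Put the retrieved record first; all others keep their relative order."""
    records.insert(0, records.pop(records.index(key)))

def count_rule(records, key, tallies):
    """Tally the retrieval, then reorder the list by decreasing tally.

    list.sort is stable, so records with equal tallies keep their
    current relative order (an assumed tie-breaking policy).
    """
    tallies[key] = tallies.get(key, 0) + 1
    records.sort(key=lambda r: -tallies.get(r, 0))
```

Only the count rule needs state beyond the list itself (the `tallies` dictionary), which is the extra memory cost mentioned above.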
Furthermore, once a discriminating cue is found, it may not even make the right decision. Thus, given feedback about whether a decision was right or wrong, a discriminating cue could potentially be moved up or down in the ordered list. This dissociation between making a decision or not (based on the cue discrimination rates), and making a right or wrong decision (based on the cue validities), means that there are two ordering criteria in this problem (frugality and accuracy), as opposed to the single criterion (search time) for records ordered by their retrieval probability p_i. Because record search time corresponds to cue frugality, the heuristics that work well for the self-organizing sequential search task are likely to produce orders that emphasize frugality (reflecting cue discrimination rates) over accuracy in the cue-ordering task. Nonetheless, these heuristics offer a useful starting point for exploring cue-ordering rules.

2.1 The cue-ordering rules

We focus on search order construction processes that are psychologically plausible by being frugal both in terms of information storage and in terms of computation. The decision situation we explore is different from the one assumed by Juslin and Persson [10], who strongly differentiate learning about objects from later making decisions about them. Instead we assume a learning-while-doing situation, consisting of tasks that have to be done repeatedly with feedback after each trial about the adequacy of one’s decision. For instance, we can observe on multiple occasions which of two supermarket checkout lines, the one we have chosen or (more likely) another one, is faster, and associate this outcome with cues including the lines’ lengths and the ages of their respective cashiers. In such situations, decision makers can learn about the differential usefulness of cues for solving the task via the feedback received over time.
We compare several explicitly defined ordering rules that construct cue orders for use by lexicographic decision mechanisms applied to a particular probabilistic inference task: forced choice paired comparison, in which a decision maker has to infer which of two objects, each described by a set of binary cues, is “bigger” on a criterion. This is just the task for which TTB was formulated. After an inference has been made, feedback is given about whether a decision was right or wrong. Therefore, the order-learning algorithm has information about which cues were looked up, whether a cue discriminated, and whether a discriminating cue led to the right or wrong decision. The rules we propose differ in which pieces of information they use and how they use them. We classify the learning rules based on their memory requirement (high versus low) and their computational requirements in terms of full or partial reordering (see Table 1).

Table 1: Learning rules classified by memory and computational requirements

High memory load, complete reordering:
  Validity: reorders cues based on their current validity
  Tally: reorders cues by number of correct minus incorrect decisions made so far
  Associative/delta rule: reorders cues by learned association strength

High memory load, local reordering:
  Tally swap: moves cue up (down) one position if it has made a correct (incorrect) decision and its tally of correct minus incorrect decisions is ≥ (≤) that of the next higher (lower) cue

Low memory load, local reordering:
  Simple swap: moves cue up one position after correct decision, and down after an incorrect decision
  Move-to-front (2 forms):
    Take The Last (TTL): moves discriminating cue to front
    TTL-correct: moves cue to front only if it correctly discriminates

The validity rule, a type of count rule, is the most demanding of the rules we consider in terms of both memory requirements and computational complexity.
It keeps a count of all discriminations made by a cue so far (in all the times that the cue was looked up) and a separate count of all the correct discriminations. Therefore, memory load is comparatively high. The validity of each cue is determined by dividing its current correct discrimination count by its total discrimination count. Based on these values computed after each decision, the rule reorders the whole set of cues from highest to lowest validity. The tally rule only keeps one count per cue, storing the number of correct decisions made by that cue so far minus the number of incorrect decisions. If a cue discriminates correctly on a given trial, one point is added to its tally, if it leads to an incorrect decision, one point is subtracted. The tally rule is less demanding in terms of memory and computation: Only one count is kept, no division is required. The simple swap rule uses the transposition rather than count approach. This rule has no memory of cue performance other than an ordered list of all cues, and just moves a cue up one position in this list whenever it leads to a correct decision, and down if it leads to an incorrect decision. In other words, a correctly deciding cue swaps positions with its nearest neighbor upwards in the cue order, and an incorrectly deciding cue swaps positions with its nearest neighbor downwards. The tally swap rule is a hybrid of the simple swap rule and the tally rule. It keeps a tally of correct minus incorrect discriminations per cue so far (so memory load is high) but only locally swaps cues: When a cue makes a correct decision and its tally is greater than or equal to that of its upward neighbor, the two cues swap positions. When a cue makes an incorrect decision and its tally is smaller than or equal to that of its downward neighbor, the two cues also swap positions. We also evaluate two types of move-to-front rules. 
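The count-based and swap-based rules described so far can be sketched as small update functions acting on a cue order (a list of cue indices, front first). Function names are hypothetical. Two details are assumptions not fixed by the text: the tally swap rule is taken to update its tally before comparing with the neighbor, and the validity rule treats a cue that has not yet discriminated as having validity 0.

```python
def tally_update(tallies, cue, correct):
    """Add one point for a correct decision, subtract one for an incorrect one."""
    tallies[cue] += 1 if correct else -1

def reorder_by_tally(order, tallies):
    """Tally rule: full reordering by correct-minus-incorrect count."""
    order.sort(key=lambda c: -tallies[c])

def reorder_by_validity(order, n_correct, n_discriminations):
    """Validity rule: full reordering by correct / total discriminations."""
    def validity(c):
        # assumed default of 0.0 for cues that have never discriminated
        return n_correct[c] / n_discriminations[c] if n_discriminations[c] else 0.0
    order.sort(key=lambda c: -validity(c))

def _swap(order, i, j):
    order[i], order[j] = order[j], order[i]

def simple_swap(order, cue, correct):
    """Move the cue up one position after a correct decision, down after an error."""
    i = order.index(cue)
    j = i - 1 if correct else i + 1
    if 0 <= j < len(order):
        _swap(order, i, j)

def tally_swap(order, tallies, cue, correct):
    """Swap with a neighbor only if the tallies justify it."""
    tally_update(tallies, cue, correct)      # assumed: update before comparing
    i = order.index(cue)
    if correct and i > 0 and tallies[cue] >= tallies[order[i - 1]]:
        _swap(order, i, i - 1)
    elif not correct and i < len(order) - 1 and tallies[cue] <= tallies[order[i + 1]]:
        _swap(order, i, i + 1)
```

Note that `simple_swap` needs nothing but the order itself, whereas the other rules carry one or two counters per cue.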
First, the Take The Last (TTL) rule moves the last discriminating cue (that is, whichever cue was found to discriminate for the current decision) to the front of the order. This is equivalent to the Take The Last heuristic [6], [7], which uses a memory of cues that discriminated in the past to determine cue search order for subsequent decisions. Second, TTL-correct moves the last discriminating cue to the front of the order only if it correctly discriminated; otherwise, the cue order remains unchanged. This rule thus takes accuracy as well as frugality into account.

Finally, we include an associative learning rule that uses the delta rule to update cue weights according to whether they make correct or incorrect discriminations, and then reorders all cues in decreasing order of this weight after each decision. This corresponds to a simple network with nine input units encoding the difference in cue value between the two objects (A and B) being decided on (i.e., in_i = 1 if cue_i(A) > cue_i(B), in_i = -1 if cue_i(A) < cue_i(B), and in_i = 0 if cue_i(A) = cue_i(B) or cue_i was not checked) and with one output unit whose target value encodes the correct decision (t = 1 if criterion(A) > criterion(B), otherwise -1), and with the weights between inputs and output updated according to Δw_i = lr · (t - in_i·w_i) · in_i with learning rate lr = 0.1. We expect this rule to behave similarly to Oliver’s rule initially (moving a cue to the front of the list by giving it the largest weight when weights are small) and to swap later on (moving cues only a short distance once weights are larger).

3 Simulation Study of Simple Ordering Rules

To test the performance of these order learning rules, we use the German cities data set [6], [7], consisting of the 83 largest-population German cities (those with more than 100,000 inhabitants), described on 9 cues that give some information about population size. Discrimination rate and validity of the cues are negatively correlated (r = -.47).
We present results averaged over 10,000 learning trials for each rule, starting from random initial cue orders. Each trial consisted of 100 decisions between randomly selected decision pairs. For each decision, the current cue order was used to look up cues until a discriminating cue was found, which was used to make the decision (employing a one-reason or lexicographic decision strategy). After each decision, the cue order was updated using the particular order-learning rule.

We start by considering the cumulative accuracies (i.e., online or amortized performance [12]) of the rules, defined as the total percentage of correct decisions made so far at any point in the learning process. The contrasting measure of offline accuracy (how well the current learned cue order would do if it were applied to the entire test set) will be subsequently reported (see Figure 1). For all but the move-to-front rules, cumulative accuracies soon rise above that of the Minimalist heuristic (proportion correct = .70), which looks up cues in random order and thus serves as a lower benchmark. However, at least throughout the first 100 decisions, cumulative accuracies stay well below the (offline) accuracy that would be achieved by using TTB for all decisions (proportion correct = .74), looking up cues in the true order of their ecological validities. Except for the move-to-front rules, whose cumulative accuracies are very close to Minimalist (mean proportion correct in 100 decisions: TTL: .701; TTL-correct: .704), all learning rules perform on a surprisingly similar level, with less than one percentage point difference in favor of the most demanding rule (i.e., delta rule: .719) compared to the least (i.e., simple swap: .711; for comparison: tally swap: .715; tally: .716; validity learning rule: .719).
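The structure of one learning trial, as described above, can be sketched as follows, here with the associative (delta) rule as the order-learning rule. This is an illustration under stated assumptions rather than the original simulation code: the objects are synthetic (the German cities data are not reproduced here), the delta update is read as an increment to the weight of the discriminating cue, a random guess is made when no cue discriminates, and all names are hypothetical.

```python
import random

def make_objects(n=30, n_cues=5, seed=1):
    """Synthetic stand-in data: binary cues loosely related to a criterion."""
    rng = random.Random(seed)
    objects = []
    for _ in range(n):
        cues = tuple(rng.randint(0, 1) for _ in range(n_cues))
        objects.append((sum(cues) + rng.random(), cues))
    return objects

def run_trial(objects, n_decisions=100, lr=0.1, seed=0):
    """Return (cumulative accuracy, mean cues looked up, final cue order)."""
    rng = random.Random(seed)
    n_cues = len(objects[0][1])
    order = list(range(n_cues))
    rng.shuffle(order)                           # random initial cue order
    weights = [0.0] * n_cues
    n_correct = n_lookups = 0
    for _ in range(n_decisions):
        (crit_a, cues_a), (crit_b, cues_b) = rng.sample(objects, 2)
        target = 1 if crit_a > crit_b else -1    # t in the delta rule
        choice, used = rng.choice((1, -1)), None # guess if nothing discriminates
        for looked_up, cue in enumerate(order, start=1):
            if cues_a[cue] != cues_b[cue]:       # one-reason stopping rule
                choice, used = (1 if cues_a[cue] > cues_b[cue] else -1), cue
                break
        n_lookups += looked_up
        n_correct += choice == target
        if used is not None:
            # in_i equals `choice` for the discriminating cue and 0 for all others,
            # so only this one weight changes
            weights[used] += lr * (target - choice * weights[used]) * choice
            order.sort(key=lambda c: -weights[c])  # full reordering by weight
    return n_correct / n_decisions, n_lookups / n_decisions, order
```

Averaging the first two return values over many seeds would give the online accuracy and frugality measures discussed in the text; swapping the update step for another rule changes only the two lines after `if used is not None`.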
Offline accuracies are slightly higher, again with the exception of the move-to-front rules (TTL: .699; TTL-correct: .702; simple swap: .714; tally swap: .719; tally: .721; validity learning rule: .724; delta rule: .725; see Figure 1). In longer runs (10,000 decisions) the validity learning rule is able to converge on TTB’s accuracy, but the tally rule’s performance changes little (to .73).

Figure 1: Mean offline accuracy of order learning rules

Figure 2: Mean offline frugality of order learning rules

All learning rules are, however, more frugal than TTB, and even more frugal than Minimalist, both in terms of online as well as offline frugality. Let us focus on their offline frugality (see Figure 2): On average, the rules look up fewer cues than Minimalist before reaching a decision. There is little difference between the associative rule, the tallying rules and the swapping rules (mean number of cues looked up in 100 decisions: delta rule: 3.20; validity learning rule: 3.21; tally: 3.01; tally swap: 3.04; simple swap: 3.13). Most frugal are the two move-to-front rules (TTL-correct: 2.87; TTL: 2.83). Consistent with this finding, all of the learning rules lead to cue orders that show positive correlations with the discrimination rate cue order (reaching the following values after 100 decisions: validity learning rule: r = .18; tally: r = .29; tally swap: r = .24; simple swap: r = .18; TTL-correct: r = .48; TTL: r = .56). This means that cues that often lead to discriminations are more likely to end up in the first positions of the order. This is especially true for the move-to-front rules. In contrast, the cue orders resulting from all learning rules but the validity learning rule do not correlate or correlate negatively with the validity cue order, and even the correlations of the cue orders resulting from the validity learning rule after 100 decisions only reach an average r = .12.
But why would the discrimination rates of cues exert more of a pull on cue order than validity, even when the validity learning rule is applied? As mentioned earlier, this is what we would expect for the move-to-front rules, but it was unexpected for the other rules. Part of the explanation comes from the fact that in the city data set we used for the simulations, validity and discrimination rate of cues are negatively correlated. Having a low discrimination rate means that a cue has little chance to be used and hence to demonstrate its high validity. Whatever learning rule is used, if such a cue is displaced downward to the lower end of the order by other cues, it may have few chances to escape to the higher ranks where it belongs. The problem is that when a decision pair is finally encountered for which that cue would lead to a correct decision, it is unlikely to be checked because other, more discriminating although less valid, cues are looked up before and already bring about a decision. Thus, because one-reason decision making is intertwined with the learning mechanism and so influences which cues can be learned about, what mainly makes a cue come early in the order is producing a high number of correct decisions and not so much a high ratio of correct discriminations to total discriminations regardless of base rates. This argument indicates that performance may differ in environments where cue validities and discrimination rates correlate positively. We tested the learning rules on one such data set (r=.52) of mammal species life expectancies, predicted from 9 cues. It also differs from the cities environment with a greater difference between TTB’s and Minimalist’s performance (6.5 vs. 4 percentage points). In terms of offline accuracy, the validity learning rule now indeed more closely approaches TTB’s accuracy after 100 decisions (.773 vs. 
.782). The tally rule, in contrast, behaves very much as in the cities environment, reaching an accuracy of .752, halfway between TTB and Minimalist (accuracy = .716). Thus only some learning rules can profit from the positive correlation.

4 Discussion

Most of the simpler cue order learning rules we have proposed do not fall far behind a validity learning rule in accuracy, and although the move-to-front rules cannot beat the accuracy achieved if cues were selected randomly, they compensate for this failure by being highly frugal. Interestingly, the rules that do achieve higher accuracy than Minimalist also beat random cue selection in terms of frugality. On the other hand, all rules, even the delta rule and the validity learning rule, stay below TTB’s accuracy across a relatively high number of decisions. But often it is necessary to make good decisions without much experience. Therefore, learning rules should be preferred that quickly lead to orders with good performance. The relatively complex rules with relatively high memory requirements, i.e., the delta and the validity learning rule, but also the tally learning rule, rise in accuracy more quickly than the rules with lower requirements. The tally rule in particular thus represents a good compromise between cost, correctness and psychological plausibility considerations.

Remember that the rules based on tallies assume full memory of all correct minus incorrect decisions made by a cue so far. But this does not make these rules implausible, at least from a psychological perspective, even though computer scientists were reluctant to adopt such counting approaches because of their extra memory requirements. There is considerable evidence that people are actually very good at remembering the frequencies of events.
Hasher and Zacks [13] conclude from a wide range of studies that frequencies are encoded in an automatic way, implying that people are sensitive to this information without intention or special effort. Estes [14] pointed out the role frequencies play in decision making as a shortcut for probabilities. Further, the tally rule and the tally swap rule are comparatively simple, not having to keep track of base rates or perform divisions as does the validity rule. From the other side, the simple swap and move to front rules may not be much simpler, because storing a cue order may be about as demanding as storing a set of tallies. We have run experiments (reported elsewhere) in which indeed the tally swap rule best accounts for people’s actual processes of ordering cues. Our goal in this paper was to explore how well simple cue-ordering rules could work in conjunction with lexicographic decision strategies. This is important because it is necessary to take into account the set-up costs of a heuristic in addition to its application costs when considering the mechanism’s overall simplicity. As the example of the validity search order of TTB shows, what is easy to apply may not necessarily be so easy to set up. But simple rules can also be at work in the construction of a heuristic’s building blocks. We have proposed such rules for the construction of one building block, the search order. Simple learning rules inspired by research in computer science can enable a one-reason decision heuristic to perform only slightly worse than if it had full knowledge of cue validities from the very beginning. 
Giving up the assumption of full a priori knowledge for the slight decrease in accuracy seems like a reasonable bargain: Through the addition of learning rules, one-reason decision heuristics might lose some of their appeal to decision theorists who were surprised by the performance of such simple mechanisms compared to more complex algorithms, but they gain psychological plausibility and so become more attractive as explanations for human decision behavior.

References

[1] Fishburn, P.C. (1974). Lexicographic orders, utilities and decision rules: A survey. Management Science, 20, 1442-1471.
[2] Payne, J.W., Bettman, J.R., & Johnson, E.J. (1993). The adaptive decision maker. New York: Cambridge University Press.
[3] Bröder, A. (2000). Assessing the empirical validity of the “Take-The-Best” heuristic as a model of human probabilistic inference. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26 (5), 1332-1346.
[4] Bröder, A. (2003). Decision making with the “adaptive toolbox”: Influence of environmental structure, intelligence, and working memory load. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 611-625.
[5] Gigerenzer, G., Todd, P.M., & The ABC Research Group (1999). Simple heuristics that make us smart. New York: Oxford University Press.
[6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103 (4), 650-669.
[7] Gigerenzer, G., & Goldstein, D.G. (1999). Betting on one good reason: The Take The Best Heuristic. In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[8] Czerlinski, J., Gigerenzer, G., & Goldstein, D.G. (1999). How good are simple heuristics? In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[9] Newell, B.R., & Shanks, D.R. (2003). Take the best or look at the rest? Factors influencing ‘one-reason’ decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 53-65.
[10] Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): a “lazy” algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607.
[11] Rivest, R. (1976). On self-organizing sequential search heuristics. Communications of the ACM, 19(2), 63-67.
[12] Bentley, J.L., & McGeoch, C.C. (1985). Amortized analyses of self-organizing sequential search heuristics. Communications of the ACM, 28(4), 404-411.
[13] Hasher, L., & Zacks, R.T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372-1388.
[14] Estes, W.K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.

4 0.55516928 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function. Humans do not perceive visual motion veridically. Various psychophysical experiments have shown that the perceived speed of visual stimuli is affected by stimulus contrast, with low contrast stimuli being perceived to move slower than high contrast ones [1, 2]. Computational models have been suggested that can qualitatively explain these perceptual effects. Commonly, they assume the perception of visual motion to be optimal either within a deterministic framework with a regularization constraint that biases the solution toward zero motion [3, 4], or within a probabilistic framework of Bayesian estimation with a prior that favors slow velocities [5, 6]. The solutions resulting from these two frameworks are similar (and in some cases identical), but the probabilistic framework provides a more principled formulation of the problem in terms of meaningful probabilistic components. 
Specifically, Bayesian approaches rely on a likelihood function that expresses the relationship between the noisy measurements and the quantity to be estimated, and a prior distribution that expresses the probability of encountering any particular value of that quantity. A probabilistic model can also provide a richer description, by defining a full probability density over the set of possible “percepts”, rather than just a single value. Numerous analyses of psychophysical experiments have made use of such distributions within the framework of signal detection theory in order to model perceptual behavior [7].

Figure 1: Bayesian model of visual speed perception (panels plot probability density against visual speed, showing prior, likelihood and posterior). a) For a high contrast stimulus, the likelihood has a narrow width (a high signal-to-noise ratio) and the prior induces only a small shift µ of the mean v̂ of the posterior. b) For a low contrast stimulus, the measurement is noisy, leading to a wider likelihood. The shift µ is much larger and the perceived speed lower than under condition (a).

Previous work has shown that an ideal Bayesian observer model based on Gaussian forms for both likelihood and prior is sufficient to capture the basic qualitative features of global translational motion perception [5, 6]. But the behavior of the resulting model deviates systematically from human perceptual data, most importantly with regard to trial-to-trial variability and the precise form of interaction between contrast and perceived speed. A recent article achieved better fits for the model under the assumption that human contrast perception saturates [8]. In order to advance the theory of Bayesian perception and provide significant constraints on models of neural implementation, it seems essential to constrain quantitatively both the likelihood function and the prior probability distribution.
In previous work, the proposed likelihood functions were derived from the brightness constancy constraint [5, 6] or other generative principles [9]. Also, previous approaches defined the prior distribution based on general assumptions and computational convenience, typically choosing a Gaussian with zero mean, although a Laplacian prior has also been suggested [4]. In this paper, we develop a more general form of Bayesian model for speed perception that can account for trial-to-trial variability. We use psychophysical speed discrimination data in order to constrain both the likelihood and the prior function.

1 Probabilistic Model of Visual Speed Perception

1.1 Ideal Bayesian Observer

Assume that an observer wants to obtain an estimate for a variable v based on a measurement m that she/he performs. A Bayesian observer “knows” that the measurement device is not ideal and that, therefore, the measurement m is affected by noise. Hence, this observer combines the information gained by the measurement m with a priori knowledge about v. Doing so (and assuming that the prior knowledge is valid), the observer will, on average, perform better in estimating v than by just trusting the measurements m. According to Bayes’ rule

$$p(v|m) = \frac{1}{\alpha}\, p(m|v)\, p(v) \qquad (1)$$

the probability of perceiving v given m (posterior) is the product of the likelihood of v for a particular measurement m and the a priori knowledge about the estimated variable v (prior). α is a normalization constant independent of v that ensures that the posterior is a proper probability distribution.

Figure 2: 2AFC speed discrimination experiment. a) Two patches of drifting gratings were displayed simultaneously (motion without movement). The subject was asked to fixate the center cross and decide after the presentation which of the two gratings was moving faster. b) A typical psychometric curve obtained under such a paradigm, plotting P(v̂2 > v̂1) against v2. The dots represent the empirical probability that the subject perceived stimulus2 moving faster than stimulus1. The speed of stimulus1 was fixed while v2 was varied. The point of subjective equality, v_match, is the value of v2 for which P_cum = 0.5. The threshold velocity v_thresh is the velocity for which P_cum = 0.875.

It is important to note that the measurement m is an internal variable of the observer and is not necessarily represented in the same space as v. The likelihood embodies both the mapping from v to m and the noise in this mapping. So far, we assume that there is a monotonic function f(v): v → v_m that maps v into the same space as m (m-space). Doing so allows us to analytically treat m and v_m in the same space. We will later propose a suitable form of the mapping function f(v).

An ideal Bayesian observer selects the estimate that minimizes the expected loss, given the posterior and a loss function. We assume a least-squares loss function. Then, the optimal estimate v̂ is the mean of the posterior in Equation (1). It is easy to see why this model of a Bayesian observer is consistent with the fact that perceived speed decreases with contrast. The width of the likelihood varies inversely with the accuracy of the measurements performed by the observer, which presumably decreases with decreasing contrast due to a decreasing signal-to-noise ratio. As illustrated in Figure 1, the shift in perceived speed towards slow velocities grows with the width of the likelihood, and thus a Bayesian model can qualitatively explain the psychophysical results [1].

1.2 Two Alternative Forced Choice Experiment

We would like to examine perceived speeds under a wide range of conditions in order to constrain a Bayesian model. Unfortunately, perceived speed is an internal variable, and it is not obvious how to design an experiment that would allow subjects to express it directly (see footnote 1).
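The qualitative argument of Section 1.1, that a wider likelihood pulls the posterior mean further toward slow speeds, can be checked numerically. The sketch below evaluates the mean of the posterior in Equation (1) on a grid. The Gaussian likelihood, the exponential slow-speed prior and all parameter values are illustrative assumptions chosen for this example; they are not the forms fitted in the paper.

```python
import math

def posterior_mean(m, sigma, prior, v_grid):
    """Mean of p(v|m) proportional to p(m|v) p(v), evaluated on a uniform grid.

    Assumes a Gaussian likelihood centered on the measurement m with
    standard deviation sigma; the grid spacing cancels in the ratio.
    """
    post = [math.exp(-(m - v) ** 2 / (2 * sigma ** 2)) * prior(v) for v in v_grid]
    alpha = sum(post)                            # normalization constant
    return sum(v * p for v, p in zip(v_grid, post)) / alpha

v_grid = [i * 0.01 for i in range(1, 3001)]      # speeds from 0.01 to 30

def slow_prior(v):
    """Hypothetical prior favoring slow speeds."""
    return math.exp(-v / 2.0)

# narrow likelihood (high contrast) vs. wide likelihood (low contrast)
high_contrast = posterior_mean(8.0, 0.5, slow_prior, v_grid)
low_contrast = posterior_mean(8.0, 2.0, slow_prior, v_grid)
```

For this particular prior the posterior is again Gaussian with mean m - sigma²/2 (up to truncation at zero), so the low-contrast estimate lies well below the high-contrast one even though the measurement is the same.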
Perceived speed can only be accessed indirectly by asking the subject to compare the speed of two stimuli. For a given trial, an ideal Bayesian observer in such a two-alternative forced choice (2AFC) experimental paradigm simply decides on the basis of the two trial estimates v̂1 (stimulus1) and v̂2 (stimulus2) which stimulus moves faster. Each estimate v̂ is based on a particular measurement m. For a given stimulus with speed v, an ideal Bayesian observer will produce a distribution of estimates p(v̂|v) because m is noisy. Over trials, the observer’s behavior can be described by classical signal detection theory based on the distributions of the estimates; hence, for example, the probability of perceiving stimulus2 moving faster than stimulus1 is given as the cumulative probability

$$P_{cum}(\hat{v}_2 > \hat{v}_1) = \int_0^{\infty} p(\hat{v}_2|v_2) \int_0^{\hat{v}_2} p(\hat{v}_1|v_1)\, d\hat{v}_1\, d\hat{v}_2 \qquad (2)$$

P_cum describes the full psychometric curve. Figure 2b illustrates the measured psychometric curve and its fit from such an experimental situation.

Footnote 1: Although see [10] for an example of determining and even changing the prior of a Bayesian model for a sensorimotor task, where the estimates are more directly accessible.

2 Experimental Methods

We measured matching speeds (P_cum = 0.5) and thresholds (P_cum = 0.875) in a 2AFC speed discrimination task. Subjects were presented simultaneously with two circular patches of horizontally drifting sine-wave gratings for the duration of one second (Figure 2a). Patches were 3 deg in diameter, and were displayed at 6 deg eccentricity to either side of a fixation cross. The stimuli had an identical spatial frequency of 1.5 cycle/deg.
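The double integral in Equation (2) can be approximated numerically once a form for the estimate distributions p(v̂|v) is chosen. The sketch below assumes, purely for illustration, that both estimate distributions are Gaussian and located far enough from zero that truncating the integration range at 0 is harmless; the function names and parameter values are hypothetical.

```python
import math

def gauss_pdf(x, mean, sd):
    return math.exp(-(x - mean) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def p_cum(mean1, sd1, mean2, sd2, upper=40.0, n=4000):
    """Midpoint-rule approximation of P(v2_hat > v1_hat) on [0, upper].

    `inner` is the running integral of p(v1_hat | v1) from 0 to the current
    point; half of each cell is added before and half after it is used, so
    the inner integral is evaluated at the cell midpoint.
    """
    dv = upper / n
    inner = 0.0
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * dv
        half = 0.5 * gauss_pdf(v, mean1, sd1) * dv
        inner += half
        total += gauss_pdf(v, mean2, sd2) * inner * dv
        inner += half
    return total

def p_cum_closed_form(mean1, sd1, mean2, sd2):
    """Closed form for independent Gaussian estimates, used as a check."""
    z = (mean2 - mean1) / math.sqrt(sd1 ** 2 + sd2 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Sweeping `mean2` while holding the other arguments fixed traces out a psychometric curve of the kind shown in Figure 2b; the value of `mean2` at which `p_cum` crosses 0.5 plays the role of the matching speed.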
One stimulus was considered to be the reference stimulus, having one of two different contrast values (c1 =[0.075 0.5]) and one of five different speed values (u1 =[1 2 4 8 12] deg/sec), while the second stimulus (test) had one of five different contrast values (c2 =[0.05 0.1 0.2 0.4 0.8]) and a varying speed that was determined by an interleaved staircase procedure. For each condition there were 96 trials. Conditions were randomly interleaved, including a random choice of stimulus identity (test vs. reference) and motion direction (right vs. left). Subjects were asked to fixate during stimulus presentation and select the faster moving stimulus. The threshold experiment differed only in that auditory feedback was given to indicate the correctness of the subject's decision. This did not change the outcome of the experiment but significantly increased the quality of the data and thus reduced the number of trials needed. 3 Analysis With the data from the speed discrimination experiments we could in principle apply a parametric fit using Equation (2) to derive the prior and the likelihood, but the optimization is difficult, and the fit might not be well constrained given the amount of data we have obtained. The problem becomes much more tractable given the following weak assumptions:
• We consider the prior to be relatively smooth.
• We assume that the measurement m is corrupted by additive Gaussian noise with a variance whose dependence on stimulus speed and contrast is separable.
• We assume that there is a mapping function f(v) : v → vm that maps v into the space of m (m-space). In that space, the likelihood is convolutional, i.e., the noise in the measurement directly defines the width of the likelihood.
These assumptions allow us to relate the psychophysical data to our probabilistic model in a simple way. The following analysis is in the m-space. The point of subjective equality (Pcum = 0.5) is defined as the point where the expected values of the speed estimates are equal.
We write
$$E\langle \hat{v}_{m,1} \rangle = E\langle \hat{v}_{m,2} \rangle, \qquad v_{m,1} - E\langle \mu_1 \rangle = v_{m,2} - E\langle \mu_2 \rangle \qquad (3)$$
where $E\langle \mu \rangle$ is the expected shift of the perceived speed compared to the veridical speed. For the discrimination threshold experiment, the above assumptions imply that the variance $\mathrm{var}\langle \hat{v}_m \rangle$ of the speed estimates $\hat{v}_m$ is equal for both stimuli. Then, (2) predicts that the discrimination threshold is proportional to the standard deviation, thus
$$v_{m,2} - v_{m,1} = \gamma \sqrt{\mathrm{var}\langle \hat{v}_m \rangle} \qquad (4)$$
where $\gamma$ is a constant that depends on the threshold criterion Pcum and the exact shape of $p(\hat{v}_m|v_m)$.
[Figure 3: Piece-wise approximation. We perform a parametric fit by assuming the prior to be piece-wise linear and the likelihood to be LogNormal (Gaussian in the m-space). Labels: likelihood, prior, slope a, offset b; axis vm.]
3.1 Estimating the prior and likelihood In order to extract the prior and the likelihood of our model from the data, we have to find a generic local form of the prior and the likelihood and relate them to the mean and the variance of the speed estimates. As illustrated in Figure 3, we assume that the likelihood is Gaussian with a standard deviation $\sigma(c, v_m)$. Furthermore, the prior is assumed to be well approximated by a first-order Taylor series expansion over the velocity ranges covered by the likelihood. We parameterize this linear expansion of the prior as $p(v_m) = a v_m + b$. We can now derive a posterior for this local approximation of likelihood and prior and then define the perceived speed shift $\mu(m)$. The posterior can be written as
$$p(v_m|m) = \frac{1}{\alpha}\, p(m|v_m)\, p(v_m) = \frac{1}{\alpha} \left[ \exp\left( -\frac{v_m^2}{2\sigma(c, v_m)^2} \right) (a v_m + b) \right] \qquad (5)$$
where $\alpha$ is the normalization constant
$$\alpha = \int_{-\infty}^{\infty} p(m|v_m)\, p(v_m)\, dv_m = b \sqrt{2\pi\, \sigma(c, v_m)^2} \qquad (6)$$
We can compute $\mu(m)$ as the first order moment of the posterior for a given m.
Exploiting the symmetries around the origin, we find
$$\mu(m) = \int_{-\infty}^{\infty} v_m\, p(v_m|m)\, dv_m \equiv \frac{a(m)}{b(m)}\, \sigma(c, v_m)^2 \qquad (7)$$
The expected value of $\mu(m)$ is equal to the value of $\mu$ at the expected value of the measurement m (which is the stimulus velocity $v_m$), thus
$$E\langle \mu \rangle = \mu(m)\big|_{m=v_m} = \frac{a(v_m)}{b(v_m)}\, \sigma(c, v_m)^2 \qquad (8)$$
Similarly, we derive $\mathrm{var}\langle \hat{v}_m \rangle$. Because the estimator is deterministic, the variance of the estimate only depends on the variance of the measurement m. For a given stimulus, the variance of the estimate can be well approximated by
$$\mathrm{var}\langle \hat{v}_m \rangle = \mathrm{var}\langle m \rangle \left( \frac{\partial \hat{v}_m(m)}{\partial m}\Big|_{m=v_m} \right)^2 = \mathrm{var}\langle m \rangle \left( 1 - \frac{\partial \mu(m)}{\partial m}\Big|_{m=v_m} \right)^2 \approx \mathrm{var}\langle m \rangle \qquad (9)$$
Under the assumption of a locally smooth prior, the perceived velocity shift remains locally constant. The variance of the perceived speed $\hat{v}_m$ becomes equal to the variance of the measurement m, which is the variance of the likelihood (in the m-space), thus
$$\mathrm{var}\langle \hat{v}_m \rangle = \sigma(c, v_m)^2 \qquad (10)$$
With (3) and (4), the above derivations provide a simple dependency of the psychophysical data on the local parameters of the likelihood and the prior. 3.2 Choosing a Logarithmic speed representation We now want to choose the appropriate mapping function f(v) that maps v to the m-space. We define the m-space as the space in which the likelihood is Gaussian with a speed-independent width. We have shown that the discrimination threshold is proportional to the width of the likelihood (4), (10). Also, we know from the psychophysics literature that visual speed discrimination approximately follows a Weber-Fechner law [11, 12], so that the discrimination threshold increases roughly in proportion to speed, and so would the width of the likelihood. A logarithmic speed representation would be compatible with the data and our choice of the likelihood. Hence, we transform the linear speed-domain v into a normalized logarithmic domain according to
$$v_m = f(v) = \ln\left( \frac{v + v_0}{v_0} \right) \qquad (11)$$
where $v_0$ is a small normalization constant.
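As a hedged numerical illustration of Equations (7) and (11), the following Python sketch integrates the local posterior of Equation (5) on a grid and compares its first moment with the closed form (a/b)σ². The function names, the grid settings, and the values chosen for a, b, σ and v0 are assumptions made for this example only; none of them come from the paper.

```python
import numpy as np

def log_speed(v, v0=0.3):
    """Equation (11): map linear speed v to the normalized log domain.
    v0 = 0.3 deg/sec is an illustrative value, not one reported in the text."""
    return np.log((np.asarray(v, dtype=float) + v0) / v0)

def local_posterior_mean(a, b, sigma, half_width=10.0, n=20001):
    """First moment of exp(-v^2 / (2 sigma^2)) * (a v + b), the local
    posterior of Equation (5), in coordinates centred on the measurement."""
    v = np.linspace(-half_width * sigma, half_width * sigma, n)
    posterior = np.exp(-v**2 / (2.0 * sigma**2)) * (a * v + b)
    return np.sum(v * posterior) / np.sum(posterior)

# Hypothetical local prior slope/offset and likelihood width.
a, b, sigma = -0.2, 1.0, 0.3
numeric_shift = local_posterior_mean(a, b, sigma)
closed_form_shift = (a / b) * sigma**2  # right-hand side of Equation (7)
```

Under these assumed values the grid-based moment and the closed form agree to numerical precision, which is the property the fit in the text relies on.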
The normalization is chosen to account for the expected deviation from equal-variance behavior at the low end. Surprisingly, it has been found that neurons in the Middle Temporal area (Area MT) of macaque monkeys have speed-tuning curves that are very well approximated by Gaussians of constant width in the above normalized logarithmic space [13]. These neurons are known to play a central role in the representation of motion. It seems natural to assume that they are strongly involved in tasks such as the psychophysical experiments we performed. 4 Results Figure 4 shows the contrast dependent shift of speed perception and the speed discrimination threshold data for two subjects. Data points connected with a dashed line represent the relative matching speed (v2 /v1 ) for a particular contrast value c2 of the test stimulus as a function of the speed of the reference stimulus. Error bars are the empirical standard deviation of fits to bootstrapped samples of the data. Clearly, low contrast stimuli are perceived to move slower. The effect, however, varies across the tested speed range and tends to become smaller for higher speeds. The relative discrimination thresholds for two different contrasts as a function of speed show that the Weber-Fechner law holds only approximately. The data are in good agreement with other data from the psychophysics literature [1, 11, 8]. For each subject, data from both experiments were used to compute a parametric least-squares fit according to (3), (4), (7), and (10). In order to test the assumption of a LogNormal likelihood we allowed the standard deviation to be dependent on contrast and speed, thus σ(c, vm ) = g(c)h(vm ). We split the speed range into six bins (subject2: five) and parameterized h(vm ) and the ratio a/b accordingly. Similarly, we parameterized g(c) for the seven contrast values. The resulting fits are superimposed as bold lines in Figure 4.
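The fit relies on the constant γ in Equation (4), which depends on the threshold criterion and on the shape of the estimate distributions. As a worked special case, assume (for illustration only) that both estimates are Gaussian in the m-space with equal standard deviation; then Equation (2) reduces to a normal CDF and γ follows directly. The function names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_cum(delta, sigma):
    """Probability that estimate 2 exceeds estimate 1 when both estimates are
    Gaussian with equal standard deviation sigma and means separated by delta.
    This is an assumed Gaussian special case of Equation (2)."""
    return NormalDist().cdf(delta / (sigma * sqrt(2.0)))

def gamma_for_criterion(p):
    """Constant gamma of Equation (4) for threshold criterion p (Gaussian case)."""
    return sqrt(2.0) * NormalDist().inv_cdf(p)

# Criterion used in the experiments: Pcum = 0.875.
gamma = gamma_for_criterion(0.875)  # about 1.63 under the Gaussian assumption
```

With this γ, a separation of γσ between the two mean estimates yields Pcum = 0.875, and a separation of zero yields Pcum = 0.5, the point of subjective equality.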
Figure 5 shows the fitted parametric values for g(c) and h(v) (plotted in the linear domain), and the reconstructed prior distribution p(v) transformed back to the linear domain. The approximately constant values for h(v) provide evidence that a LogNormal distribution is an appropriate functional description of the likelihood. The resulting values for g(c) suggest that the width of the likelihood decays roughly exponentially with contrast, with strong saturation at higher contrasts.
[Figure 4: Speed discrimination data for two subjects. a) The relative matching speed of a test stimulus with different contrast levels (c2 =[0.05 0.1 0.2 0.4 0.8]) to achieve subjective equality with a reference stimulus (two different contrast values c1 ). b) The relative discrimination threshold for two stimuli with equal contrast (c1,2 =[0.075 0.5]). Axes: normalized matching speed against speed of reference stimulus [deg/sec] in (a); relative discrimination threshold against stimulus speed [deg/sec] in (b).]
[Figure 5: Reconstructed prior distribution and parameters of the likelihood function. The reconstructed priors for both subjects show much heavier tails than a Gaussian (dashed fit), approximately following a power-law function with exponent n ≈ −1.4 (bold line). Panels: p(v) [unnormalized] against speed [deg/sec], with fitted exponents n = −1.41 (subject 1) and n = −1.35 (subject 2); g(c) against contrast; h(v) against speed [deg/sec].]
5 Conclusions We have proposed a probabilistic framework based on a Bayesian ideal observer and standard signal detection theory.
We have derived a likelihood function and prior distribution for the estimator, with a fairly conservative set of assumptions, constrained by psychophysical measurements of speed discrimination and matching. The width of the resulting likelihood is nearly constant in the logarithmic speed domain, and decreases approximately exponentially with contrast. The prior expresses a preference for slower speeds, and approximately follows a power-law distribution, and thus has much heavier tails than a Gaussian. It would be interesting to compare the prior distributions derived here with measured true distributions of local image velocities that impinge on the retina. Although a number of authors have measured the spatio-temporal structure of natural images [e.g., 14], it is clearly difficult to extract the true prior distribution from such measurements because of the feedback loop formed through movements of the body, head and eyes. Acknowledgments The authors thank all subjects for their participation in the psychophysical experiments. References
[1] P. Thompson. Perceived rate of movement depends on contrast. Vision Research, 22:377–380, 1982.
[2] L.S. Stone and P. Thompson. Human speed perception is contrast dependent. Vision Research, 32(8):1535–1549, 1992.
[3] A. Yuille and N. Grzywacz. A computational theory for the perception of coherent visual motion. Nature, 333(5):71–74, May 1988.
[4] Alan Stocker. Constraint Optimization Networks for Visual Motion Perception - Analysis and Synthesis. PhD thesis, Dept. of Physics, Swiss Federal Institute of Technology, Zürich, Switzerland, March 2002.
[5] Eero Simoncelli. Distributed analysis and representation of visual motion. PhD thesis, MIT, Dept. of Electrical Engineering, Cambridge, MA, 1993.
[6] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002.
[7] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.
[8] F. Hürlimann, D. Kiper, and M. Carandini. Testing the Bayesian model of perceived speed. Vision Research, 2002.
[9] Y. Weiss and D.J. Fleet. Probabilistic Models of the Brain, chapter Velocity Likelihoods in Biological and Machine Vision, pages 77–96. Bradford, 2002.
[10] K. Koerding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[11] Leslie Welch. The perception of moving plaids reveals two motion-processing stages. Nature, 337:734–736, 1989.
[12] S. McKee, G. Silvermann, and K. Nakayama. Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Research, 26(4):609–619, 1986.
[13] C.H. Anderson, H. Nover, and G.C. DeAngelis. Modeling the velocity tuning of macaque MT neurons. Journal of Vision/VSS abstract, 2003.
[14] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.

5 0.53601313 193 nips-2004-Theories of Access Consciousness

Author: Michael D. Colagrosso, Michael C. Mozer

Abstract: Theories of access consciousness address how it is that some mental states but not others are available for evaluation, choice behavior, and verbal report. Farah, O’Reilly, and Vecera (1994) argue that quality of representation is critical; Dehaene, Sergent, and Changeux (2003) argue that the ability to communicate representations is critical. We present a probabilistic information transmission or PIT model that suggests both of these conditions are essential for access consciousness. Having successfully modeled data from the repetition priming literature in the past, we use the PIT model to account for data from two experiments on subliminal priming, showing that the model produces priming even in the absence of accessibility and reportability of internal states. The model provides a mechanistic basis for understanding the dissociation of priming and awareness. Philosophy has made many attempts to identify distinct aspects of consciousness. Perhaps the most famous effort is Block’s (1995) delineation of phenomenal and access consciousness. Phenomenal consciousness has to do with “what it is like” to experience chocolate or a pin prick. Access consciousness refers to internal states whose content is “(1) inferentially promiscuous, i.e., poised to be used as a premise in reasoning, (2) poised for control of action, and (3) poised for rational control of speech.” (p. 230) The scientific study of consciousness has exploded in the past six years, and an important catalyst for this explosion has been the decision to focus on the problem of access consciousness: how is it that some mental states but not others become available for evaluation, choice behavior, verbal report, and storage in working memory. 
Another reason for the recent explosion of consciousness research is the availability of functional imaging techniques to explore differences in brain activation between conscious and unconscious states, as well as the development of clever psychological experiments that show that a stimulus that is not consciously perceived can nonetheless influence cognition, which we describe shortly. 1 Subliminal Priming The phenomena we address utilize an experimental paradigm known as repetition priming. Priming refers to an improvement in efficiency in processing a stimulus item as a result of previous exposure to the item. Efficiency is defined in terms of shorter response times, lower error rates, or both. A typical long-term perceptual priming experiment consists of a study phase during which participants are asked to read aloud a list of words, and a test phase during which participants must name or categorize a series of words, presented one at a time. Reaction time is lower and/or accuracy is higher for test words that were also on the study list. Repetition priming occurs without strategic effort on the part of participants, and therefore appears to be a low level mechanism of learning, which likely serves as the mechanism underlying the refinement of cognitive skills with practice. In traditional studies, priming is supraliminal—the prime is consciously perceived. In the studies we model here, primes are subliminal. Subliminal priming addresses fundamental issues concerning conscious access: How is it that a word or image that cannot be identified, detected, or even discriminated in forced choice can nonetheless influence the processing of a subsequent stimulus word? Answering this question in a computational framework would be a significant advance toward understanding the nature of access consciousness. 
2 Models of Conscious and Unconscious Processing In contrast to the wealth of experimental data, and the large number of speculative and philosophical papers on consciousness, concrete computational models are rare. The domain of consciousness is particularly ripe for theoretical perspectives, because it is a significant contribution to simply provide an existence proof of a mechanism that can explain specific experimental data. Ordinarily, a theorist faces skepticism when presenting a model; it often seems that hundreds of alternative, equally plausible accounts must exist. However, when addressing data deemed central to issues of consciousness, simply providing a concrete handle on the phenomena serves to demystify consciousness and bring it into the realm of scientific understanding. We are familiar with only three computational models that address specific experimental data in the domain of consciousness. We summarize these models, and then present a novel model and describe its relationship to the previous efforts. Farah, O’Reilly, and Vecera (1994) were the first to model specific phenomena pertaining to consciousness in a computational framework. The phenomena involve prosopagnosia, a deficit of overt face recognition following brain damage. Nonetheless, prosopagnosia patients exhibit residual covert recognition by a variety of tests. For example, when patients are asked to categorize names as famous or nonfamous, their response times are faster to a famous name when the name is primed by a picture of a semantically related face (e.g., the name “Bill Clinton” when preceded by a photograph of Hillary), despite the fact that they could not identify the related face. Farah et al. model face recognition in a neural network, and show that when the network is damaged, it loses the ability to perform tasks requiring high fidelity representations (e.g., identification) but not tasks requiring only coarse information (e.g., semantic priming). 
They argue that conscious perception is associated with a certain minimal quality of representation. Dehaene and Naccache (2001) outline a framework based on Baars’ (1989) notion of conscious states as residing in a global workspace. They describe the workspace as a “distributed neural system...with long-distance connectivity that can potentially interconnect multiple specialized brain areas in a coordinated, though variable manner.” (p. 13) Dehaene, Sergent, and Changeux (2003) implement this framework in a complicated architecture of integrate-and-fire neurons and show that the model can qualitatively account for the attentional blink phenomenon. The attentional blink is observed in experiments where participants are shown a rapid series of stimuli, which includes two targets (T1 and T2). If T2 appears shortly after T1, the ability to report T2 drops, as if attention is distracted. Dehaene et al. explain this phenomenon as follows. When T1 is presented, its activation propagates to frontal cortical areas (the global workspace). Feedback connections lead to a resonance between frontal and posterior areas, which strengthens T1 but blocks T2 from entering the workspace. If the T1-T2 lag is sufficiently great, habituation of T1 weakens its representation enough that T2 can enter the workspace and suppress T1. In this account, conscious access is achieved via resonance between posterior and frontal areas. Although the Farah et al. and Dehaene et al. models might not seem to have much in common, they both make claims concerning what is required to achieve functional connectivity between perceptual and response systems. Farah et al. focus on aspects of the representation; Dehaene et al. focus on a pathway through which representations can be communicated. These two aspects are not incompatible, and in fact, a third model incorporates both.
Mathis and Mozer (1996) describe an architecture with processing modules for perceptual and response processes, implemented as attractor neural nets. They argue that in order for a representation in some perceptual module to be assured of influencing a response module, (a) it must have certain characteristics–temporal persistence and well-formedness– which is quite similar to Farah et al.’s notion of quality, and (b) the two modules must be interconnected—which is the purpose of Dehaene et al.’s global workspace. The model has two limitations that restrict its value as a contemporary account of conscious access. First, it addressed classical subliminal priming data, but more reliable data has recently been reported. Second, like the other two models, Mathis and Mozer used a complex neural network architecture with arbitrary assumptions built in, and the sensitivity of the model’s behavior to these assumptions is far from clear. In this paper, we present a model that embodies the same assumptions as Mathis and Mozer, but overcomes its two limitations, and explains subliminal-priming data that has yet to be interpreted via a computational model. 3 The Probabilistic Information Transmission (PIT) Framework Our model is based on the probabilistic information transmission or PIT framework of Mozer, Colagrosso, and Huber (2002, 2003). The framework characterizes the transmission of information from perceptual to response systems, and how the time course of information transmission changes with experience (i.e., priming). Mozer et al. used this framework to account for a variety of facilitation effects from supraliminal repetition priming. The framework describes cognition in terms of a collection of information-processing pathways, and supposes that any act of cognition involves coordination among multiple pathways. 
For example, to model a letter-naming task where a letter printed in upper or lower case is presented visually and the letter must be named, the framework would assume a perceptual pathway that maps the visual input to an identity representation, and a response pathway that maps an identity representation to a naming response. The framework is formalized as a probabilistic model: the pathway input and output are random variables and microinference in a pathway is carried out by Bayesian belief revision. The framework captures the time course of information processing for a single experimental trial. To elaborate, consider a pathway whose input at time t is a discrete random variable, denoted X(t), which can assume values $x_1, x_2, x_3, \ldots, x_{n_x}$ corresponding to alternative input states. Similarly, the output of the pathway at time t is a discrete random variable, denoted Y(t), which can assume values $y_1, y_2, y_3, \ldots, y_{n_y}$. For example, in the letter-naming task, the input to the perceptual pathway would be one of $n_x = 52$ visual patterns corresponding to the upper- and lower-case letters of the alphabet, and the output is one of $n_y = 26$ letter identities. To present a particular input alternative, say $x_i$, to the model for T time steps, we specify $X(t) = x_i$ for t = 1 . . . T, and allow the model to compute $P(Y(t) \mid X(1) \ldots X(t))$. A pathway is modeled as a dynamic Bayes network; the minimal version of the model used in the present simulations is simply a hidden Markov model, where the X(t) are observations and the Y(t) are inferred states (see Figure 1a). In typical usage, an HMM is presented with a sequence of distinct inputs, whereas we maintain the same input for many successive time steps; and an HMM transitions through a sequence of distinct hidden states, whereas we attempt to converge with increasing confidence on a single state.
Figure 1b illustrates the time course of inference in a single pathway with 52 input and 26 output alternatives and two-to-one associations. The solid line in the figure shows, as a function of time t, $P(Y(t) = y_i \mid X(1) = x_{2i}, \ldots, X(t) = x_{2i})$, i.e., the probability that input i (say, the visual pattern of an upper case O) will produce its target output (the letter identity). Evidence for the target output accumulates gradually over time, yielding a speed-accuracy curve that relates the number of iterations to the accuracy of identification.
[Figure 1: (a) basic pathway architecture, a hidden Markov model; (b) time course of inference in a pathway when the letter O is presented, causing activation of both O and the visually similar Q. Axes in (b): P(Output) against time.]
The exact shape of the speed-accuracy curve (the pathway dynamics) is determined by three probability distributions, which embody the knowledge and past experience of the model. First, $P(Y(0))$ is the prior distribution over outputs in the absence of any information about the input. Second, $P(Y(t) \mid Y(t-1))$ characterizes how the pathway output evolves over time. We assume the transition probability matrix serves as a memory with diffusion, i.e., $P(Y(t) = y_i \mid Y(t-1) = y_j) = (1 - \beta)\delta_{ij} + \beta P(Y(0) = y_i)$, where $\beta$ is the diffusion constant and $\delta_{ij}$ is the Kronecker delta. Third, $P(X(t) \mid Y(t))$ characterizes the strength of association between inputs and outputs. The greater the association strength, the more rapidly information about X will be communicated to Y. We parameterize this distribution as $P(X(t) = x_i \mid Y(t) = y_j) \sim 1 + \sum_k \gamma_{ik}\alpha_{kj}$, where $\alpha_{ij}$ indicates the frequency of experience with the association between states $x_i$ and $y_j$, and $\gamma_{ik}$ specifies the similarity between states $x_i$ and $x_k$. (Although the representation of states is localist, the $\gamma$ terms allow us to design in the similarity structure inherent in a distributed representation.)
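The three distributions above are enough to run the pathway as a filter. The following Python sketch is one possible reading, not the authors' implementation: it treats "∼" as proportionality with normalization over inputs, assumes the similarity matrix includes self-similarity of 1 on its diagonal, and uses a toy pathway with 4 inputs and 2 outputs whose association strengths are made up for the example. All function and variable names are hypothetical.

```python
import numpy as np

def emission_matrix(alpha, similarity):
    """P(X = x_i | Y = y_j) proportional to 1 + sum_k similarity[i, k] * alpha[k, j],
    normalized over inputs i for each output j (normalization is an assumption)."""
    unnorm = 1.0 + similarity @ alpha
    return unnorm / unnorm.sum(axis=0, keepdims=True)

def transition_matrix(prior, beta):
    """T[i, j] = P(Y(t) = y_i | Y(t-1) = y_j) = (1 - beta) * delta_ij + beta * prior[i]."""
    n = prior.size
    return (1.0 - beta) * np.eye(n) + beta * np.outer(prior, np.ones(n))

def run_pathway(x_index, steps, prior, beta, alpha, similarity):
    """Return P(Y(t) | X(1..t) = x_index) for t = 1..steps, one row per step."""
    emit = emission_matrix(alpha, similarity)
    trans = transition_matrix(prior, beta)
    belief = prior.copy()
    history = []
    for _ in range(steps):
        # Diffuse the previous belief, then weight it by the evidence for the clamped input.
        belief = emit[x_index] * (trans @ belief)
        belief /= belief.sum()
        history.append(belief.copy())
    return np.array(history)

# Toy pathway with two-to-one associations: x0, x1 -> y0 and x2, x3 -> y1.
alpha = 5.0 * np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
similarity = np.eye(4)            # assumed: each input is similar only to itself
prior = np.array([0.5, 0.5])
history = run_pathway(x_index=0, steps=20, prior=prior, beta=0.05,
                      alpha=alpha, similarity=similarity)
```

In this toy setting the probability of the target output rises over iterations and then saturates below 1 because of the diffusion term, which gives the qualitative shape of a speed-accuracy curve.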
These association strengths are highly constrained by the task structure and the similarity structure and familiarity of the inputs. Fundamental to the framework is the assumption that with each experience, a pathway becomes more efficient at processing an input. Efficiency is reflected by a shift in the speed-accuracy curve to the left. In Mozer, Colagrosso, and Huber (2002, 2003), we proposed two distinct mechanisms to model phenomena of supraliminal priming. First, the association frequencies, $\alpha_{ij}$, are increased following a trial in which $x_i$ leads to activation of $y_j$, resulting in more efficient transmission of information, corresponding to an increased slope of the solid line in Figure 1b. The increase is Hebbian, based on the maximum activation achieved by $x_i$ and $y_j$: $\Delta\alpha_{ij} = \eta \max_t P(X(t) = x_i) P(Y(t) = y_j)$, where $\eta$ is a step size. Second, the priors, which serve as a model of the environment, are increased to indicate a greater likelihood of the same output occurring again in the future. In modeling data from supraliminal priming, we found that the increases to association frequencies are long lasting, but the increases to the priors decay over the course of a few minutes or a few trials. As a result, the prior updating does not play into the simulation we report here; we refer the reader to Mozer, Colagrosso, and Huber (2003) for details. 4 Access Consciousness and PIT We have described the operation of a single pathway, but to model any cognitive task, we require a series of pathways in cascade. For a simple choice task, we use a perceptual pathway cascaded to a response pathway. The interconnection between the pathways is achieved by copying the output of the perceptual pathway, $Y^p(t)$, to the input of the response pathway, $X^r(t)$, at each time t. This multiple-pathway architecture allows us to characterize the notion of access consciousness.
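The Hebbian increment to the association frequencies described above can be sketched in a few lines of Python. This is an illustration under an assumed reading of the formula, namely that the maximum is taken over time of the product of input and output probabilities; the function name and the toy trajectories are hypothetical.

```python
import numpy as np

def hebbian_increment(px, py, eta=1.0):
    """px: (T, nx) array of P(X(t) = x_i); py: (T, ny) array of P(Y(t) = y_j).
    Returns delta_alpha with shape (nx, ny), where
    delta_alpha[i, j] = eta * max_t px[t, i] * py[t, j]."""
    products = px[:, :, None] * py[:, None, :]  # shape (T, nx, ny)
    return eta * products.max(axis=0)

# Toy trial: input x0 is clamped for three steps while belief in y0 grows.
px = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
py = np.array([[0.6, 0.4], [0.8, 0.2], [0.9, 0.1]])
delta_alpha = hebbian_increment(px, py, eta=1.0)
# The association x0 -> y0 gains 0.9 (the peak output activation); x1 gains nothing.
```

Because the increment depends only on activation within the pathway, it is nonzero even when the output never becomes strong enough to drive a later pathway, which is the property the priming account uses.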
Considering the output of the perceptual pathway, access is achieved when: (1) the output representation is sufficient to trigger the correct behavior in the response pathway, and (2) the perceptual and response pathways are functionally interconnected. In more general terms, access for a perceptual pathway output requires that these two conditions be met not just for a specific response pathway, but for arbitrary response pathways (e.g., pathways for naming, choice, evaluation, working memory, etc.). In Mozer and Colagrosso (in preparation) we characterize the sufficiency requirements of condition 1; they involve a representation of low entropy that stays active for long enough that the representation can propagate to the next pathway. As we will show, a briefly presented stimulus fails to achieve a representation that supports choice and naming responses. Nonetheless, the stimulus evokes activity in the perceptual pathway. Because perceptual priming depends on the magnitude of the activation in the perceptual pathway, not on the activation being communicated to response pathways, the framework is consistent with the notion of priming occurring in the absence of awareness. 4.1 Simulation of Bar and Biederman (1998) Bar and Biederman (1998) presented a sequence of masked line drawings of objects and asked participants to name the objects, even if they had to guess. If the guess was incorrect, participants were required to choose the object name from a set of four alternatives. Unbeknownst to the participant, some of the drawings in the series were repeated, and Bar and Biederman were interested in whether participants would benefit from the first presentation even if it could not be identified. The repeated objects could be the same or a different exemplar of the object, and could appear in either the same or a different display position. Participants were able to name 13.5% of drawings on presentation 1, but accuracy jumped to 34.5% on presentation 2.
Accuracy did improve, though not as much, if the same shape was presented in a different position, but not if a different drawing of the same object was presented, suggesting a locus of priming early in the visual stream. The improvement in accuracy is not due to practice in general, because accuracy rose only 4.0% for novel control objects over the course of the experiment. The priming is firmly subliminal, because participants were not only unable to name objects on the first presentation, but their four-alternative forced choice (4AFC) performance was not much above chance (28.5%). To model these phenomena, we created a response pathway with fifty states representing names of objects that are used in the experiment, e.g., chair and lamp. We also created a perceptual pathway with states representing visual patterns that correspond to the names in the response pathway. Following the experimental design, every object identity was instantiated in two distinct shapes, and every shape could be in one of nine different visual-field positions, leading to 900 distinct states in the perceptual pathway to model the possible visual stimuli. The following parameters were fit to the data. If two perceptual states, $x_i$ and $x_k$, are the same shape in different positions, they are assigned a similarity coefficient $\gamma_{ik} = 0.95$; all other similarity coefficients are zero. The association frequency, $\alpha$, for valid associations was 22 in the perceptual pathway and 18 in the response pathway. Other parameters were $\beta^p = 0.05$, $\beta^r = 0.01$, and $\eta = 1.0$. The PIT model achieves a good fit to the human experimental data (Figure 2). Specifically, priming is greatest for the same shape in the same position, some priming occurs for the same shape in a different position, and no substantial priming occurs for the different shape.
[Figure 2: (left panel) Data from Bar and Biederman (1998); (right panel) simulation of PIT. White bar: accuracy on first presentation of a prime object. Black bars: the accuracy when the object is repeated, either with the same or different shape, and in the same or different position. Grey bars: accuracy for control objects at the beginning and the end of the experiment. Vertical axis: percent correct naming; legend: first block, second block.]
Figure 3a shows the time course of activation of a stimulus representation in the perceptual pathway when the stimulus is presented for 50 iterations, on both the first and third presentations. The third presentation was chosen instead of the second to make the effect of priming clearer. Even though a shape cannot be named on the first presentation, partial information about the shape may nonetheless be available for report. The 4AFC test of Bar and Biederman provides a more sensitive measure of residual stimulus information. In past work, we modeled forced-choice tasks using a response pathway with only the alternatives under consideration. However, in this experiment, forced-choice performance must be estimated conditional on incorrect naming. In the PIT framework, we achieve this using naming and forced-choice output pathways having output distributions N(t) and F(t), which are linked via the perceptual state, $Y^p(t)$. F(t) must be reestimated with the evidence that N(t) is not the target state. This inference problem is intractable.
We therefore used a shortcut in which a single response pathway is used, augmented with a simple three-node belief net (Figure 3b) to capture the dependence between naming and forced choice. The belief net has a response pathway node $Y^r(t)$ connected to $F(t)$ and $N(t)$, with conditional distribution $P(N(t) = n_i | Y^r(t) = y_j) = \theta \delta_{ij} + (1 - \theta)/|Y^r|$, and an analogous distribution for $P(F(t) = f_i | Y^r(t) = y_j)$. The free parameter θ determines how veridically naming and forced-choice actions reflect response-pathway output. Over a range of θ, θ < 1, the model obtains forced-choice performance near chance on the first presentation when the naming response is incorrect. For example, with θ = 0.72, the model produces a forced-choice accuracy on presentation 1 of 26.1%. (Interestingly, the model also produces below-chance performance on presentation 2 if the object is not named correctly, 23.5%, which is also found in the human data, 20.0%.) Thus, by the stringent criterion of 4AFC, the model shows no access consciousness, and therefore illustrates a dissociation between priming and access consciousness. In our simulation, we followed the procedure of Bar and Biederman by including distractor alternatives with visual and semantic similarity to the target. These distractors are critical: with unrelated distractors, the model’s 4AFC performance is significantly above chance, illustrating that a perceptual representation can be adequate to support some responses but not others, as Farah et al. (1994) also argued.

4.2 Simulation of Abrams and Greenwald (2000)

During an initial phase of the experiment, participants categorized 24 clearly visible target words as pleasant (e.g., HUMOR) or unpleasant (e.g., SMUT). They became quite familiar with the task by categorizing each word a total of eight times. In a second phase, participants were asked to classify the same targets and were given a response deadline to induce errors.
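The three-node belief net used for the Bar and Biederman simulation (Section 4.1) can be sketched numerically. The code below is a minimal illustration, not the authors' implementation. It assumes that N and F are conditionally independent given the response state, and, as a simplification, that F ranges over the same set of states as N rather than over four alternatives. The function names are hypothetical.

```python
import numpy as np


def emission_matrix(n_states, theta):
    """Entry [j, i] is P(N = n_i | Y^r = y_j) = theta * delta_ij + (1 - theta) / n_states."""
    return theta * np.eye(n_states) + (1.0 - theta) / n_states * np.ones((n_states, n_states))


def forced_choice_given_naming_error(p_y, target, theta):
    """P(F = target | N != target) for a distribution p_y over response states.

    ASSUMPTION: F uses the same emission matrix as N and is conditionally
    independent of N given Y^r, as in a three-node net Y^r -> {N, F}.
    """
    p_y = np.asarray(p_y, dtype=float)
    E = emission_matrix(len(p_y), theta)
    p_n_wrong = 1.0 - E[:, target]   # P(N != target | Y^r = y_j)
    p_f_right = E[:, target]         # P(F = target | Y^r = y_j)
    joint = np.sum(p_y * p_n_wrong * p_f_right)
    return joint / np.sum(p_y * p_n_wrong)
```

Conditioning on a naming error shifts probability mass away from the target state, which is one way to see how forced-choice accuracy can fall toward chance when θ < 1.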
The targets were preceded by masked primes that could not be identified. Of interest is the effective valence (or EV) of the target for different prime types, defined as the error rate difference between unpleasant and pleasant targets. A positive (negative) EV indicates that responses are biased toward a pleasant (unpleasant) interpretation by the prime. As one would expect, pleasant primes resulted in a positive EV, unpleasant primes in a negative EV. Of critical interest is the finding that a nonword prime formed by recombining two pleasant targets (e.g., HULIP from HUMOR and TULIP) or unpleasant targets (e.g., BIUT from BILE and SMUT) also served to bias the targets. More surprising, a positive EV resulted from unpleasant prime words formed by recombining two pleasant targets (TUMOR from TULIP and HUMOR), indicating that subliminal priming arises from word fragments, not words as unitary entities, and providing further evidence for an early locus of subliminal priming. Note that the results depend critically on the first phase of the experiment, which gave participants extensive practice on a relatively small set of words that were then used as and recombined to form primes. Words not studied in the first phase (orphans) provided no significant EV effect when used as primes.

Figure 3: (a) Activation of the perceptual representation in PIT as a function of processing iterations on the first (thin solid line) and third (thick solid line) presentations of target. (b) Bayes net for performing 4AFC conditional on incorrect naming response.

Figure 4: Effective valence of primes in the Abrams and Greenwald (2000) experiment for human subjects (black bars) and PIT model (grey bars). HULIP-type primes are almost as strong as target repetitions, and TUMOR-type primes have a positive valence, contrary to the meaning of the word.

In this simulation, we used a three pathway model: a perceptual pathway that maps visual patterns to orthography with 200 input states corresponding to words, nonwords, and nonword recombinations of words; a semantic pathway that maps to 100 distinct lexical/semantic states; and a judgement pathway that maps to two responses, pleasant and unpleasant. In the perceptual pathway, similarity structure was based on letter overlap, so that HULIP was similar to both TULIP and HUMOR, with γ = 0.837. No similarity was assumed in the semantic state representation; consistent with the previous simulation, $\beta^p = .05$, $\beta^s = .01$, $\beta^j = .01$, and η = .01. At the outset of the simulation, α frequencies for correct associations were 15, 19, and 25 in the perceptual, semantic, and judgement pathways. The initial phase of the experiment was simulated by repeated supraliminal presentation of words, which increased the association frequencies in all three pathways through the $\Delta\alpha_{ij}$ learning rule. Long-term supraliminal priming is essential in establishing the association strengths, as we’ll explain. Short-term subliminal priming also plays a key role in the experiment. During the second phase of the experiment, residual activity from the prime, primarily in the judgement pathway, biases the response to the target. Residual activation of the prime is present even if the representation of the prime does not reach sufficient strength that it could be named or otherwise reported. The outcome of the simulation is consistent with the human data (Figure 4). When a HULIP-type prime is presented, HUMOR and TULIP become active in the semantic pathway because of their visual similarity to HULIP.
Partial activation of these two practiced words pushes the judgement pathway toward a pleasant response, resulting in a positive EV. When a TUMOR-type prime is presented, three different words become active in the semantic pathway: HUMOR, TULIP, and TUMOR itself. Although TUMOR is more active, it was not one of the words studied during the initial phase of the experiment, and as a result, it has a relatively weak association to the unpleasant judgement, in contrast to the other two words, which have strong associations to the pleasant judgement. Orphan primes have little effect because they were not studied during the initial phase of the experiment, and consequently their association to pleasant and unpleasant judgements is also weak. In summary, activation of the prime along a critical, well-practiced pathway may not be sufficient to support an overt naming response, yet it may be sufficient to bias the processing of the immediately following target.

5 Discussion

An important contribution of this work has been to demonstrate that specific experimental results relating to access consciousness and subliminal priming can be interpreted in a concrete computational framework. By necessity, the PIT framework, which we previously used to model supraliminal priming data, predicts the existence of subliminal priming, because the mechanisms giving rise to priming depend on degree of activation of a representation, whereas the processes giving rise to access consciousness also depend on the temporal persistence of a representation. Another contribution of this work has been to argue that two previous computational models each tell only part of the story. Farah et al. argue that quality of representation is critical; Dehaene et al. argue that pathways to communicate representations are critical. The PIT framework argues that both of these features are necessary for access consciousness.
Although the PIT framework is not completely developed, it nonetheless makes a clear prediction: that subliminal priming can never be stronger than supraliminal priming, because the maximal activation of subliminal primes is never greater than that of supraliminal primes. One might argue that many theoretical frameworks might predict the same, but no other computational model is sufficiently well developed (in terms of addressing both priming and access consciousness) to make this prediction. In its current stage of development, a weakness of the PIT framework is that it is silent as to how perceptual and response pathways become flexibly interconnected based on task demands. However, the PIT framework is not alone in failing to address this critical issue: the Dehaene et al. model suggests that once a representation enters the global workspace, all response modules can access it, but the model does not specify how the appropriate perceptual module wins the competition to enter the global workspace, or how the appropriate response module is activated. Clearly, flexible cognitive control structures that perform these functions are intricately related to mechanisms of consciousness.

Acknowledgments

This research was supported by NIH/IFOPAL R01 MH61549–01A1.

References

Abrams, R. L., & Greenwald, A. G. (2000). Parts outweigh the whole (word) in unconscious analysis of meaning. Psychological Science, 11(2), 118–124.
Baars, B. (1989). A cognitive theory of consciousness. Cambridge: Cambridge University Press.
Bar, M., & Biederman, I. (1998). Subliminal visual priming. Psychological Science, 9(6), 464–468.
Block, N. (1995). On a confusion about a function of consciousness. Behavioral and Brain Sciences, 18(2), 227–247.
Dehaene, S., & Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: basic evidence and a workspace framework. Cognition, 79, 1–37.
Dehaene, S., Sergent, C., & Changeux, J.-P. (2003). A neuronal network model linking subjective reports and objective physiological data during conscious perception. Proceedings of the National Academy of Sciences, 100, 8520–8525.
Farah, M. J., O’Reilly, R. C., & Vecera, S. P. (1994). Dissociated overt and covert recognition as an emergent property of a lesioned neural network. Psychological Review, 100, 571–588.
Mathis, D. W., & Mozer, M. C. (1996). Conscious and unconscious perception: a computational theory. In G. Cottrell (Ed.), Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society (pp. 324–328). Hillsdale, NJ: Erlbaum Associates.
Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2002). A rational analysis of cognitive control in a speeded discrimination task. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press.
Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2003). Mechanisms of long-term repetition priming and skill refinement: A probabilistic pathway model. In Proceedings of the Twenty-Fifth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum Associates.

6 0.53411329 33 nips-2004-Brain Inspired Reinforcement Learning

7 0.51041341 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

8 0.42674306 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

9 0.39692897 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account

10 0.36165711 28 nips-2004-Bayesian inference in spiking neurons

11 0.34383133 157 nips-2004-Saliency-Driven Image Acuity Modulation on a Reconfigurable Array of Spiking Silicon Neurons

12 0.32057059 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

13 0.31388363 155 nips-2004-Responding to Modalities with Different Latencies

14 0.31007418 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

15 0.3085269 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

16 0.27943543 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

17 0.26886466 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

18 0.26340392 148 nips-2004-Probabilistic Computation in Spiking Populations

19 0.25824881 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

20 0.2578136 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.305), (13, 0.079), (15, 0.093), (26, 0.074), (31, 0.027), (33, 0.115), (35, 0.072), (39, 0.013), (50, 0.049), (52, 0.012), (71, 0.012), (82, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78821743 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

Author: Angela J. Yu, Peter Dayan

Abstract: We study the synthesis of neural coding, selective attention and perceptual decision making. A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and top-down attentional priors, leading to sound perceptual discrimination. The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). The network’s intermediate levels of representation instantiate known physiological properties of visual cortical neurons. The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty.

2 0.70242238 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

Author: Johannes Mohr, Klaus Obermayer

Abstract: The standard approach to the classification of objects is to consider the examples as independent and identically distributed (iid). In many real world settings, however, this assumption is not valid, because a topographical relationship exists between the objects. In this contribution we consider the special case of image segmentation, where the objects are pixels and where the underlying topography is a 2D regular rectangular grid. We introduce a classification method which not only uses measured vectorial feature information but also the label configuration within a topographic neighborhood. Due to the resulting dependence between the labels of neighboring pixels, a collective classification of a set of pixels becomes necessary. We propose a new method called ’Topographic Support Vector Machine’ (TSVM), which is based on a topographic kernel and a self-consistent solution to the label assignment shown to be equivalent to a recurrent neural network. The performance of the algorithm is compared to a conventional SVM on a cell image segmentation task.

3 0.53190559 28 nips-2004-Bayesian inference in spiking neurons

Author: Sophie Deneve

Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation. Many perceptual and motor tasks performed by the central nervous system are probabilistic, and can be described in a Bayesian framework [4, 3]. A few important but hidden properties, such as direction of motion, or appropriate motor commands, are inferred from many noisy, local and ambiguous sensory cues. These evidences are combined with priors about the sensory world and body. Importantly, because most of these inferences should lead to quick and irreversible decisions in a perpetually changing world, noisy cues have to be integrated on-line, but in a way that takes into account unpredictable events, such as a sudden change in motion direction or the appearance of a new stimulus. This raises the question of how this temporal integration can be performed at the neural level. It has been proposed that single neurons in sensory cortices represent and compute the log probability that a sensory variable takes on a certain value (eg Is visual motion in the neuron’s preferred direction?) [9, 7]. Alternatively, to avoid normalization issues and provide an appropriate signal for decision making, neurons could represent the log probability ratio of a particular hypothesis (eg is motion more likely to be towards the right than towards the left) [7, 6]. 
Log probabilities are convenient here, since under some assumptions, independent noisy cues simply combine linearly. Moreover, there is physiological evidence for the neural representation of log probabilities and log probability ratios [9, 6, 7]. However, these models assume that neurons represent probabilities in their firing rates. We argue that it is important to study how probabilistic information is encoded in spikes. Indeed, it seems spurious to marry the idea of an exquisite on-line integration of noisy cues with an underlying rate code that requires averaging over large populations of noisy neurons and long periods of time. In particular, most natural tasks require this integration to take place on the time scale of inter-spike intervals. Spikes signal events more efficiently than analog quantities. In addition, a neural theory of inference with spikes will bring us closer to the physiological level and generate more easily testable predictions. Thus, we propose a new theory of neural processing in which spike trains provide a deterministic, online representation of a log-probability ratio. Spikes signal events, e.g. that the log-probability ratio has exceeded what could be predicted from previous spikes. This form of coding was loosely inspired by the idea of "energy landscape" coding proposed by Hinton and Brown [2]. However, contrary to [2] and other theories using rate-based representation of probabilities, this model is self-consistent and does not require different models for encoding and decoding: as output spikes provide new, unpredictable, temporally independent evidence, they can be used directly as an input to other Bayesian neurons. Finally, we show that these neurons can be used as building blocks in a theory of approximate Bayesian inference in recurrent spiking networks. Connections between neurons implement an underlying Bayesian network, consisting of coupled hidden Markov models.

∗ Institute of Cognitive Science, 69645 Bron, France
Propagation of spikes is a form of belief propagation in this underlying graphical model. Our theory provides computational explanations of some general physiological properties of cortical neurons, such as spike frequency adaptation, Poisson statistics of spike trains, the existence of strong local inhibition in cortical columns, and the maintenance of a tight balance between excitation and inhibition. Finally, we discuss the implications of this model for the debate about temporal versus rate-based neural coding.

1 Spikes and log posterior odds

1.1 Synaptic integration seen as inference in a hidden Markov chain

We propose that each neuron codes for an underlying "hidden" binary variable, $x_t$, whose state evolves over time. We assume that $x_t$ depends only on the state at the previous time step, $x_{t-dt}$, and is conditionally independent of other past states. The state $x_t$ can switch from 0 to 1 with a constant rate $r_{on} = \lim_{dt \to 0} \frac{1}{dt} P(x_t = 1 | x_{t-dt} = 0)$, and from 1 to 0 with a constant rate $r_{off}$. For example, these transition rates could represent how often motion in a preferred direction appears in the receptive field and how long it is likely to stay there. The neuron infers the state of its hidden variable from N noisy synaptic inputs, considered to be observations of the hidden state. In this initial version of the model, we assume that these inputs are conditionally independent homogeneous Poisson processes, synapse i emitting a spike between time t and t + dt ($s_t^i = 1$) with constant probability $q_{on}^i dt$ if $x_t = 1$, and another constant probability $q_{off}^i dt$ if $x_t = 0$. The synaptic spikes are assumed to be otherwise independent of previous synaptic spikes, previous states and spikes at other synapses. The resulting generative model is a hidden Markov chain (figure 1-A).
However, rather than estimating the state of its hidden variable and communicating this estimate to other neurons (for example by emitting a spike when sensory evidence for $x_t = 1$ goes above a threshold), the neuron reports and communicates its certainty that the current state is 1. This certainty takes the form of the log of the ratio of the probability that the hidden state is 1 and the probability that the state is 0, given all the synaptic inputs received so far: $L_t = \log \frac{P(x_t = 1 | s_{0 \to t})}{P(x_t = 0 | s_{0 \to t})}$. We use $s_{0 \to t}$ as a shorthand notation for the N synaptic inputs received at present and in the past. We will refer to $L_t$ as the log odds ratio. Thanks to the conditional independencies assumed in the generative model, we can compute this log odds ratio iteratively. Taking the limit as dt goes to zero, we get the following differential equation:

$\dot{L} = r_{on}(1 + e^{-L}) - r_{off}(1 + e^{L}) + \sum_i w_i \delta(s_t^i - 1) - \theta$

Figure 1: A. Generative model for the synaptic input. B. Schematic representation of log odds ratio encoding and decoding. The dashed circle represents both eventual downstream elements and the self-prediction taking place inside the model neuron. A spike is fired only when $L_t$ exceeds $G_t$. C. One example trial, where the state switches from 0 to 1 (shaded area) and back to 0. Plain: $L_t$, dotted: $G_t$. Black stripes at the top: corresponding spike train. D. Mean log odds ratio (dark line) and mean output firing rate (clear line). E. Output spike raster plot (1 line per trial) and ISI distribution for the neuron shown in C and D. Clear line: ISI distribution for a Poisson neuron with the same rate.

$w_i$, the synaptic weight, describes how informative synapse i is about the state of the hidden variable, e.g. $w_i = \log(q_{on}^i / q_{off}^i)$. Each synaptic spike ($s_t^i = 1$) gives an impulse to the log odds ratio, which is positive if this synapse is more active when the hidden state is 1 (i.e. it increases the neuron’s confidence that the state is 1), and negative if this synapse is more active when $x_t = 0$ (i.e. it decreases the neuron’s confidence that the state is 1). The bias, θ, is determined by how informative it is not to receive any spike, e.g. $\theta = \sum_i (q_{on}^i - q_{off}^i)$. By convention, we will consider that the "bias" is positive or zero (if not, we need simply to invert the status of the state x).

1.2 Generation of output spikes

The spike train should convey a sparse representation of $L_t$, so that each spike reports new information about the state $x_t$ that is not redundant with that reported by other, preceding, spikes. This proposition is based on three arguments: First, spikes, being metabolically expensive, should be kept to a minimum. Second, spikes conveying redundant information would require a decoding of the entire spike train, whereas independent spikes can be taken into account individually. And finally, we seek a self-consistent model, with the spiking output having a similar semantics to its spiking input. To maximize the independence of the spikes (conditioned on $x_t$), we propose that the neuron fires only when the difference between its log odds ratio $L_t$ and a prediction $G_t$ of this log odds ratio based on the output spikes emitted so far reaches a certain threshold. Indeed, supposing that downstream elements predict $L_t$ as best as they can, the neuron only needs to fire when it expects that prediction to be too inaccurate (figure 1-B). In practice, this will happen when the neuron receives new evidence for $x_t = 1$. $G_t$ should thereby follow the same dynamics as $L_t$ when spikes are not received. The equations for $G_t$ and the output $O_t$ ($O_t = 1$ when an output spike is fired) are given by:

$\dot{G} = r_{on}(1 + e^{-L}) - r_{off}(1 + e^{L}) + g_o \delta(O_t - 1)$   (1)

$O_t = 1$ when $L_t > G_t + \frac{g_o}{2}$, 0 otherwise   (2)

Here $g_o$, a positive constant, is the only free parameter, the other parameters being constrained by the statistics of the synaptic input.

1.3 Results

Figure 1-C plots a typical trial, showing the behavior of L, G and O before, during and after presentation of the stimulus. As random synaptic inputs are integrated, L fluctuates and eventually exceeds G + 0.5, leading to an output spike. Immediately after a spike, G jumps to $G + g_o$, which prevents (except in very rare cases) a second spike from immediately following the first. Thus, this "jump" implements a relative refractory period. However, G decays as it tends to converge back to its stable level $g_{stable} = \log(r_{on} / r_{off})$. Thus L eventually exceeds G again, leading to a new spike. This threshold crossing happens more often during stimulation ($x_t = 1$) as the net synaptic input alters to create a higher overall level of certainty, $L_t$.

Mean log odds ratio and output firing rate

The mean firing rate $\bar{O}_t$ of the Bayesian neuron during presentation of its preferred stimulus (i.e. when $x_t$ switches from 0 to 1 and back to 0) is plotted in figure 1-D, together with the mean log posterior ratio $\bar{L}_t$, both averaged over trials. Not surprisingly, the log-posterior ratio reflects the leaky integration of synaptic evidence, with an effective time constant that depends on the transition probabilities $r_{on}$, $r_{off}$. If the state is very stable ($r_{on} = r_{off} \sim 0$), synaptic evidence is integrated over almost infinite time periods, the mean log posterior ratio tending to either increase or decrease linearly with time. In the example in figure 1-D, the state is less stable, so "old" synaptic evidence is discounted and $\bar{L}_t$ saturates. In contrast, the mean output firing rate $\bar{O}_t$ tracks the state of $x_t$ almost perfectly. This is because, as a form of predictive coding, the output spikes reflect the new synaptic evidence, $I_t = \sum_i w_i \delta(s_t^i - 1) - \theta$, rather than the log posterior ratio itself.
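The dynamics of L, G and O described above can be explored with a small Euler simulation. This is a hedged sketch, not the paper's code: the function name, the parameter values and the time step are illustrative assumptions. In particular, the extracted Equation 1 shows L inside the exponentials, whereas the sketch lets G drift according to its own value, following the statement that G follows the same dynamics as L when spikes are not received. Treat that choice as an assumption.

```python
import numpy as np


def simulate_bayesian_neuron(x, q_on, q_off, r_on, r_off, g_o, dt=1e-3, seed=0):
    """Euler simulation of the log odds L, its prediction G, and output spikes O.

    x     : array of 0/1 hidden states, one per time step
    q_on  : per-synapse input rates (Hz) when x = 1
    q_off : per-synapse input rates (Hz) when x = 0
    """
    rng = np.random.default_rng(seed)
    q_on = np.asarray(q_on, dtype=float)
    q_off = np.asarray(q_off, dtype=float)
    w = np.log(q_on / q_off)        # synaptic weights w_i
    theta = np.sum(q_on - q_off)    # bias from receiving no spike
    L = G = np.log(r_on / r_off)    # start both at the stable level
    n = len(x)
    L_trace, G_trace = np.empty(n), np.empty(n)
    O = np.zeros(n, dtype=int)
    for t in range(n):
        rates = q_on if x[t] == 1 else q_off
        s = rng.random(rates.shape) < rates * dt   # synaptic spikes in this step
        drift_L = r_on * (1 + np.exp(-L)) - r_off * (1 + np.exp(L)) - theta
        L += dt * drift_L + w[s].sum()
        # ASSUMPTION: G's drift is evaluated at G, not at L.
        G += dt * (r_on * (1 + np.exp(-G)) - r_off * (1 + np.exp(G)))
        if L > G + g_o / 2:         # firing rule, Equation 2
            O[t] = 1
            G += g_o                # post-spike jump of the prediction
        L_trace[t], G_trace[t] = L, G
    return L_trace, G_trace, O


# Illustrative trial: the hidden state is ON for the middle third.
x = np.zeros(3000, dtype=int)
x[1000:2000] = 1
L_trace, G_trace, spikes = simulate_bayesian_neuron(
    x, q_on=[20.0] * 20, q_off=[10.0] * 20, r_on=1.0, r_off=1.0, g_o=2.0
)
```

With these illustrative values, output spikes should cluster in the interval where the hidden state is 1, since that is when L repeatedly overtakes the decaying prediction G.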
In particular, the mean output firing rate is a rectified linear function of the mean input, e.g. $\bar{O} = \frac{1}{g_o} \bar{I}^{+} = \frac{1}{g_o} \left[ \sum_i w_i q_{on(off)}^i - \theta \right]^{+}$.

Analogy with a leaky integrate and fire neuron

We can get an interesting insight into the computation performed by this neuron by linearizing L and G around their mean levels over trials. Here we reduce the analysis to prolonged, statistically stable periods when the state is constant (either ON or OFF). In this case, the mean level of certainty $\bar{L}$ and its output prediction $\bar{G}$ are also constant over time. We make the rough approximation that the post spike jump, $g_o$, and the input fluctuations are small compared to the mean level of certainty $\bar{L}$. Rewriting $V_t = L_t - G_t + \frac{g_o}{2}$ as the "membrane potential" of the Bayesian neuron:

$\dot{V} = -k_L V + I_t - \Delta g_o - g_o O_t$

where $k_L = r_{on} e^{-\bar{L}} + r_{off} e^{\bar{L}}$, the "leak" of the membrane potential, depends on the overall level of certainty. $\Delta g_o$ is positive and a monotonic increasing function of $g_o$.

Figure 2: A. Bayesian causal network for $y_t$ (tiger), $x_t^1$ (stripes) and $x_t^2$ (paws). B. A feedforward network computing the log posterior for $x_t^1$. C. A recurrent network computing the log posterior odds for all variables. D. Log odds ratio in a simulated trial with the network in C (see text). Thick line: $L_t^{x^2}$, thin line: $L_t^{x^1}$, dash-dotted: $L_t^{x^1}$ without inhibition. Insert: $L_t^{x^2}$ averaged over trials, showing the effect of feedback.

The linearized Bayesian neuron thus acts in its stable regime as a leaky integrate and fire (LIF) neuron. The membrane potential $V_t$ integrates its input, $J_t = I_t - \Delta g_o$, with a leak $k_L$. The neuron fires when its membrane potential reaches a constant threshold $g_o$. After each spike, $V_t$ is reset to 0. Interestingly, for an appropriately chosen compression factor $g_o$, the mean input to the linearized neuron $\bar{J} = \bar{I} - \Delta g_o \approx 0$ (see footnote 1). This means that the membrane potential is purely driven to its threshold by input fluctuations, or a random walk in membrane potential. As a consequence, the neuron’s firing will be memoryless, and close to a Poisson process. In particular, we found a Fano factor close to 1 and a quasi-exponential ISI distribution (figure 1-E) on the entire range of parameters tested. Indeed, LIF neurons with balanced inputs have been proposed as a model to reproduce the statistics of real cortical neurons [8]. This balance is implemented in our model by the neuron’s effective self-inhibition, even when the synaptic input itself is not balanced.

Decoding

As we previously said, downstream elements could predict the log odds ratio $L_t$ by computing $G_t$ from the output spikes (Eq 1, fig 1-B). Of course, this requires an estimate of the transition probabilities $r_{on}$, $r_{off}$, which could be learned from the observed spike trains. However, we show next that explicit decoding is not necessary to perform Bayesian inference in spiking networks. Intuitively, this is because the quantity that our model neurons receive and transmit, e.g. new information, is exactly what probabilistic inference algorithms propagate between connected statistical elements.

Footnote 1: Even if $g_o$ is not chosen optimally, the influence of the drift $\bar{J}$ is usually negligible compared to the large fluctuations in membrane potential.

2 Bayesian inference in cortical networks

The model neurons, having the same input and output semantics, can be used as building blocks to implement more complex generative models consisting of coupled Markov chains. Consider, for example, the example in figure 2-A.
Here, a ”parent” variable x1 t (the presence of a tiger) can cause the state of n other ”children” variables ([xk ]k=2...n ), t of whom two are represented (the presence of stripes,x2 , and motion, x3 ). The ”chilt t dren” variables are Bayesian neurons identical to those described previously. The resulting bayesian network consist of n + 1 coupled hidden Markov chains. Inference in this architecture corresponds to computing the log posterior odds ratio for the tiger, x1 , and the log t posterior of observing stripes or motion, ([xk ]k=2...n ), given the synaptic inputs received t by the entire network so far, i.e. s2 , . . . , sk . 0→t 0→t Unfortunately, inference and learning in this network (and in general in coupled Markov chains) requires very expensive computations, and cannot be performed by simply propagating messages over time and among the variable nodes. In particular, the state of a child k variable xt depends on xk , sk , x1 and the state of all other children at the previous t t t−dt time step, [xj ]2

4 0.52594328 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

Author: Rajesh P. Rao

Abstract: There is growing evidence from psychophysical and neurophysiological studies that the brain utilizes Bayesian principles for inference and decision making. An important open question is how Bayesian inference for arbitrary graphical models can be implemented in networks of spiking neurons. In this paper, we show that recurrent networks of noisy integrate-and-fire neurons can perform approximate Bayesian inference for dynamic and hierarchical graphical models. The membrane potential dynamics of neurons is used to implement belief propagation in the log domain. The spiking probability of a neuron is shown to approximate the posterior probability of the preferred state encoded by the neuron, given past inputs. We illustrate the model using two examples: (1) a motion detection network in which the spiking probability of a direction-selective neuron becomes proportional to the posterior probability of motion in a preferred direction, and (2) a two-level hierarchical network that produces attentional effects similar to those observed in visual cortical areas V2 and V4. The hierarchical model offers a new Bayesian interpretation of attentional modulation in V2 and V4.

5 0.51875639 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

Author: Jochen Triesch

Abstract: This paper explores the computational consequences of simultaneous intrinsic and synaptic plasticity in individual model neurons. It proposes a new intrinsic plasticity mechanism for a continuous activation model neuron based on low-order moments of the neuron’s firing rate distribution. The goal of the intrinsic plasticity mechanism is to enforce a sparse distribution of the neuron’s activity level. In conjunction with Hebbian learning at the neuron’s synapses, the neuron is shown to discover sparse directions in the input.

6 0.51678711 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees

7 0.51456898 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

8 0.51410985 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

9 0.51333052 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

10 0.51280129 69 nips-2004-Fast Rates to Bayes for Kernel Machines

11 0.51230615 131 nips-2004-Non-Local Manifold Tangent Learning

12 0.51143461 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

13 0.51008695 173 nips-2004-Spike-timing Dependent Plasticity and Mutual Information Maximization for a Spiking Neuron Model

14 0.50981998 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

15 0.5082317 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

16 0.5043571 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

17 0.50348777 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits

18 0.50320095 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

19 0.50274217 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

20 0.50104511 148 nips-2004-Probabilistic Computation in Spiking Populations