nips nips2006 nips2006-1 knowledge-graph by maker-knowledge-mining

1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time


Source: pdf

Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro

Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. [sent-13, score-0.773]

2 We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. [sent-14, score-1.148]

3 We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. [sent-15, score-0.59]

4 With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. [sent-16, score-0.575]

5 1 Introduction In the past decade, modern computational Bayesian methods have been productively applied to the modeling of many core psychological phenomena. [sent-17, score-0.177]

6 These areas include similarity modeling and structure learning [1], concept and category learning [2, 3], inductive inference and decision-making [4], language processes [5], and individual differences [6]. [sent-18, score-0.096]

7 One central area that has been less affected is the modeling of response times in decision-making. [sent-19, score-0.334]

8 Nevertheless, the time people take to produce behavior is a basic and ubiquitous measure that can constrain models and theories of human cognitive processes [7]. [sent-20, score-0.228]

9 There is a large and well-developed set of competing models that aim to account for accuracy, response time distributions and (sometimes) confidence in decision-making. [sent-21, score-0.453]

10 However, besides the effective application of hierarchical Bayesian methods to models that assume response times follow a Weibull distribution [8], most of the inference remains frequentist. [sent-22, score-0.365]

11 In particular, sequential sampling models of response time, which are the dominant class in the field, have not adopted modern Bayesian methods for inference. [sent-23, score-0.408]

12 The prominent recent review paper by Ratcliff and Smith, for example, relies entirely on frequentist methods for parameter estimation, and does not go beyond the application of the Bayesian Information Criterion for model selection [9]. [sent-24, score-0.172]

13 f (tα,β,δ,ξ) Evidence (x) α α 0 δ ξ β fβ(tα,β,δ,ξ) Response Time (t) Figure 1: A diffusion model for response time distributions for both decisions in a two-choice decision-making task. [sent-33, score-0.885]

14 Much of the utility, however, in using sequential sampling models to understand decision-making, and their application to practical problems [10], requires making inferences about variations in parameters across subjects, stimuli, or experimental conditions. [sent-35, score-0.229]

15 These inferences would benefit from the principled representation of uncertainty inherent in the Bayesian approach. [sent-36, score-0.112]

16 This means their comparison would benefit from Bayesian methods for model selection that do not approximate model complexity by counting the number of free parameters, as the Bayesian Information Criterion does. [sent-38, score-0.076]

17 In this paper, we present a computational Bayesian approach for Wiener diffusion models [11], which are the most widely used special case of the sequential sampling approach. [sent-39, score-0.499]

18 We apply our Bayesian method to the benchmark data of Ratcliff and Rouder [12], using posterior sampling to draw inferences about the interesting psychological parameters. [sent-40, score-0.548]

19 With the aid of this application, we show that adopting the Bayesian perspective has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. [sent-41, score-0.388]

20 1 The Basic Model The basic one-dimensional diffusion model for accuracy and response time distributions in a two-choice decision-making task is shown in Figure 1. [sent-43, score-0.918]

21 Time, t, progresses from left to right, and includes a fixed offset δ that parameterizes the time taken for the non-decision component of response time, such as the time taken to encode the stimulus and complete a motor response. [sent-45, score-0.513]

22 The decision-making component itself is driven by independent samples from a stationary distribution that represents the evidence the stimulus provides in favor of the two alternative decisions. [sent-46, score-0.184]

23 Evidence sampled from this distribution is accrued over time, leading to a diffusion process that is finally absorbed by boundaries above and below at distances α and β from the origin. [sent-48, score-0.487]

24 The response time distribution is then given by the first-passage distribution p (t | α, β, δ, ξ) = fα (t | α, β, δ, ξ) + fβ (t | α, β, δ, ξ), with the areas under fα and fβ giving the proportion of decisions at each boundary. [sent-49, score-0.471]

25 A natural reparameterization is to consider the starting point of evidence accrual z = (β − α) / (α + β), which is considered a measure of bias, and the boundary separation a = α + β, which is considered a measure of caution. [sent-50, score-0.118]

26 In either case, this basic form of the model has four free parameters: ξ, δ and either α and β or z and a. [sent-51, score-0.08]
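
As a concrete illustration of this parameterization, the sketch below evaluates the classical infinite-sum (large-time series) expression for the two first-passage densities fα and fβ of a Wiener diffusion. This is the "infinite sum method" that the paper's closed-form approximation is compared against, not the approximation itself; the function name, the truncation level, and the unit diffusion coefficient are our own assumptions.

```python
import numpy as np

def wiener_fpt_densities(t, alpha, beta, delta, xi, n_terms=200):
    """First-passage densities (f_alpha, f_beta) at response times t for a
    Wiener diffusion that starts at the origin, drifts at rate xi with unit
    diffusion coefficient, and is absorbed at +alpha (upper) or -beta (lower),
    after a non-decision offset delta.  Classical large-time series, truncated
    at n_terms; it converges slowly for very small decision times."""
    a = alpha + beta                      # boundary separation ("caution")
    # the bias reparameterization would be z = (beta - alpha) / (alpha + beta)
    td = np.atleast_1d(np.asarray(t, dtype=float)) - delta
    td = np.where(td > 0, td, np.nan)     # zero density before the offset

    k = np.arange(1, n_terms + 1)[:, None]
    decay = np.exp(-k**2 * np.pi**2 * td / (2 * a**2))
    series_upper = np.sum(k * np.sin(k * np.pi * alpha / a) * decay, axis=0)
    series_lower = np.sum(k * np.sin(k * np.pi * beta / a) * decay, axis=0)

    common = (np.pi / a**2) * np.exp(-xi**2 * td / 2)
    f_alpha = common * np.exp(+xi * alpha) * series_upper   # upper-boundary decisions
    f_beta = common * np.exp(-xi * beta) * series_lower     # lower-boundary decisions
    return np.nan_to_num(f_alpha), np.nan_to_num(f_beta)
```

The total first-passage distribution p(t | α, β, δ, ξ) is then the sum of the two returned densities, and the areas under them give the two choice proportions.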

27 2 Previous Application to Benchmark Data The evolution of Wiener diffusion models of decision-making has involved a series of additional assumptions to address shortcomings in their ability to capture basic empirical regularities. [sent-53, score-0.458]

28 This evolution is well described by Ratcliff and Rouder [12], who, in their Experiment 1, present a diffusion model analysis of a benchmark data set [8, 13]. [sent-54, score-0.542]

29 In this experiment, three observers completed ten 35-minute sessions, each consisting of ten blocks with 102 trials per block. [sent-55, score-0.061]

30 The task of observers was to decide between ‘bright’ and ‘dark’ responses for simple visual stimuli with different proportions of white dots, given noisy feedback about the accuracy of the responses. [sent-56, score-0.381]

31 In addition, the subjects were required to switch between adherence to ‘speed’ instructions and ‘accuracy’ instructions every two blocks. [sent-58, score-0.667]

32 In accord with the experimental design, Ratcliff and Rouder fitted separate drift rates ξi for each of the i = 1, . . . , 33 stimuli. [sent-59, score-0.34]

33 They fitted separate boundaries for speed and accuracy instructions, but assumed the boundaries were symmetric (i.e., there was no bias), so that αj = βj for j = 1, 2. [sent-62, score-0.513]

34 They fitted one offset δ for all stimuli and instructions. [sent-64, score-0.114]

35 As Ratcliff and Rouder point out, these trends are not accommodated by the basic model without allowing for variation in the parameters. [sent-66, score-0.135]

36 Accordingly, to predict fast errors, the basic model is extended by assuming that the starting point is subject to between-trial variation, and so is convolved with a Gaussian or uniform distribution. [sent-67, score-0.12]

37 Similarly, to predict slow errors, it is assumed that the mean drift rate is also subject to between-trial variation, and so is convolved with a Gaussian distribution. [sent-68, score-0.411]
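
A minimal simulation sketch of how these between-trial variability assumptions are usually implemented (uniform start-point variability, Gaussian drift variability, Euler discretization). The parameter names sz and eta, the step size, and the unit diffusion coefficient are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(alpha, beta, delta, xi, sz=0.5, eta=0.05, dt=0.01,
                   max_steps=200_000):
    """Simulate one trial of the extended diffusion model: the starting point
    is drawn uniformly from a range of width sz and the drift rate from a
    Gaussian with sd eta, then evidence accrues until it hits +alpha or -beta.
    Returns (boundary, rt): boundary is +1 for the upper and -1 for the lower
    boundary; rt includes the non-decision offset delta."""
    x = rng.uniform(-sz / 2, sz / 2)          # between-trial start-point variability
    drift = rng.normal(xi, eta)               # between-trial drift variability
    for step in range(1, max_steps + 1):
        x += drift * dt + rng.normal(0.0, np.sqrt(dt))   # unit diffusion coefficient
        if x >= alpha:
            return +1, delta + step * dt
        if x <= -beta:
            return -1, delta + step * dt
    return 0, float("nan")                    # no absorption within max_steps
```

Setting sz and eta to zero recovers the basic four-parameter model of the previous section.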

38 Instead, we use a new closed-form approximation to the required response time distribution. [sent-71, score-0.364]

39 The key assumption in our approximation is that the evolving diffusion distributions always assume a limiting form f. [sent-74, score-0.47]

40 Wiener diffusion from the origin to boundaries α and β with mean drift rate ξ and variance σ²(t) = t can be represented in this model by defining f(y) = exp(−½ y²), rescaling to a variance σ²(t) = (1/a²) t, and setting µ(t, ξ) = (1/a) ξt + z. [sent-77, score-0.866]

41 Thus the response time distributions for Wiener diffusion in this parameterization [...]. [sent-78, score-0.733]

42 [Figure 2 panel labels: α = β = 10, α = β = 15, α = 15, β = 10, and drift rates ξ; axes: Response Time vs. Response Time Distribution] Figure 2: Comparison between the closed form approximation (dark broken lines) and the infinite sum distributions (light solid lines) for nine realistic combinations of drift rate and boundaries. [sent-81, score-0.533]

43 1 Adequacy of the Wiener Approximation Figure 2 shows the relationship between the response time distributions found by the previous infinite sum method, and those generated by our closed form approximation. [sent-84, score-0.421]

44 For each of the drift rates considered, and boundary combinations α = β = 10, α = β = 15 and α = 15, β = 10, we found the best (least-squares) match between the infinite-sum distribution and those distributions indexed by our approximation. [sent-88, score-0.202]

45 These generating parameter combinations were chosen because they cover the range of the posterior distributions we infer from data later. [sent-89, score-0.214]

46 Figure 2 shows that the approximation provides close matches across these parameterizations, although we note the approximation distributions do seem to use slightly (and apparently systematically) different parameter combinations to generate the best-matching distribution. [sent-90, score-0.197]
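
The matching exercise just described can be reproduced in outline with a simple grid search that minimizes the summed squared difference between two densities evaluated on a common time grid. The helper below is a generic stand-in with our own naming and interface, since the paper's closed-form approximation itself is not reproduced here.

```python
import itertools
import numpy as np

def best_least_squares_match(reference_density, candidate_family, param_grid, t_grid):
    """Grid search for the parameters of `candidate_family` whose density on
    t_grid is closest, in summed squared error, to a reference density on the
    same grid.  `candidate_family(t_grid, **params)` must return an array of
    density values; `param_grid` maps parameter names to candidate values."""
    names = list(param_grid)
    best_err, best_params = np.inf, None
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        err = float(np.sum((candidate_family(t_grid, **params) - reference_density) ** 2))
        if err < best_err:
            best_err, best_params = err, params
    return best_err, best_params
```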

47 1 General Model Our log likelihood function evaluates the density of each response time at the boundary corresponding to its associated decision, and assumes independence, so that ln L(T | α, β, δ, ξ) = Σ_{t ∈ Dα} ln fα(t | α, β, δ, ξ) + Σ_{t ∈ Dβ} ln fβ(t | α, β, δ, ξ). [sent-93, score-0.674]

48 [Figure 3 nodes: drift rates ξi with i = 1, . . . , 33; boundaries αj, βj with j = 1, 2; offset δ; observed response times tijk with k = 1, . . . , n] Figure 3: Graphical model for the benchmark data analysis. [sent-99, score-0.192]
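
A direct transcription of this log likelihood, using the truncated-series densities sketched earlier in place of the paper's closed-form approximation; the small constant eps is our own guard for response times that a parameter setting assigns essentially zero density.

```python
import numpy as np

def log_likelihood(rt_upper, rt_lower, alpha, beta, delta, xi, eps=1e-300):
    """ln L(T | alpha, beta, delta, xi): the sum over upper-boundary response
    times of ln f_alpha plus the sum over lower-boundary response times of
    ln f_beta, assuming independent trials."""
    f_a, _ = wiener_fpt_densities(np.asarray(rt_upper, dtype=float),
                                  alpha, beta, delta, xi)
    _, f_b = wiener_fpt_densities(np.asarray(rt_lower, dtype=float),
                                  alpha, beta, delta, xi)
    return float(np.sum(np.log(f_a + eps)) + np.sum(np.log(f_b + eps)))
```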

49 where Dα and Dβ are the sets of all response times at the upper and lower boundaries, respectively. [sent-100, score-0.429]

50 The graphical model representation of the benchmark data is shown in Figure 3, where the observed response time data are now denoted T = {tijk}. [sent-102, score-0.491]

51 Here i = 1, . . . , 33 indexes the presented stimulus, and j = 1, 2 indexes speed or accuracy instructions. [sent-105, score-0.306]

52 Finally, k = 1, . . . , n indexes all of the trials with this stimulus and instruction combination. [sent-108, score-0.192]

53 We place proper approximations to non-informative distributions on all the parameters, so that they are all essentially flat over the values of interest. [sent-109, score-0.084]

54 Specifically we assume the 33 drift rates are independent and each have a zero mean Gaussian prior with very small precision: ξi ∼ Gaussian (0, τ ), with τ = 10−6. [sent-110, score-0.34]

55 The boundary parameters α and β are given the same priors, but because they are constrained to be positive, their sampling is censored accordingly: αj , βj ∼ Gaussian (0, τ ) ; αj , βj > 0. [sent-111, score-0.134]
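
The paper obtains posterior samples with WinBUGS; a minimal random-walk Metropolis sketch over one condition's parameters (α, β, δ, ξ), with the same style of vague, positivity-censored priors, conveys the idea. The step size, iteration count, starting values, and the positivity constraint on δ are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_prior(alpha, beta, delta, xi, tau=1e-6):
    """Vague zero-mean Gaussian priors with precision tau on all parameters,
    censored so that alpha, beta and delta stay positive."""
    if alpha <= 0 or beta <= 0 or delta <= 0:
        return -np.inf
    return -0.5 * tau * (alpha**2 + beta**2 + delta**2 + xi**2)

def metropolis(rt_upper, rt_lower, n_iter=20_000, step=0.1,
               init=(10.0, 10.0, 0.2, 0.2)):
    """Random-walk Metropolis over theta = (alpha, beta, delta, xi)."""
    theta = np.array(init, dtype=float)
    logp = log_prior(*theta) + log_likelihood(rt_upper, rt_lower, *theta)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.normal(0.0, step, size=4)
        lp = log_prior(*proposal)
        logp_prop = (lp + log_likelihood(rt_upper, rt_lower, *proposal)
                     if np.isfinite(lp) else -np.inf)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept/reject
            theta, logp = proposal, logp_prop
        chain.append(theta.copy())
    return np.array(chain)
```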

56 2 Formalizing Model Construction The Bayesian approach allows us to test the intuitively plausible model construction decisions made previously by Ratcliff and Rouder. [sent-115, score-0.18]

57 We considered the marginal likelihoods, denoted simply L, based on the harmonic mean approximation [15], calculated from three chains of 10⁵ samples from the posterior obtained using WinBUGS. [sent-118, score-0.125]
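
The harmonic mean approximation referred to here can be computed from the per-draw log likelihoods of the retained posterior samples; the sketch below does this on the log scale with logsumexp for numerical stability. (The estimator is known to be high-variance; it appears here only because it is the one the text reports.)

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_likelihood_harmonic(sample_log_likelihoods):
    """Harmonic-mean estimate of ln L from the log likelihoods ln p(T | theta_s)
    of S posterior draws: L is approximately S / sum_s exp(-ln p(T | theta_s))."""
    ll = np.asarray(sample_log_likelihoods, dtype=float)
    return np.log(len(ll)) - logsumexp(-ll)
```

Comparing this quantity across the four model variants listed below is what supports the model-construction decisions discussed next.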

58 • The full model described by Figure 3, with asymmetric boundaries, varying across speed and accuracy instructions. [sent-119, score-0.325]

59 • The restricted model with symmetric boundaries, still varying across instructions, as assumed by Ratcliff and Rouder. [sent-121, score-0.131]

60 • The restricted model with asymmetric boundaries not varying across instructions. [sent-123, score-0.285]

61 • The restricted model with symmetric boundaries not varying across instructions. [sent-125, score-0.268]

62 These marginal log likelihoods make it clear that different boundaries are needed for the speed and accuracy instructions, but it is overly complicated to allow them to be asymmetric (i.e., to allow a bias). [sent-127, score-0.426]

63 We tested the robustness of these values by halving and doubling the prior variances, and using an adapted form of the ‘informative’ priors collated in [16], all of which lead to similar quantitative and identical qualitative conclusions. [sent-130, score-0.074]

64 These results formally justify the model construction decisions made by Ratcliff and Rouder, and the remainder of our analysis applies to this restricted model. [sent-131, score-0.211]

65 3 Posterior Distributions Figure 4 shows the posterior distributions for the symmetric boundaries under both speed and accuracy instructions. [sent-133, score-0.549]

66 These distributions are consistent with traditional analyses and the speed boundary is clearly significantly smaller than the accuracy boundary. [sent-134, score-0.431]

67 We note that, for historical reasons only, Wiener diffusion models have assumed σ²(t) = (0.1)² t. [sent-135, score-0.387]

68 [Figure 4 axes: Boundary Values (0–20) vs. p(α | T), with curves for speed and accuracy] Figure 4: Posterior distributions for the boundaries under speed and accuracy instructions. [sent-137, score-0.429]

69 The main panel of Figure 5 shows the posterior distributions for all 33 drift rate parameters. [sent-138, score-0.591]

70 The posteriors are shown against the vertical axes, with wider bars corresponding to greater density, and are located according to their proportion of white dots on the horizontal axis. [sent-139, score-0.15]

71 The approximately monotonic relationship between drift rate and proportion shows that the model allows stimulus properties to be inferred from the behavioral decision time data, as found by previous analyses. [sent-140, score-0.737]

72 The right hand panel of Figure 5 shows the projection of three of the posterior distributions, labelled 4, 17 and 22. [sent-141, score-0.166]

73 It is interesting to note that the uncertainties about the drift rates of stimuli 4 and 17 both take a Gaussian form, but with very different variances. [sent-142, score-0.456]

74 More dramatically, the uncertainty about the drift rate for stimulus 22 is clearly bi-modal. [sent-143, score-0.516]

75 9 1 p(ξiT) Figure 5: Posterior distributions for the 33 drift rates, in terms of the proportion of white dots in their associated stimuli. [sent-159, score-0.541]

76 4 Accuracy and Fast and Slow Errors Figure 6 follows previous analyses of these data, and shows the relationship between the empirical proportions and the model predictions of decision proportions for each stimulus type. [sent-161, score-0.376]

77 For both the speed and accuracy instructions, there is close agreement between the model and data. [sent-162, score-0.246]

78 Figure 7 shows the posterior predictive distribution of the model for two cases, analogous to those highlighted previously [13, Figure 6]. [sent-163, score-0.155]

79 The left panel involves a relatively easy decision, corresponding to stimulus number 22 under speed instructions, and shows the model's predictions for the response time for both correct (upper) and error (lower) decisions, together with the data, indicated by short vertical lines. [sent-164, score-0.695]

80 For this easy decision, it can be seen the model predicts relatively fast errors. [sent-165, score-0.074]

81 The right panel of Figure 7 involves a harder decision, corresponding to stimulus number 18 under accuracy instructions. [sent-166, score-0.22]

82 [Figure 6 axes: empirical vs. modeled accuracy, 0–1] Figure 6: Relationship between modeled and empirical accuracy, for the speed instructions (left panel) and accuracy instructions (right panel). [sent-182, score-0.846]

83 Here the model predicts much slower errors, with a heavier tail than for the easy decision. [sent-185, score-0.074]

84 These are the basic qualitative properties of prediction that motivated the introduction of between-trial variability through noise processes in the traditional account. [sent-186, score-0.196]

85 In the present Bayesian treatment, the required predictions are achieved because the posterior predictive automatically samples from a range of values for the drift and boundary parameters. [sent-187, score-0.53]
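
A sketch of how such a posterior predictive can be generated in practice: for each retained (ideally thinned) posterior draw of (α, β, δ, ξ), simulate a handful of trials with exactly those values and pool the resulting upper- and lower-boundary response times. This reuses the simulate_trial sketch above with its extra variability switched off, and treats the upper boundary as the correct response purely for illustration.

```python
import numpy as np

def posterior_predictive_rts(posterior_draws, n_trials_per_draw=20):
    """Pool simulated response times across posterior draws of
    (alpha, beta, delta, xi).  Because every draw carries different drift and
    boundary values, the pooled predictive exhibits the fast- and slow-error
    patterns without separate between-trial variability parameters."""
    upper_rts, lower_rts = [], []
    for alpha, beta, delta, xi in posterior_draws:
        for _ in range(n_trials_per_draw):
            boundary, rt = simulate_trial(alpha, beta, delta, xi, sz=0.0, eta=0.0)
            if boundary == 0:
                continue                       # no absorption; skip this trial
            (upper_rts if boundary == +1 else lower_rts).append(rt)
    return np.array(upper_rts), np.array(lower_rts)
```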

86 By representing this variation in parameters as uncertainty about fixed values, we are making different basic assumptions from the traditional Wiener diffusion model. [sent-188, score-0.508]

87 It is interesting to speculate that, if Bayesian results like those in Figure 7 had always been available, the introduction of additional variability processes described in [12] might never have eventuated. [sent-189, score-0.123]

88 These processes seem solely designed to account for empirical effects like the cross-over effect; in particular, we are not aware of the parameters of the additional variability processes being used to draw substantive psychological inferences from data. [sent-190, score-0.458]

89 [Figure 7 axes: Response Time (ms), 0–1500 in the left panel and 0–6000 in the right panel] Figure 7: Posterior predictive distributions for both correct (solid line) and error (broken line) responses, for two stimuli corresponding to easy (left panel) and hard (right panel) decisions. [sent-191, score-0.231]

90 The density for error decisions in the easy responses has been scaled to allow its shape to be visible. [sent-192, score-0.182]

91 5 Conclusions Our analyses of the benchmark data confirm many of the central conclusions of previous analyses, but also make several new contributions. [sent-193, score-0.216]

92 The posterior distributions shown in Figure 5 suggest that current parametric assumptions about drift rate variability may not be entirely appropriate. [sent-194, score-0.642]

93 In particular, there is the intriguing possibility of multi-modalities evident in the drift rate of stimulus 22, and the associated raw data in Figure 7. [sent-195, score-0.484]

94 Figure 5 also suggests a hierarchical account of the benchmark data, modeling the 33 drift rates ξi in terms of, for example, a low-dimensional psychometric function. [sent-196, score-0.605]
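
One way to formalize that suggestion is to tie all 33 drift rates to stimulus brightness through a psychometric function with only two or three parameters, so the hierarchy replaces 33 free ξi with the function's parameters. The saturating form below is purely illustrative, not the authors' choice.

```python
import numpy as np

def drift_from_proportion(p_white, gain, asymptote):
    """Illustrative low-dimensional psychometric mapping from the proportion of
    white dots in a stimulus to its drift rate: an odd, saturating function of
    the distance from the neutral proportion 0.5, so that clearly 'bright' and
    clearly 'dark' stimuli receive large positive and negative drifts."""
    return asymptote * np.tanh(gain * (np.asarray(p_white, dtype=float) - 0.5))
```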

95 It should also be possible to introduce contaminant distributions in a mixture model, following previous suggestions [8, 14], using latent variable assignments for each response time. [sent-198, score-0.401]

96 If it was desirable to replicate the current assumptions of starting-point and drift-rate variability, that would also easily be done in an extended hierarchical account. [sent-199, score-0.065]

97 Acknowledgments We thank Jeff Rouder for supplying the benchmark data, and Scott Brown, E. [sent-201, score-0.154]

98 A comparison of sequential sampling models for two–choice reaction time. [sent-274, score-0.204]

99 The effects of aging on reaction time in a signal detection task. [sent-280, score-0.153]

100 Estimating parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. [sent-300, score-0.532]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('diffusion', 0.35), ('ratcliff', 0.348), ('instructions', 0.319), ('drift', 0.307), ('response', 0.259), ('rouder', 0.203), ('wiener', 0.184), ('benchmark', 0.154), ('stimulus', 0.143), ('boundaries', 0.137), ('psychological', 0.135), ('decisions', 0.114), ('bayesian', 0.113), ('accuracy', 0.105), ('speed', 0.103), ('posterior', 0.089), ('distributions', 0.084), ('stimuli', 0.083), ('ln', 0.08), ('inferences', 0.08), ('boundary', 0.077), ('panel', 0.077), ('bulletin', 0.076), ('variability', 0.069), ('psychonomic', 0.069), ('review', 0.064), ('analyses', 0.062), ('observers', 0.061), ('grif', 0.06), ('psychology', 0.06), ('proportion', 0.058), ('aging', 0.058), ('contaminant', 0.058), ('tijk', 0.058), ('wagenmakers', 0.058), ('weiner', 0.058), ('sampling', 0.057), ('reaction', 0.055), ('cognitive', 0.055), ('sequential', 0.055), ('variation', 0.055), ('processes', 0.054), ('proportions', 0.053), ('indexing', 0.049), ('asymmetric', 0.048), ('white', 0.047), ('irvine', 0.046), ('ths', 0.046), ('navarro', 0.046), ('steyvers', 0.046), ('dots', 0.045), ('tted', 0.044), ('quantitative', 0.043), ('modeling', 0.042), ('basic', 0.042), ('decision', 0.042), ('combinations', 0.041), ('evidence', 0.041), ('convolved', 0.04), ('prominent', 0.04), ('time', 0.04), ('aid', 0.038), ('relationship', 0.038), ('model', 0.038), ('models', 0.037), ('errors', 0.037), ('behavioral', 0.037), ('dealing', 0.036), ('easy', 0.036), ('hierarchical', 0.036), ('sciences', 0.036), ('approximation', 0.036), ('parameterizations', 0.035), ('sa', 0.035), ('rate', 0.034), ('account', 0.033), ('draw', 0.033), ('likelihoods', 0.033), ('rates', 0.033), ('times', 0.033), ('uncertainty', 0.032), ('responses', 0.032), ('restricted', 0.031), ('symmetric', 0.031), ('australia', 0.031), ('broken', 0.031), ('brown', 0.031), ('guide', 0.031), ('offset', 0.031), ('qualitative', 0.031), ('varying', 0.031), ('entirely', 0.03), ('slow', 0.03), ('assumptions', 0.029), ('required', 0.029), ('dark', 0.028), ('ms', 0.028), ('construction', 0.028), ('predictive', 0.028), ('accordingly', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro

Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1

2 0.15208153 128 nips-2006-Manifold Denoising

Author: Matthias Hein, Markus Maier

Abstract: We consider the problem of denoising a noisily sampled submanifold M in Rd , where the submanifold M is a priori unknown and we are only given a noisy point sample. The presented denoising algorithm is based on a graph-based diffusion process of the point sample. We analyze this diffusion process using recent results about the convergence of graph Laplacians. In the experiments we show that our method is capable of dealing with non-trivial high-dimensional noise. Moreover using the denoising algorithm as pre-processing method we can improve the results of a semi-supervised learning algorithm. 1

3 0.10546398 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

4 0.095776744 192 nips-2006-Theory and Dynamics of Perceptual Bistability

Author: Paul R. Schrater, Rashmi Sundareswara

Abstract: Perceptual Bistability refers to the phenomenon of spontaneously switching between two or more interpretations of an image under continuous viewing. Although switching behavior is increasingly well characterized, the origins remain elusive. We propose that perceptual switching naturally arises from the brain’s search for best interpretations while performing Bayesian inference. In particular, we propose that the brain explores a posterior distribution over image interpretations at a rapid time scale via a sampling-like process and updates its interpretation when a sampled interpretation is better than the discounted value of its current interpretation. We formalize the theory, explicitly derive switching rate distributions and discuss qualitative properties of the theory including the effect of changes in the posterior distribution on switching rates. Finally, predictions of the theory are shown to be consistent with measured changes in human switching dynamics to Necker cube stimuli induced by context.

5 0.08891385 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters.

6 0.063964896 154 nips-2006-Optimal Change-Detection and Spiking Neurons

7 0.063897096 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

8 0.060886323 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

9 0.060110271 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

10 0.05907166 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

11 0.058765262 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

12 0.058306213 113 nips-2006-Learning Structural Equation Models for fMRI

13 0.055820785 80 nips-2006-Fundamental Limitations of Spectral Clustering

14 0.055790927 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

15 0.054955121 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

16 0.054650109 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

17 0.053101104 41 nips-2006-Bayesian Ensemble Learning

18 0.052357152 31 nips-2006-Analysis of Contour Motions

19 0.051775549 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

20 0.051095378 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.18), (1, -0.036), (2, 0.04), (3, -0.042), (4, -0.001), (5, -0.021), (6, 0.097), (7, -0.021), (8, -0.011), (9, -0.035), (10, 0.068), (11, 0.086), (12, -0.0), (13, 0.02), (14, -0.133), (15, -0.005), (16, 0.04), (17, -0.031), (18, -0.145), (19, -0.07), (20, 0.025), (21, 0.02), (22, 0.06), (23, -0.136), (24, 0.063), (25, 0.05), (26, 0.06), (27, 0.041), (28, -0.084), (29, -0.076), (30, -0.023), (31, -0.121), (32, 0.095), (33, 0.089), (34, -0.004), (35, 0.017), (36, -0.025), (37, 0.124), (38, -0.151), (39, 0.007), (40, -0.019), (41, 0.16), (42, -0.038), (43, 0.01), (44, 0.2), (45, 0.093), (46, 0.106), (47, -0.095), (48, -0.006), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94746 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro

Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1

2 0.67651093 192 nips-2006-Theory and Dynamics of Perceptual Bistability

Author: Paul R. Schrater, Rashmi Sundareswara

Abstract: Perceptual Bistability refers to the phenomenon of spontaneously switching between two or more interpretations of an image under continuous viewing. Although switching behavior is increasingly well characterized, the origins remain elusive. We propose that perceptual switching naturally arises from the brain’s search for best interpretations while performing Bayesian inference. In particular, we propose that the brain explores a posterior distribution over image interpretations at a rapid time scale via a sampling-like process and updates its interpretation when a sampled interpretation is better than the discounted value of its current interpretation. We formalize the theory, explicitly derive switching rate distributions and discuss qualitative properties of the theory including the effect of changes in the posterior distribution on switching rates. Finally, predictions of the theory are shown to be consistent with measured changes in human switching dynamics to Necker cube stimuli induced by context.

3 0.53599507 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

Author: Michael C. Mozer, Michael Shettel, Michael P. Holmes

Abstract: Categorization is a central activity of human cognition. When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Specifically, when experimental subjects are shown an exemplar of some target category, the category prototype appears to be pulled toward the exemplar, and the prototypes of all nontarget categories appear to be pushed away. These push and pull effects diminish with experience, and likely reflect long-term learning of category boundaries. We propose and evaluate four principled probabilistic (Bayesian) accounts of context effects in categorization. In all four accounts, the probability of an exemplar given a category is encoded as a Gaussian density in feature space, and categorization involves computing category posteriors given an exemplar. The models differ in how the uncertainty distribution of category prototypes is represented (localist or distributed), and how it is updated following each experience (using a maximum likelihood gradient ascent, or a Kalman filter update). We find that the distributed maximum-likelihood model can explain the key experimental phenomena. Further, the model predicts other phenomena that were confirmed via reanalysis of the experimental data. Categorization is a key cognitive activity. We continually make decisions about characteristics of objects and individuals: Is the fruit ripe? Does your friend seem unhappy? Is your car tire flat? When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Intuitive naturalistic scenarios in which context effects occur are easy to imagine. For example, if one lifts a medium-weight object after lifting a light-weight or heavy-weight object, the medium weight feels heavier following the light weight than following the heavy weight. Although the object-contrast effect might be due to fatigue of sensory-motor systems, many context effects in categorization are purely cognitive and cannot easily be attributed to neural habituation. For example, if you are reviewing a set of conference papers, and the first three in the set are dreadful, then even a mediocre paper seems like it might be above threshold for acceptance. Another example of a category boundary shift due to context is the following. Suppose you move from San Diego to Pittsburgh and notice that your neighbors repeatedly describe muggy, somewhat overcast days as ”lovely.” Eventually, your notion of what constitutes a lovely day accommodates to your new surroundings. As we describe shortly, experimental studies have shown a fundamental link between context effects in categorization and long-term learning of category boundaries. We believe that context effects can be viewed as a reflection of a trial-to-trial learning, and the cumulative effect of these trial-to-trial modulations corresponds to what we classically consider to be category learning. Consequently, any compelling model of category learning should also be capable of explaining context effects. 1 Experimental Studies of Context Effects in Categorization Consider a set of stimuli that vary along a single continuous dimension. Throughout this paper, we use as an illustration circles of varying diameters, and assume four categories of circles defined ranges of diameters; call them A, B, C, and D, in order from smallest to largest diameter. 
In a classification paradigm, experimental subjects are given an exemplar drawn from one category and are asked to respond with the correct category label (Zotov, Jones, & Mewhort, 2003). After making their response, subjects receive feedback as to the correct label, which we’ll refer to as the target. In a production paradigm, subjects are given a target category label and asked to produce an exemplar of that category, e.g., using a computer mouse to indicate the circle diameter (Jones & Mewhort, 2003). Once a response is made, subjects receive feedback as to the correct or true category label for the exemplar they produced. Neither classification nor production task has sequential structure, because the order of trial is random in both experiments. The production task provides direct information about the subjects’ internal representations, because subjects are producing exemplars that they consider to be prototypes of a category, whereas the categorization task requires indirect inferences to be made about internal representations from reaction time and accuracy data. Nonetheless, the findings in the production and classification tasks mirror one another nicely, providing converging evidence as to the nature of learning. The production task reveals how mental representations shift as a function of trial-to-trial sequences, and these shifts cause the sequential pattern of errors and response times typically observed in the classification task. We focus on the production task in this paper because it provides a richer source of data. However, we address the categorization task with our models as well. Figure 1 provides a schematic depiction of the key sequential effects in categorization. The horizontal line represents the stimulus dimension, e.g., circle diameter. The dimension is cut into four regions labeled with the corresponding category. The category center, which we’ll refer to as the prototype, is indicated by a vertical dashed line. The long solid vertical line marks the current exemplar—whether it is an exemplar presented to subjects in the classification task or an exemplar generated by subjects in the production task. Following an experimental trial with this exemplar, category prototypes appear to shift: the target-category prototype moves toward the exemplar, which we refer to as a pull effect, and all nontarget-category prototypes move away from the exemplar, which we refer to as a push effect. Push and pull effects are assessed in the production task by examining the exemplar produced on the following trial, and in the categorization task by examining the likelihood of an error response near category boundaries. The set of phenomena to be explained are as follows, described in terms of the production task. All numerical results referred to are from Jones and Mewhort (2003). This experiment consisted of 12 blocks of 40 trials, with each category label given as target 10 times within a block. • Within-category pull: When a target category is repeated on successive trials, the exemplar generated on the second trial moves toward the exemplar generated on the first trial, with respect to the true category prototype. Across the experiment, a correlation coefficient of 0.524 is obtained, and remains fairly constant over trials. 
• Between-category push: When the target category changes from one trial to the next, the exemplar generated on the second trial moves away from the exemplar generated on the first trial (or equivalently, from the prototype of the target category on the first trial). Figure 2a summarizes the sequential push effects from Jones and Mewhort. The diameter of the circle produced on trial t is plotted as a function of the target category on trial t − 1, with one line for each of the four trial t targets. The mean diameter for each target category is subtracted out, so the absolute vertical offset of each line is unimportant. The main feature of the data to note is that all four curves have a negative slope, which has the following meaning: the smaller that target t − 1 is (i.e., the further to the left on the x axis in Figure 1), the larger the response to target t is (further to the right in Figure 1), and vice versa, reflecting a push away from target t − 1. Interestingly and importantly, the magnitude of the push increases with the ordinal distance between targets t − 1 and t. Figure 2a is based on data from only eight subjects and is therefore noisy, though the effect is statistically reliable. As further evidence, Figure 2b shows data from a categorization task (Zotov et al., 2003), where the y-axis is a different dependent measure, but the negative slope has the same interpretation as in Figure 2a.

[Figure 1: Schematic depiction of sequential effects in categorization; the stimulus dimension (e.g., circle diameter) is divided into regions labeled A–D. Figure 2: Push effect data from (a) the production task of Jones and Mewhort (2003), (b) the classification task of Zotov et al. (2003), and (c)–(f) the models proposed in this paper (KFU-local, KFU-distrib, MLGA-local, MLGA-distrib). The y axis is the deviation of the response from the mean, as a proportion of the total category width, plotted against the previous category label. The response to category A is solid red, B is dashed magenta, C is dash-dotted blue, and D is dotted green.]

• Push and pull effects are not solely a consequence of errors or experimenter feedback. In quantitative estimation of push and pull effects, trial t is included in the data only if the response on trial t − 1 is correct. Thus, the effects follow trials in which no error feedback is given to the subjects, and therefore the adjustments are not due to explicit error correction.

• Push and pull effects diminish over the course of the experiment. The magnitude of push effects can be measured by the slope of the regression lines fit to the data in Figure 2a. The slopes get shallower over successive trial blocks. The magnitude of pull effects can be measured by the standard deviation (SD) of the produced exemplars, which also decreases over successive trial blocks.

• Accuracy increases steadily over the course of the experiment, from 78% correct responses in the first block to 91% in the final block.
This improvement occurs despite the fact that error feedback is relatively infrequent and becomes even less frequent as performance improves.

2 Four Models

In this paper, we explore four probabilistic (Bayesian) models to explain the data described in the previous section. The key phenomenon to explain turns out to be the push effect, for which three of the four models fail to account. Modelers typically discard the models that they reject, and present only their pet model. In this work, we find it useful to report on the rejected models for three reasons. First, they help to set up and motivate the one successful model. Second, they include several obvious candidates, and we therefore have the imperative to address them. Third, in order to evaluate a model that can explain certain data, one needs to know the degree to which the data constrain the space of models. If many models exist that are consistent with the data, one has little reason to prefer our pet candidate.

Underlying all of the models is a generative probabilistic framework in which a category i is represented by a prototype value, di, on the dimension that discriminates among the categories. In the example used throughout this paper, the dimension is the diameter of a circle (hence the notation d for the prototype). An exemplar, E, of category i is drawn from a Gaussian distribution with mean di and variance vi, denoted E ∼ N(di, vi). Category learning involves determining d ≡ {di}. In this work, we assume that the {vi} are fixed and given. Because d is unknown at the start of the experiment, it is treated as the value of a random vector, D ≡ {Di}. Figure 3a shows a simple graphical model representing the generative framework, in which E is the exemplar and C the category label. To formalize our discussion so far, we adopt the following notation:

P(E | C = c, D = d) ∼ N(hc d, vc),    (1)

where, for the time being, hc is a unary column vector all of whose elements are zero except for element c, which has value 1. (Subscripts may indicate either an index over elements of a vector, or an index over vectors. Boldface is used for vectors and matrices.)

[Figure 3: (a) Graphical model depicting selection of an exemplar, E, of a category, C, based on the prototype vector, D; (b) dynamic version of the model indexed by trials, t.]

We assume that the prototype representation, D, is multivariate Gaussian, D ∼ N(Ψ, Σ), where Ψ and Σ encode knowledge—and uncertainty in the knowledge—of the category prototype structure. Given this formulation, the uncertainty in D can be integrated out:

P(E | C = c) ∼ N(hc Ψ, hc Σ hcᵀ + vc).    (2)

For the categorization task, a category label can be assigned by evaluating the category posterior, P(C | E), via Bayes rule, Equation 1, and the category priors, P(C). In this framework, learning takes place via trial-to-trial adaptation of the category prototype distribution, D. In Figure 3b, we add the subscript t to each random variable to denote the trial, yielding a dynamic graphical model for the sequential updating of the prototype vector, Dt. (The reader should be attentive to the fact that we use subscripted indices to denote both trials and category labels. We generally use the index t to denote trial, and c or i to denote a category label.) The goal of our modeling work is to show that the sequential updating process leads to context effects, such as the push and pull effects discussed earlier. We propose four alternative models to explore within this framework.
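As a concrete illustration of this generative framework, the sketch below (Python with NumPy/SciPy; our own illustrative code, not the authors' implementation) evaluates the category posterior P(C | E) with the prototype uncertainty integrated out, as in Equation 2. The prototype means, covariance, and category variances used in the example are hypothetical placeholder values.

```python
import numpy as np
from scipy.stats import norm

def category_posterior(e, psi, sigma, h, v, prior=None):
    """Posterior P(C = c | E = e) under Equation 2:
    P(E | C = c) = N(h_c Psi, h_c Sigma h_c^T + v_c), combined with the prior via Bayes rule.
    h is a (num_categories x dim) array whose rows are the measurement vectors h_c."""
    n_cat = h.shape[0]
    prior = np.full(n_cat, 1.0 / n_cat) if prior is None else np.asarray(prior)
    means = h @ psi                                        # predicted exemplar mean per category
    variances = np.einsum('ci,ij,cj->c', h, sigma, h) + v  # h_c Sigma h_c^T + v_c
    likelihood = norm.pdf(e, loc=means, scale=np.sqrt(variances))
    joint = likelihood * prior
    return joint / joint.sum()

# Localist illustration: four categories with prototype means 1.5, 2.5, 3.5, 4.5 (placeholders).
h_local = np.eye(4)
psi = np.array([1.5, 2.5, 3.5, 4.5])
sigma = 0.01 * np.eye(4)          # prototype uncertainty
v = np.full(4, 0.4 ** 2)          # intrinsic category variance v_c
print(category_posterior(e=2.3, psi=psi, sigma=sigma, h=h_local, v=v))
```

With a localist hc, the posterior reduces to comparing Gaussian likelihoods centered on the prototype means, with variances inflated by the prototype uncertainty; the same function applies unchanged to the distributed representation introduced below, since only h changes.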
The four models are obtained via the Cartesian product of two binary choices: the learning rule and the prototype representation.

2.1 Learning rule

The first learning rule, maximum likelihood gradient ascent (MLGA), attempts to adjust the prototype representation so as to maximize the log posterior of the category given the exemplar. (The category, C = c, is the true label associated with the exemplar, i.e., either the target label the subject was asked to produce, or—if an error was made—the actual category label the subject did produce.) Gradient ascent is performed in all parameters of Ψ and Σ:

Δψi = εψ ∂ log P(c|e) / ∂ψi   and   Δσij = εσ ∂ log P(c|e) / ∂σij,    (3)

where εψ and εσ are step sizes. To ensure that Σ remains a covariance matrix, constrained gradient steps are applied. The constraints are: (1) diagonal terms are nonnegative, i.e., σi² ≥ 0; (2) off-diagonal terms are symmetric, i.e., σij = σji; and (3) the matrix remains positive definite, ensured by −1 ≤ σij/(σi σj) ≤ 1.

The second learning rule, a Kalman filter update (KFU), reestimates the uncertainty distribution of the prototypes given evidence provided by the current exemplar and category label. To draw the correspondence between our framework and a Kalman filter: the exemplar is a scalar measurement that pops out of the filter, the category prototypes are the hidden state of the filter, the measurement noise is vc, and the linear mapping from state to measurement is achieved by hc. Technically, the model is a measurement-switched Kalman filter, where the switching is determined by the category label c, i.e., the measurement function, hc, and noise, vc, are conditioned on c. The Kalman filter also allows temporal dynamics via the update equation, dt = A dt−1, as well as internal process noise, whose covariance matrix is often denoted Q in standard Kalman filter notation. We investigated the choice of A and Q, but because they did not impact the qualitative outcome of the simulations, we used A = I and Q = 0. Given the correspondence we've established, the KFU equations—which specify Ψt+1 and Σt+1 as a function of ct, et, Ψt, and Σt—can be found in an introductory text (e.g., Maybeck, 1979). (A code sketch of both update rules is given below.)

[Figure 4: Change to a category prototype for each category (A–D on the x axis, trial t) following a trial of a given category (one panel per trial t − 1 target, A–D; y axis: prototype movement). Solid (open) bars indicate trials in which the exemplar is larger (smaller) than the prototype.]

2.2 Representation of the prototype

The prototype representation that we described is localist: there is a one-to-one correspondence between the prototype for each category i and the random variable Di. To select the appropriate prototype given a current category c, we defined the unary vector hc and applied hc as a linear transform on D. The identical operations can be performed in conjunction with a distributed representation of the prototype. But we step back momentarily to motivate the distributed representation. The localist representation suffers from a key weakness: it does not exploit interrelatedness constraints on category structure. The task given to experimental subjects specifies that there are four categories, and they have an ordering; the circle diameters associated with category A are smaller than the diameters associated with B, etc. Consequently, dA < dB < dC < dD. One might make a further assumption that the category prototypes are equally spaced.
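Before turning to how these ordering and spacing constraints are exploited, here is a sketch of the two learning rules just described (illustrative Python, not the authors' code). The KFU step is a standard measurement-switched Kalman filter update; the MLGA step ascends log P(c | e), with the gradient computed by finite differences for brevity and the constrained update of Σ omitted. The step size is a placeholder.

```python
import numpy as np

def kfu_update(e, c, psi, sigma, h, v):
    """Kalman filter update (KFU) for a trial with exemplar e and category label c,
    assuming A = I and zero process noise as in the text."""
    hc = h[c]                                   # measurement vector for category c
    s = hc @ sigma @ hc + v[c]                  # innovation variance
    k = sigma @ hc / s                          # Kalman gain
    psi_new = psi + k * (e - hc @ psi)
    sigma_new = sigma - np.outer(k, hc @ sigma)
    return psi_new, sigma_new

def log_category_posterior(e, c, psi, sigma, h, v):
    """log P(c | e) under Equation 2 with a uniform category prior."""
    means = h @ psi
    variances = np.einsum('ci,ij,cj->c', h, sigma, h) + v
    logp = -0.5 * ((e - means) ** 2 / variances + np.log(2 * np.pi * variances))
    return logp[c] - np.logaddexp.reduce(logp)

def mlga_update(e, c, psi, sigma, h, v, step=0.0075, delta=1e-5):
    """Maximum-likelihood gradient-ascent (MLGA) step on Psi only (Equation 3),
    using a central finite-difference gradient."""
    grad = np.zeros_like(psi)
    for i in range(len(psi)):
        d = np.zeros_like(psi)
        d[i] = delta
        grad[i] = (log_category_posterior(e, c, psi + d, sigma, h, v)
                   - log_category_posterior(e, c, psi - d, sigma, h, v)) / (2 * delta)
    return psi + step * grad, sigma
```

In the localist case, the one-hot hc means the KFU step modifies only the target prototype's mean and variance, which is the mechanism behind the flat sequential-dependency function reported in Section 4.1.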
Exploiting these two sources of domain knowledge leads to the distributed representation of category structure. A simple sort of distributed representation involves defining the prototype for category i not as di but as a linear function of an underlying two-dimensional state-space representation of structure. In this state space, d1 indicates the distance between categories and d2 an offset for all categories. This representation of state can be achieved by applying Equation 1 and defining hc = (nc , 1), where nc is the ordinal position of the category (nA = 1, nB = 2, etc.). We augment this representation with a bit of redundancy by incorporating not only the ordinal positions but also the reverse ordinal positions; this addition yields a symmetry in the representation between the two ends of the ordinal category scale. As a result of this augmentation, d becomes a three-dimensional state space, and hc = (nc , N + 1 − nc , 1), where N is the number of categories. To summarize, both the localist and distributed representations posit the existence of a hidden-state space—unknown at the start of learning—that specifies category prototypes. The localist model assumes one dimension in the state space per prototype, whereas the distributed model assumes fewer dimensions in the state space—three, in our proposal—than there are prototypes, and computes the prototype location as a function of the state. Both localist and distributed representations assume a fixed, known {hc } that specify the interpretation of the state space, or, in the case of the distributed model, the subject’s domain knowledge about category structure. 3 Simulation Methodology We defined a one-dimensional feature space in which categories A-D corresponded to the ranges [1, 2), [2, 3), [3, 4), and [4, 5), respectively. In the human experiment, responses were considered incorrect if they were smaller than A or larger than D; we call these two cases out-of-bounds-low (OOBL) and out-of-bounds-high (OOBH). OOBL and OOBH were treated as two additional categories, resulting in 6 categories altogether for the simulation. Subjects and the model were never asked to produce exemplars of OOBL or OOBH, but feedback was given if a response fell into these categories. As in the human experiment, our simulation involved 480 trials. We performed 100 replications of each simulation with identical initial conditions but different trial sequences, and averaged results over replications. All prototypes were initialized to have the same mean, 3.0, at the start of the simulation. Because subjects had some initial practice on the task before the start of the experimental trials, we provided the models with 12 initial trials of a categorization (not production) task, two for each of the 6 categories. (For the MLGA models, it was necessary to use a large step size on these trials to move the prototypes to roughly the correct neighborhood.) To perform the production task, the models must generate an exemplar given a category. It seems natural to draw an exemplar from the distribution in Equation 2 for P (E|C). However, this distribu- tion reflects the full range of exemplars that lie within the category boundaries, and presumably in the production task, subjects attempt to produce a prototypical exemplar. Consequently, we exclude the intrinsic category variance, vc , from Equation 2 in generating exemplars, leaving variance only via uncertainty about the prototype. Each model involved selection of various parameters and initial conditions. 
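A sketch of the two prototype representations and of exemplar generation for the production task follows (illustrative Python, not the authors' code). For brevity only the four target categories are represented, whereas the simulation also includes the OOBL and OOBH categories; as described above, the intrinsic variance vc is excluded when producing an exemplar.

```python
import numpy as np

def h_vectors(n_categories, distributed=True):
    """Measurement vectors h_c. Localist: one-hot rows. Distributed: rows
    (ordinal position, reverse ordinal position, 1), as in the text."""
    if not distributed:
        return np.eye(n_categories)
    n = np.arange(1, n_categories + 1, dtype=float)
    return np.column_stack([n, n_categories + 1 - n, np.ones(n_categories)])

def produce_exemplar(c, psi, sigma, h, rng=None):
    """Production-task response for target category c: sample from the prototype
    distribution only (variance h_c Sigma h_c^T), excluding the category variance v_c."""
    rng = np.random.default_rng() if rng is None else rng
    hc = h[c]
    return rng.normal(hc @ psi, np.sqrt(hc @ sigma @ hc))

# Distributed state psi = (ordinal weight, reverse-ordinal weight, offset):
# with psi0 = (0, 0, 3), every prototype starts at 3.0, as in the simulation.
h_dist = h_vectors(4, distributed=True)
psi0 = np.array([0.0, 0.0, 3.0])
print(h_dist @ psi0)                 # prototype locations implied by the state
print(produce_exemplar(1, psi0, 0.01 * np.eye(3), h_dist))
```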
We searched the parameter space by hand, attempting to find parameters that satisfied basic properties of the data: the accuracy and response variance in the first and second halves of the experiment. We report only parameters for the one model that was successful, MLGA-Distrib: εψ = 0.0075, εσ = 1.5 × 10⁻⁶ for off-diagonal terms and 1.5 × 10⁻⁷ for diagonal terms (the gradient for the diagonal terms was relatively steep), Σ0 = 0.01 I, and for all categories c, vc = 0.4².

4 Results

4.1 Push effect

The phenomenon that most clearly distinguishes the models is the push effect. The push effect is manifested in sequential-dependency functions, which plot the (relative) response on trial t as a function of the target on trial t − 1. As we explained using Figures 2a,b, the signature of the push effect is a negatively sloped line for each of the different trial t target categories. The sequential-dependency functions for the four models are presented in Figures 2c–f. KFU-Local (Figure 2c) produces a flat line, indicating no push whatsoever. The explanation for this result is straightforward: the Kalman filter update alters only the variable that is responsible for the measurement (exemplar) obtained on that trial. That variable is the prototype of the target class c, Dc. We thought the lack of an interaction among the category prototypes might be overcome with KFU-Distrib, because with a distributed prototype representation, all of the state variables jointly determine the target category prototype. However, our intuition turned out to be incorrect. We experimented with many different representations and parameter settings, but KFU-Distrib consistently obtained flat or shallow positive-sloping lines (Figure 2d). MLGA-Local (Figure 2e) obtains a push effect for neighboring classes, but not distant classes. For example, examining the dashed magenta line, note that B is pushed away by A and C, but is not affected by D. MLGA-Local maximizes the likelihood of the target category both by pulling the class-conditional density of the target category toward the exemplar and by pushing the class-conditional densities of the other categories away from the exemplar. However, if a category has little probability mass at the location of the exemplar, the increase in likelihood that results from pushing it further away is negligible, and consequently, so is the push effect. MLGA-Distrib obtains a lovely result (Figure 2f)—a negatively sloped line, diagnostic of the push effect. The effect magnitude matches that in the human data (Figure 2a), and captures the key property that the push effect increases with the ordinal distance of the categories. We did not build a mechanism into MLGA-Distrib to produce the push effect; it is somewhat of an emergent property of the model. The state representation of MLGA-Distrib has three components: d1, the weight of the ordinal position of a category prototype, d2, the weight of the reverse ordinal position, and d3, an offset. The last term, d3, cannot be responsible for a push effect, because it shifts all prototypes equally, and therefore can only produce a flat sequential dependency function. Figure 4 helps provide an intuition about how d1 and d2 work together to produce the push effect. Each graph shows the average movement of the category prototype (units on the y-axis are arbitrary) observed on trial t, for each of the four categories, following presentation of a given category on trial t − 1.
Positive values on the y axis indicate increases in the prototype (movement to the right in Figure 1), and negative values indicate decreases. Each solid vertical bar represents the movement of a given category prototype following a trial in which the exemplar is larger than its current prototype; each open vertical bar represents movement when the exemplar is to the left of its prototype. Notice that all category prototypes get larger or smaller on a given trial. But over the course of the experiment, the exemplar should be larger than the prototype as often as it is smaller, and the two shifts should sum together and partially cancel out. The result is the value indicated by the small horizontal bar along each line. The balance between the shifts in the two directions exactly corresponds to the push effect. Thus, the model produces a push-effect graph, but it is not truly producing a push effect as was originally conceived by the experimentalists. We are currently considering empirical consequences of this simulation result. Figure 5 shows a trial-by-trial trace from MLGA-Distrib.

[Figure 5: Trial-by-trial trace of MLGA-Distrib. (a) Exemplars generated on one run of the simulation; (b) the mean and (c) variance of the class prototype distribution for the 6 classes on one run; (d) mean proportion correct over 100 replications of the simulation; (e) push and pull effects, as measured by changes to the prototype means: the upper (green) curve is the pull of the target prototype mean toward the exemplar, and the lower (red) curve is the push of the nontarget prototype means away from the exemplar, over 100 replications; (f) category posterior of the generated exemplar over 100 replications, reflecting gradient ascent in the posterior.]

4.2 Other phenomena accounted for

MLGA-Distrib captures the other phenomena we listed at the outset of this paper. Like all of the other models, MLGA-Distrib readily produces a pull effect, which is shown in the movement of category prototypes in Figure 5e. More observably, a pull effect is manifested when responses on two successive trials of the same category are positively correlated: when trial t − 1 is to the left of the true category prototype, trial t is likely to be to the left as well. In the human data, the correlation coefficient over the experiment is 0.524; in the model, the coefficient is 0.496. The explanation for the pull effect is apparent: moving the category prototype to the exemplar increases the category likelihood. Although many learning effects in humans are based on error feedback, the experimental studies showed that push and pull effects occur even in the absence of errors, as they do in MLGA-Distrib. The model simply assumes that the target category it used to generate an exemplar is the correct category when no feedback to the contrary is provided. As long as the likelihood gradient is nonzero, category prototypes will be shifted. Pull and push effects shrink over the course of the experiment in human studies, as they do in the simulation. Figure 5e shows a reduction in both pull and push, as measured by the shift of the prototype means toward or away from the exemplar.
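The push effect can be quantified with a simple sequential-dependency analysis; the sketch below (illustrative Python, not the authors' analysis code) converts responses to deviations from their category means, groups them by the previous trial's target, and fits a regression slope. It is a simplified version of the analysis described in the text: it averages over current targets and does not restrict to trials following a correct response.

```python
import numpy as np

def push_slope(responses, targets, category_means, category_width):
    """Mean normalized response deviation on trial t, grouped by the target on
    trial t-1, and the slope of a line fit across previous-target ordinal position.
    A negative slope indicates a push effect."""
    responses = np.asarray(responses, dtype=float)
    targets = np.asarray(targets)
    deviation = (responses - category_means[targets]) / category_width
    prev, dev_t = targets[:-1], deviation[1:]
    positions = np.arange(len(category_means))
    mean_dev = np.array([dev_t[prev == p].mean() for p in positions])
    slope = np.polyfit(positions, mean_dev, 1)[0]
    return mean_dev, slope

# Synthetic check: with no sequential structure the slope should be near zero.
rng = np.random.default_rng(0)
targets = rng.integers(0, 4, size=480)
means = np.array([1.5, 2.5, 3.5, 4.5])
responses = rng.normal(means[targets], 0.3)
print(push_slope(responses, targets, means, category_width=1.0)[1])
```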
We measured the slope of MLGA-Distrib's push function (Figure 2f) for trials in the first and second half of the simulation. The slope dropped from −0.042 to −0.025, as one would expect from Figure 5e. (These slopes are obtained by combining responses from 100 replications of the simulation. Consequently, each point on the push function was an average over 6000 trials, and therefore the regression slopes are highly reliable.) A quantitative, observable measure of pull is the standard deviation (SD) of responses. As push and pull effects diminish, SDs should decrease. In human subjects, the response SDs in the first and second half of the experiment are 0.43 and 0.33, respectively. In the simulation, the response SDs are 0.51 and 0.38. This shrinkage reflects the fact that the model is approaching a local optimum in log likelihood, causing gradients—and learning steps—to become smaller. Not all model parameter settings lead to shrinkage; as in any gradient-based algorithm, step sizes that are too large do not lead to convergence. However, such parameter settings make little sense in the context of the learning objective.

4.3 Model predictions

MLGA-Distrib produces greater pull of the target category toward the exemplar than push of the neighboring categories away from the exemplar. In the simulation, the magnitude of the target pull—measured by the movement of the prototype mean—is 0.105, contrasted with the neighbor push, which is 0.017. After observing this robust result in the simulation, we found pertinent experimental data. Using the categorization paradigm, Zotov et al. (2003) found that if the exemplar on trial t is near a category border, subjects are more likely to produce an error if the category on trial t − 1 is repeated (i.e., a pull effect just took place) than if the previous trial is of the neighboring category (i.e., a push effect), even when the distance between exemplars on t − 1 and t is matched. The greater probability of error translates to a greater magnitude of pull than push. The experimental studies noted a phenomenon termed snap back. If the same target category is presented on successive trials, and an error is made on the first trial, subjects perform very accurately on the second trial, i.e., they generate an exemplar near the true category prototype. It appears as if subjects, realizing they have been slacking, reawaken and snap the category prototype back to where it belongs. We tested the model, but observed a sort of anti snap back. If the model made an error on the first trial, the mean deviation was larger—not smaller—on the second trial: 0.40 versus 0.32. Thus, MLGA-Distrib fails to explain this phenomenon. However, the phenomenon is not inconsistent with the model. One might suppose that on an error trial, subjects become more attentive, and increased attention might correspond to a larger learning rate on an error trial, which should yield a more accurate response on the following trial. McLaren et al. (1995) studied a phenomenon in humans known as peak shift, in which subjects are trained to categorize unidimensional stimuli into one of two categories. Subjects are faster and more accurate when presented with exemplars far from the category boundary than those near the boundary. In fact, they respond more efficiently to far exemplars than they do to the category prototype. The results are characterized in terms of the prototype of one category being pushed away from the prototype of the other category.
It seems straightforward to explain these data in MLGA-Distrib as a type of long-term push effect.

5 Related Work and Conclusions

Stewart, Brown, and Chater (2002) proposed an account of categorization context effects in which responses are based solely on the relative difference between the previous and present exemplars. No representation of the category prototype is maintained. However, classification based solely on relative difference cannot account for a diminished bias effect as a function of experience. A long-term stable prototype representation, of the sort incorporated into our models, seems necessary. We considered four models in our investigation, and the fact that only one accounts for the experimental data suggests that the data are nontrivial. All four models have principled theoretical underpinnings, and the space they define may suggest other elegant frameworks for understanding mechanisms of category learning. The successful model, MLGA-Distrib, offers a deep insight into understanding multiple-category domains: category structure must be considered. MLGA-Distrib exploits knowledge available to subjects performing the task concerning the ordinal relationships among categories. A model without this knowledge, MLGA-Local, fails to explain the data. Thus, the interrelatedness of categories appears to provide a source of constraint that individuals use in learning about the structure of the world.

Acknowledgments

This research was supported by NSF BCS 0339103 and NSF CSE-SMA 0509521. Support for the second author comes from an NSERC fellowship.

References

Jones, M. N., & Mewhort, D. J. K. (2003). Sequential contrast and assimilation effects in categorization of perceptual stimuli. Poster presented at the 44th Meeting of the Psychonomic Society, Vancouver, B.C.
Maybeck, P. S. (1979). Stochastic models, estimation, and control, Volume I. Academic Press.
McLaren, I. P. L., et al. (1995). Prototype effects and peak shift in categorization. JEP:LMC, 21, 662–673.
Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. JEP:LMC, 28, 3–11.
Zotov, V., Jones, M. N., & Mewhort, D. J. K. (2003). Trial-to-trial representation shifts in categorization. Poster presented at the 13th Meeting of the Canadian Society for Brain, Behaviour, and Cognitive Science, Hamilton, Ontario.

4 0.52942723 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

Author: Danko Nikolić, Stefan Haeusler, Wolf Singer, Wolfgang Maass

Abstract: We use multi-electrode recordings from cat primary visual cortex and investigate whether a simple linear classifier can extract information about the presented stimuli. We find that information is extractable and that it even lasts for several hundred milliseconds after the stimulus has been removed. In a fast sequence of stimulus presentation, information about both new and old stimuli is present simultaneously and nonlinear relations between these stimuli can be extracted. These results suggest nonlinear properties of cortical representations. The important implications of these properties for the nonlinear brain theory are discussed.

5 0.51958895 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

Author: Joseph M. Kimmel, Richard M. Salter, Peter J. Thomas

Abstract: Chemical reaction networks by which individual cells gather and process information about their chemical environments have been dubbed “signal transduction” networks. Despite this suggestive terminology, there have been few attempts to analyze chemical signaling systems with the quantitative tools of information theory. Gradient sensing in the social amoeba Dictyostelium discoideum is a well characterized signal transduction system in which a cell estimates the direction of a source of diffusing chemoattractant molecules based on the spatiotemporal sequence of ligand-receptor binding events at the cell membrane. Using Monte Carlo techniques (MCell) we construct a simulation in which a collection of individual ligand particles undergoing Brownian diffusion in a three-dimensional volume interact with receptors on the surface of a static amoeboid cell. Adapting a method for estimation of spike train entropies described by Victor (originally due to Kozachenko and Leonenko), we estimate lower bounds on the mutual information between the transmitted signal (direction of ligand source) and the received signal (spatiotemporal pattern of receptor binding/unbinding events). Hence we provide a quantitative framework for addressing the question: how much could the cell know, and when could it know it? We show that the time course of the mutual information between the cell’s surface receptors and the (unknown) gradient direction is consistent with experimentally measured cellular response times. We find that the acquisition of directional information depends strongly on the time constant at which the intracellular response is filtered.

[Author notes: Current address: Computational Neuroscience Graduate Program, The University of Chicago. Oberlin Center for Computation and Modeling, http://occam.oberlin.edu/. To whom correspondence should be addressed: http://www.case.edu/artsci/math/thomas/thomas.html; Oberlin College Research Associate.]

1 Introduction: gradient sensing in eukaryotes

Biochemical signal transduction networks provide the computational machinery by which neurons, amoebae or other single cells sense and react to their chemical environments. The precision of this chemical sensing is limited by fluctuations inherent in reaction and diffusion processes involving a finite quantity of molecules [1, 2]. The theory of communication provides a framework that makes explicit the noise dependence of chemical signaling. For example, in any reaction A + B → C, we may view the time varying reactant concentrations A(t) and B(t) as input signals to a noisy channel, and the product concentration C(t) as an output signal carrying information about A(t) and B(t). In the present study we show that the mutual information between the (known) state of the cell’s surface receptors and the (unknown) gradient direction follows a time course consistent with experimentally measured cellular response times, reinforcing earlier claims that information theory can play a role in understanding biochemical cellular communication [3, 4]. Dictyostelium is a soil dwelling amoeba that aggregates into a multicellular form in order to survive conditions of drought or starvation. During aggregation individual amoebae perform chemotaxis, or chemically guided movement, towards sources of the signaling molecule cAMP, secreted by nearby amoebae.
Quantitative studies have shown that Dictyostelium amoebae can sense shallow, static gradients of cAMP over long time scales (∼30 minutes), and that gradient steepness plays a crucial role in guiding cells [5]. The chemotactic efficiency (CE), the population average of the cosine between the cell displacement directions and the true gradient direction, peaks at a cAMP concentration of 25 nanoMolar, similar to the equilibrium constant for the cAMP receptor (Keq is the concentration of cAMP at which the receptor is equally likely to be bound or unbound). For smaller or larger concentrations the CE dropped rapidly. Nevertheless, over long times cells were able (on average) to detect gradients as small as a 2% change in [cAMP] per cell length. At an early stage of development, when the pattern of chemotactic centers and spirals is still forming, individual amoebae presumably experience an inchoate barrage of weak, noisy and conflicting directional signals. When cAMP binds receptors on a cell’s surface, second messengers trigger a chain of subsequent intracellular events including a rapid spatial reorganization of proteins involved in cell motility. Advances in fluorescence microscopy have revealed that the oriented subcellular response to cAMP stimulation is already well underway within two seconds [6, 7]. In order to understand the fundamental limits to communication in this cell signaling process we abstract the problem faced by a cell to that of rapidly identifying the direction of origin of a stimulus gradient superimposed on an existing mean background concentration. We model gradient sensing as an information channel in which an input signal – the direction of a chemical source – is noisily transmitted via a gradient of diffusing signaling molecules; the “received signal” is the spatiotemporal pattern of binding events between cAMP and the cAMP receptors [8]. We neglect downstream intracellular events, which cannot increase the mutual information between the state of the cell and the direction of the imposed extracellular gradient [9]. The analysis of any signal transmission system depends on precise representation of the noise corrupting transmitted signals. We develop a Monte Carlo simulation (MCell, [10, 11]) in which a simulated cell is exposed to a cAMP distribution that evolves from a uniform background to a gradient at low (1 nMol) average concentration. The noise inherent in the communication of a diffusion-mediated signal is accurately represented by this method. Our approach bridges both the transient and the steady state regimes and allows us to estimate the amount of stimulus-related information that is in principle available to the cell through its receptors as a function of time after stimulus initiation. Other efforts to address aspects of cell signaling using the conceptual tools of information theory have considered neurotransmitter release [3] and sensing temporal signals [4], but not gradient sensing in eukaryotic cells. A typical natural habitat for social amoebae such as Dictyostelium is the complex anisotropic three-dimensional matrix of the forest floor. Under experimental conditions cells typically aggregate on a flat two-dimensional surface. We approach the problem of gradient sensing on a sphere, which is both harder and more natural for the amoeba, while still simple enough for us to treat analytically and numerically.
Directional data is naturally described using unit vectors in spherical coordinates, but the amoebae receive signals as binding events involving intramembrane protein complexes, so we have developed a method for projecting the ensemble of receptor bindings onto coordinates in R3. In loose analogy with the chemotactic efficiency [5], we compare the projected directional estimate with the true gradient direction represented as a unit vector on S2. Consistent with the observed timing of the cell’s response to cAMP stimulation, we find that the directional signal converges quickly enough for the cell to make a decision about which direction to move within the first two seconds following stimulus onset.

2 Methods

2.1 Monte Carlo simulations

Using MCell and DReAMM [10, 11] we construct a spherical cell (radius R = 7.5 µm [12]) centered in a cubic volume (side length L = 30 µm). N = 980 triangular tiles partition the surface (mesh generated by DOME, http://nwg.phy.bnl.gov/∼bviren/uno/other/); each contained one cell surface receptor for cAMP with binding rate k+ = 4.4 × 10⁷ sec⁻¹ M⁻¹, first-order cAMP unbinding rate k− = 1.1 sec⁻¹ [12], and Keq = k−/k+ = 25 nMol cAMP. We established a baseline concentration of approximately 1 nMol by releasing a cAMP bolus at time 0 inside the cube with zero-flux boundary conditions imposed on each wall. At t = 2 seconds we introduced a steady flux at the x = −L/2 wall of 1 molecule of cAMP per square micron per msec, adding signaling molecules from the left. Simultaneously, the x = +L/2 wall of the cube assumes absorbing boundary conditions. The new boundary conditions lead (at equilibrium) to a linear gradient of 2 nMol/30 µm, ranging from ≈ 2.0 nMol at the flux source wall to ≈ 0 nMol at the absorbing wall (see Figure 1); the concentration profile approaches this new steady state with a time constant of approximately 1.25 msec. Sampling boxes centered along the planes x = ±13.5 µm measured the local concentration, allowing us to validate the expected model behavior.

Figure 1: Gradient sensing simulations performed with MCell (a Monte Carlo simulator of cellular microphysiology, http://www.mcell.cnl.salk.edu/) and rendered with DReAMM (Design, Render, and Animate MCell Models, http://www.mcell.psc.edu/). The model cell comprised a sphere triangulated with 980 tiles with one cAMP receptor per tile. Cell radius R = 7.5 µm; cube side L = 30 µm. Left: Initial equilibrium condition, before imposition of gradient; [cAMP] ≈ 1 nMol (c. 15,000 molecules in the volume outside the sphere). Right: Gradient condition after transient (c. 15,000 molecules; see Methods for details).

2.2 Analysis

2.2.1 Assumptions

We make the following assumptions to simplify the analysis of the distribution of receptor activities at equilibrium, whether pre- or post-stimulus onset:

1. Independence. At equilibrium, the state of each receptor (bound vs unbound) is independent of the states of the other receptors.

2. Linear Gradient. At equilibrium under the imposed gradient condition, the concentration of ligand molecule varies linearly with position along the gradient axis.

3. Symmetry.

(a) Rotational equivariance of receptor activities. In the absence of an applied gradient signal, the probability distribution describing the receptor states is equivariant with respect to arbitrary rotations of the sphere.

(b) Rotational invariance of gradient direction.
The imposed gradient seen by a model cell is equally likely to be coming from any direction; therefore the gradient direction vector is uniformly distributed over S2.

(c) Axial equivariance about the gradient direction. Once a gradient direction is imposed, the probability distribution describing receptor states is rotationally equivariant with respect to rotations about the axis parallel with the gradient.

Berg and Purcell [1] calculate the inaccuracy in concentration estimates due to nonindependence of adjacent receptors; for our parameters (effective receptor radius = 5 nm, receptor spacing ∼ 1 µm) the fractional error in estimating concentration differences due to receptor nonindependence is negligible (∼ 10⁻¹¹) [1, 2]. Because we fix receptors to be in 1:1 correspondence with surface tiles, spherical symmetry and uniform distribution of the receptors are only approximate. The gradient signal communicated via diffusion does not involve sharp spatial changes on the scale of the distance between nearby receptors, therefore spherical symmetry and uniform identical receptor distribution are good analytic approximations of the model configuration. By rotational equivariance we mean that combining any rotation of the sphere with a corresponding rotation of the indices labeling the N receptors, {j = 1, · · · , N}, leads to a statistically indistinguishable distribution of receptor activities. This same spherical symmetry is reflected in the a priori distribution of gradient directions, which is uniform over the sphere (with density 1/4π). Spherical symmetry is broken by the gradient signal, which fixes a preferred direction in space. About this axis, however, we assume the system retains the rotational symmetry of the cylinder.

2.2.2 Mutual information of the receptors

In order to quantify the directional information available to the cell from its surface receptors we construct an explicit model for the receptor states and the cell’s estimated direction. We model the receptor states via a collection of random variables {Bj} and develop an expression for the entropy of {Bj}. Then in section 2.2.3 we present a method for projecting a temporally filtered estimated direction, ĝ, into three (rather than N) dimensions.

Let the random variables {Bj}, j = 1, · · · , N, represent the states of the N cAMP receptors on the cell surface; Bj = 1 if the receptor is bound to a molecule of cAMP, otherwise Bj = 0. Let xj ∈ S2 represent the direction from the center of the cell to the j-th receptor. Invoking assumption 2 above, we take the equilibrium concentration of cAMP at x to be c(x|g) = a + b(x · g), where g ∈ S2 is a unit vector in the direction of the gradient. The parameter a is the mean concentration over the cell surface, and b = R|∇c| is half the drop in concentration from one extreme on the cell surface to the other. Before the stimulus begins, the gradient direction is undefined. It can be shown (see Supplemental Materials) that the entropy of receptor states given a fixed gradient direction g, H[{Bj}|g], is given by an integral over the sphere:

H[{Bj} | g] ∼ N ∫_{θ=0}^{π} ∫_{φ=0}^{2π} Φ[(a + b cos θ)/(a + b cos θ + Keq)] (sin θ / 4π) dφ dθ   (as N → ∞).   (1)

On the other hand, if the gradient direction remains unspecified, the entropy of receptor states is given by

H[{Bj}] ∼ N Φ[ ∫_{θ=0}^{π} ∫_{φ=0}^{2π} ((a + b cos θ)/(a + b cos θ + Keq)) (sin θ / 4π) dφ dθ ]   (as N → ∞),   (2)

where Φ[p] = −(p log2(p) + (1 − p) log2(1 − p)) for 0 < p < 1, and Φ[p] = 0 for p = 0 or 1, denotes the entropy of a binary random variable with state probabilities p and (1 − p).
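Equations (1) and (2) are easy to evaluate numerically; the sketch below (illustrative Python, not the authors' code) computes the mutual information as their difference. With the parameter values quoted in the next paragraph (a = 1.078 nMol, b = 0.512 nMol, Keq = 25 nMol, N = 980) it should give a value close to the 2.16 bits reported there.

```python
import numpy as np
from scipy.integrate import quad

def phi(p):
    """Binary entropy in bits, with Phi[0] = Phi[1] = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def receptor_mutual_information(a, b, keq, n_receptors):
    """MI between gradient direction and receptor states: equation (2) minus equation (1).
    The azimuthal integral contributes a factor of 2*pi, leaving sin(theta)/2 as the weight."""
    p = lambda th: (a + b * np.cos(th)) / (a + b * np.cos(th) + keq)
    avg_phi, _ = quad(lambda th: phi(p(th)) * np.sin(th) / 2.0, 0.0, np.pi)   # equation (1) / N
    avg_p, _ = quad(lambda th: p(th) * np.sin(th) / 2.0, 0.0, np.pi)          # inner average in (2)
    return n_receptors * (phi(avg_p) - avg_phi)

print(receptor_mutual_information(a=1.078, b=0.512, keq=25.0, n_receptors=980))
```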
In both equations (1) and (2), the argument of Φ is a probability taking values 0 ≤ p ≤ 1. In (1) the values of Φ are averaged over the sphere; in (2) Φ is evaluated after averaging probabilities. Because Φ[p] is concave for 0 ≤ p ≤ 1, the integral in equation 1 cannot exceed that in equation 2. Therefore the mutual information upon receiving the signal is nonnegative (as expected):

MI[{Bj}; g] ≡ H[{Bj}] − H[{Bj}|g] ≥ 0.

The analytic solution for equation (1) involves the polylogarithm function. For the parameters shown in the simulation (a = 1.078 nMol, b = 0.512 nMol, Keq = 25 nMol), the mutual information with 980 receptors is 2.16 bits. As one would expect, the mutual information peaks when the mean concentration is close to the Keq of the receptor, exceeding 16 bits when a = 25, b = 12.5 and Keq = 25 (nMol).

2.2.3 Dimension reduction

The estimate obtained above does not tell us how quickly the directional information available to the cell evolves over time. Direct estimation of the mutual information from stochastic simulations is impractical because the aggregate random variables occupy a 980-dimensional space that a limited number of simulation runs cannot sample adequately. Instead, we construct a deterministic function from the set of 980 time courses of the receptors, {Bj(t)}, to an aggregate directional estimate in R3. Because of the cylindrical symmetry inherent in the system, our directional estimator ĝ is an unbiased estimator of the true gradient direction g. The estimator ĝ(t) may be thought of as representing a downstream chemical process that accumulates directional information and decays with some time constant τ. Let {xj}, j = 1, · · · , N, be the spatial locations of the N receptors on the cell’s surface. Each vector is associated with a weight wj. Whenever the j-th receptor binds a cAMP molecule, wj is incremented by one; otherwise wj decays with time constant τ. We construct an instantaneous estimate of the gradient direction from the linear combination of receptor positions,

ĝτ(t) = Σ_{j=1}^{N} wj(t) xj.

This procedure reflects the accumulation and reabsorption of intracellular second messengers released from the cell membrane upon receptor binding. Before the stimulus is applied, the weighted directional estimates ĝτ are small in absolute magnitude, with direction uniformly distributed on S2. In order to determine the information gained as the estimate vector evolves after stimulus application, we wish to determine the change in entropy in an ensemble of such estimates. As the cell gains information about the direction of the gradient signal from its receptors, the entropy of the estimate should decrease, leading to a rise in mutual information. By repeating multiple runs (M = 600) of the simulation we obtain samples from the ensemble of direction estimates, given a particular stimulus direction, g. In the method of Kozachenko and Leonenko [13], adapted for the analysis of neural spike train data by Victor [14] (“KLV method”), the cumulative distribution function is approximated directly from the observed samples, and the entropy is estimated via a change of variables transformation (see below). This method may be formulated in vector spaces Rd for d > 1 ([13]), but it is not guaranteed to be unbiased in the multivariate case [15] and has not been extended to curved manifolds such as the sphere.
In the present case, however, we may exploit the symmetries inherent in the model (Assumptions 3a-3c) to reduce the empirical entropy estimation problem to one dimension. Adapting the argument in [14] to the case of spherical data from a distribution with rotational symmetry about a given axis, we obtain an estimate of the entropy based on a series of observations of the angles {θ1, · · · , θM} between the estimates ĝτ and the true gradient direction g (for details, see Supplemental Materials):

H ∼ (1/M) Σ_{k=1}^{M} [ log2(λk) + log2(2(M − 1)) + γ/loge(2) + log2(2π) + log2(sin θk) ]   (as M → ∞),   (3)

where, after sorting the θk in monotonic order, λk ≡ min(|θk − θk±1|) is the distance between each angle and its nearest neighbor in the sample, and γ is the Euler–Mascheroni constant. As shown in Figure 2, this approximation agrees with the analytic result for the uniform distribution, Hunif = log2(4π) ≈ 3.651.

[Figure 2: Monte Carlo simulation results and information analysis. A: Average concentration profiles along two planes perpendicular to the gradient, at x = ±13.5 µm. B: Estimated direction vector (x, y, and z components; x = dark blue trace) ĝτ, τ = 500 msec. C: Entropy of the ensemble of directional vector estimates for different values of the intracellular filtering time constant τ. Given the directions of the estimates θk, φk on each of M runs, we calculate the entropy of the ensemble using equation (3). All time constants yield uniformly distributed directional estimates in the pre-stimulus period, 0 ≤ t ≤ 2 (sec). After stimulus onset, directional estimates obtained with shorter time constants respond more quickly but achieve smaller gains in mutual information (smaller reductions in entropy). Filtering time constants τ range from lightest to darkest colors: 20, 50, 100, 200, 500, 1000, 2000 msec.]

3 Results

Figure 3 shows the results of M = 600 simulation runs. Panel A shows the concentration averaged across a set of 1 µm³ sample boxes, four in the x = −13.5 µm plane and four in the x = +13.5 µm plane. The initial bolus of cAMP released into the volume at t = 0 sec is not uniformly distributed, but spreads out evenly within 0.25 sec. At t = 2.0 sec the boundary conditions are changed, causing a gradient to emerge along a realistic time course. Consistent with the analytic solution for the mean concentration (not shown), the concentration approaches equilibrium more rapidly near the absorbing wall (descending trace) than at the imposed flux wall (ascending trace). Panel B shows the evolution of a directional estimate vector ĝτ for a single run, with τ = 500 msec. During uniform conditions all vectors fluctuate near the origin. After gradient onset the variance increases and the x component (dark trace) becomes biased towards the gradient source (g = [−1, 0, 0]) while the y and z components still have a mean of zero. Across all 600 runs the mean of the y and z components remains close to zero, while the mean of the x component systematically departs from zero shortly after stimulus onset (not shown). Hence the directional estimator is unbiased (as required by symmetry). See Supplemental Materials for the population average of ĝ. Panel C shows the time course of the entropy of the ensemble of normalized directional estimate vectors ĝτ/|ĝτ| over M = 600 simulations, for intracellular filtering time constants ranging from 20 msec to 2000 msec (light to dark shading), calculated using equation (3).
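The entropy trace in Panel C can be reproduced from the sampled angles with a direct implementation of equation (3); the sketch below (illustrative Python) treats each endpoint angle's single neighbor as its nearest-neighbor spacing. Applied to uniformly distributed directions it should return roughly log2(4π) ≈ 3.65 bits.

```python
import numpy as np

def klv_entropy_bits(theta):
    """Entropy estimate (bits) per equation (3), given the angles theta between each
    run's direction estimate and the true gradient axis."""
    theta = np.sort(np.asarray(theta, dtype=float))
    m = len(theta)
    gaps = np.diff(theta)
    lam = np.empty(m)                       # nearest-neighbor spacing lambda_k
    lam[0], lam[-1] = gaps[0], gaps[-1]
    lam[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    gamma = 0.5772156649015329              # Euler-Mascheroni constant
    terms = (np.log2(lam) + np.log2(2 * (m - 1)) + gamma / np.log(2)
             + np.log2(2 * np.pi) + np.log2(np.sin(theta)))
    return terms.mean()

# Check against the uniform distribution on the sphere (cos(theta) uniform on [-1, 1]).
rng = np.random.default_rng(1)
theta_uniform = np.arccos(1.0 - 2.0 * rng.random(600))
print(klv_entropy_bits(theta_uniform))      # expected to be near 3.65
```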
Following stimulus onset, entropy decreases steadily, showing an increase in information available to the amoeba about the direction of the stimulus; the mutual information at a given point in time is the difference between the entropy at that time and before stimulus onset. For a cell with roughly 1000 receptors the mutual information has increased at most by ∼2 bits of information by one second (for τ = 500 msec), and at most by ∼3 bits of information by two seconds (for τ = 1000 or 2000 msec), under our stimulation protocol. A one bit reduction in uncertainty is equivalent to identifying the correct value of the x component (positive versus negative) when the stimulus direction is aligned along the x-axis. Alternatively, note that a one bit reduction results in going from the uniform distribution on the sphere to the uniform distribution on one hemisphere. For τ ≤ 100 msec, the weighted average with decay time τ never gains more than one bit of information about the stimulus direction, even at long times. This observation suggests that signaling must involve some chemical components with lifetimes longer than 100 msec. The τ = 200 msec filter saturates after about one second, at ∼1 bit of information gain. Longer-lived second messengers would respond more slowly to changes from the background stimulus distribution, but would provide better, more informative estimates over time. The τ = 500 msec estimate gains roughly two bits of information within 1.5 seconds, but not much more over time. Heuristically, we may think of a two bit gain in information as corresponding to the change from a uniform distribution to one uniformly covering one quarter of S2, i.e., all points within π/3 of the true direction. Within two seconds the τ = 1000 msec and τ = 2000 msec weighted averages have each gained approximately three bits of information, equivalent to a uniform distribution covering all points within 0.23π, or 41°, of the true direction.

4 Discussion & conclusions

Clearly there is an opportunity for more precise control of experimental conditions to deepen our understanding of spatio-temporal information processing at the membranes of gradient-sensitive cells. Efforts in this direction are now using microfluidic technology to create carefully regulated spatial profiles for probing cellular responses [16]. Our results suggest that molecular processes relevant to these responses must have lasting effects ≥ 100 msec. We use a static, immobile cell. Could cell motion relative to the medium increase sensitivity to changes in the gradient? No: the Dictyostelium velocity required to affect concentration perception is on the order of 1 cm sec⁻¹ [1], whereas reported velocities are on the order of µm sec⁻¹ [5]. The chemotactic response mechanism is known to begin modifying the cell membrane on the edge facing up the gradient within two seconds after stimulus initiation [7, 6], suggesting that the cell strikes a balance between gathering data and deciding quickly. Indeed, our results show that the reported activation of the G-protein signaling system on the leading edge of a chemotactically responsive cell [7] rises at roughly the same rate as the available chemotactic information. Results such as these ([7, 6]) are obtained by introducing a pipette into the medium near the amoeba; the magnitude and time course of cAMP release are not precisely known, and when estimated, the cAMP concentration at the cell surface exceeds 25 nMol by a full order of magnitude.
Thomson and Kristan [17] show that for discrete probability distributions and for continuous distributions over linear spaces, stimulus discriminability may be better quantified using ideal observer analysis (mean squared error, for continuous variables) than information theory. The machinery of mean squared error (variance, expectation) do not carry over to the case of directional data without fundamental modifications [18]; in particular the notion of mean squared error is best represented by the mean resultant length 0 ≤ ρ ≤ 1, the expected length of the vector average of a collection of unit vectors representing samples from directional data. A resultant with length ρ ≈ 1 corresponds to a highly focused probability density function on the sphere. In addition to measuring the mutual information between the gradient direction and an intracellular estimate of direction, we also calculated the time evolution of ρ (see Supplemental Materials.) We find that ρ rapidly approaches 1 and can exceed 0.9, depending on τ . We found that in this case at least the behavior of the mean resultant length and the mutual information are very similar; there is no evidence of discrepancies of the sort described in [17]. We have shown that the mutual information between an arbitrarily oriented stimulus and the directional signal available at the cell’s receptors evolves with a time course consistent with observed reaction times of Dictyostelium amoeba. Our results reinforce earlier claims that information theory can play a role in understanding biochemical cellular communication. Acknowledgments MCell simulations were run on the Oberlin College Beowulf Cluster, supported by NSF grant CHE0420717. References [1] Howard C. Berg and Edward M. Purcell. Physics of chemoreception. Biophysical Journal, 20:193, 1977. [2] William Bialek and Sima Setayeshgar. Physical limits to biochemical signaling. PNAS, 102(29):10040– 10045, July 19 2005. [3] S. Qazi, A. Beltukov, and B.A. Trimmer. Simulation modeling of ligand receptor interactions at nonequilibrium conditions: processing of noisy inputs by ionotropic receptors. Math Biosci., 187(1):93–110, Jan 2004. [4] D. J. Spencer, S. K. Hampton, P. Park, J. P. Zurkus, and P. J. Thomas. The diffusion-limited biochemical signal-relay channel. In S. Thrun, L. Saul, and B. Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16. MIT Press, Cambridge, MA, 2004. [5] P.R. Fisher, R. Merkl, and G. Gerisch. Quantitative analysis of cell motility and chemotaxis in Dictyostelium discoideum by using an image processing system and a novel chemotaxis chamber providing stationary chemical gradients. J. Cell Biology, 108:973–984, March 1989. [6] Carole A. Parent, Brenda J. Blacklock, Wendy M. Froehlich, Douglas B. Murphy, and Peter N. Devreotes. G protein signaling events are activated at the leading edge of chemotactic cells. Cell, 95:81–91, 2 October 1998. [7] Xuehua Xu, Martin Meier-Schellersheim, Xuanmao Jiao, Lauren E. Nelson, and Tian Jin. Quantitative imaging of single live cells reveals spatiotemporal dynamics of multistep signaling events of chemoattractant gradient sensing in dictyostelium. Molecular Biology of the Cell, 16:676–688, February 2005. [8] Jan Wouter-Rappel, Peter. J Thomas, Herbert Levine, and William F. Loomis. Establishing direction during chemotaxis in eukaryotic cells. Biophys. J., 83:1361–1367, 2002. [9] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, New York, 1990. [10] J. R. Stiles, D. Van Helden, T. M. Bartol, E.E. 
We have shown that the mutual information between an arbitrarily oriented stimulus and the directional signal available at the cell's receptors evolves with a time course consistent with observed reaction times of the Dictyostelium amoeba. Our results reinforce earlier claims that information theory can play a role in understanding biochemical cellular communication.

Acknowledgments

MCell simulations were run on the Oberlin College Beowulf Cluster, supported by NSF grant CHE0420717.

References

[1] Howard C. Berg and Edward M. Purcell. Physics of chemoreception. Biophysical Journal, 20:193, 1977.
[2] William Bialek and Sima Setayeshgar. Physical limits to biochemical signaling. PNAS, 102(29):10040–10045, July 19 2005.
[3] S. Qazi, A. Beltukov, and B. A. Trimmer. Simulation modeling of ligand receptor interactions at nonequilibrium conditions: processing of noisy inputs by ionotropic receptors. Math. Biosci., 187(1):93–110, Jan 2004.
[4] D. J. Spencer, S. K. Hampton, P. Park, J. P. Zurkus, and P. J. Thomas. The diffusion-limited biochemical signal-relay channel. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[5] P. R. Fisher, R. Merkl, and G. Gerisch. Quantitative analysis of cell motility and chemotaxis in Dictyostelium discoideum by using an image processing system and a novel chemotaxis chamber providing stationary chemical gradients. J. Cell Biology, 108:973–984, March 1989.
[6] Carole A. Parent, Brenda J. Blacklock, Wendy M. Froehlich, Douglas B. Murphy, and Peter N. Devreotes. G protein signaling events are activated at the leading edge of chemotactic cells. Cell, 95:81–91, 2 October 1998.
[7] Xuehua Xu, Martin Meier-Schellersheim, Xuanmao Jiao, Lauren E. Nelson, and Tian Jin. Quantitative imaging of single live cells reveals spatiotemporal dynamics of multistep signaling events of chemoattractant gradient sensing in Dictyostelium. Molecular Biology of the Cell, 16:676–688, February 2005.
[8] Wouter-Jan Rappel, Peter J. Thomas, Herbert Levine, and William F. Loomis. Establishing direction during chemotaxis in eukaryotic cells. Biophys. J., 83:1361–1367, 2002.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, New York, 1990.
[10] J. R. Stiles, D. Van Helden, T. M. Bartol, E. E. Salpeter, and M. M. Salpeter. Miniature endplate current rise times less than 100 microseconds from improved dual recordings can be modeled with passive acetylcholine diffusion from a synaptic vesicle. Proc. Natl. Acad. Sci. U.S.A., 93(12):5747–52, Jun 11 1996.
[11] J. R. Stiles and T. M. Bartol. Monte Carlo methods for realistic simulation of synaptic microphysiology using MCell. In Computational Neuroscience: Realistic Modeling for Experimentalists, pages 87–127. CRC Press, Boca Raton, FL, 2001.
[12] M. Ueda, Y. Sako, T. Tanaka, P. Devreotes, and T. Yanagida. Single-molecule analysis of chemotactic signaling in Dictyostelium cells. Science, 294:864–867, October 2001.
[13] L. F. Kozachenko and N. N. Leonenko. Probl. Peredachi Inf. [Probl. Inf. Transm.], 23(9):95, 1987.
[14] Jonathan D. Victor. Binless strategies for estimation of information from neural data. Physical Review E, 66:051903, Nov 11 2002.
[15] Marc M. Van Hulle. Edgeworth approximation of multivariate differential entropy. Neural Computation, 17:1903–1910, 2005.
[16] Loling Song, Sharvari M. Nadkarni, Hendrik U. Bödeker, Carsten Beta, Albert Bae, Carl Franck, Wouter-Jan Rappel, William F. Loomis, and Eberhard Bodenschatz. Dictyostelium discoideum chemotaxis: Threshold for directed motion. Euro. J. Cell Bio., 85(9-10):981–9, 2006.
[17] Eric E. Thomson and William B. Kristan. Quantifying stimulus discriminability: A comparison of information theory and ideal observer analysis. Neural Computation, 17:741–778, 2005.
[18] Kanti V. Mardia and Peter E. Jupp. Directional Statistics. John Wiley & Sons, West Sussex, England, 2000.

6 0.50000799 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

7 0.49140215 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

8 0.4507511 128 nips-2006-Manifold Denoising

9 0.42870092 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

10 0.42217934 41 nips-2006-Bayesian Ensemble Learning

11 0.41077161 113 nips-2006-Learning Structural Equation Models for fMRI

12 0.40375042 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

13 0.40007648 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

14 0.38191637 121 nips-2006-Learning to be Bayesian without Supervision

15 0.35327753 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

16 0.35004941 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

17 0.34503466 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

18 0.34351689 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

19 0.34111288 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

20 0.33807874 98 nips-2006-Inferring Network Structure from Co-Occurrences


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.079), (3, 0.014), (7, 0.066), (9, 0.033), (20, 0.061), (22, 0.055), (44, 0.084), (57, 0.072), (65, 0.034), (69, 0.024), (71, 0.017), (90, 0.386)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83423948 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

2 0.83042914 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species-specific. Conditional Random Fields (CRFs), which directly model the conditional probability P(y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1

3 0.4760772 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

Author: Yi Mao, Guy Lebanon

Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is well-suited to handle this problem. Using the Möbius transform, we express the model as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization. 1

4 0.46331546 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

Author: Wenye Li, Kin-hong Lee, Kwong-sak Leung

Abstract: Kernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model’s complexity. Based on the representer theorem, the solution consists of a linear combination of translates of a kernel. This paper investigates a generalized form of representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space by a specific way of constructing kernels, we proposed a new algorithm by utilizing a generalized regularizer which leaves part of the space unregularized. Using a squared-loss function in calculating the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations have confirmed the effectiveness of the algorithm for supervised learning tasks.

5 0.44916752 174 nips-2006-Similarity by Composition

Author: Oren Boiman, Michal Irani

Abstract: We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2 . Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2 . This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2 . “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.) We show a few such examples. 1

6 0.43316975 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

7 0.43166438 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

8 0.43128991 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

9 0.42813075 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

10 0.4245404 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

11 0.42103428 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

12 0.41965047 115 nips-2006-Learning annotated hierarchies from relational data

13 0.41956159 192 nips-2006-Theory and Dynamics of Perceptual Bistability

14 0.4190529 53 nips-2006-Combining causal and similarity-based reasoning

15 0.41888091 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

16 0.41857985 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

17 0.41759047 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

18 0.41662925 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

19 0.41306311 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

20 0.4117988 16 nips-2006-A Theory of Retinal Population Coding