nips nips2012 nips2012-77 knowledge-graph by maker-knowledge-mining

77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models


Source: pdf

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . , T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. [sent-8, score-0.43]

2 However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. [sent-10, score-0.157]

3 We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. [sent-13, score-0.422]

4 We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). [sent-14, score-0.616]

5 We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. [sent-15, score-0.396]

6 This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. [sent-16, score-0.439]

7 The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). [sent-25, score-0.299]

8 When this is the case, the precise form of the neural variability informs the nature of the neural code. [sent-26, score-0.194]

9 When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i. [sent-28, score-0.235]

10 This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. [sent-31, score-0.215]

11 Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. [sent-32, score-0.545]

12 Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. [sent-34, score-0.45]

13 As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. [sent-36, score-0.337]

14 In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. [sent-38, score-0.406]

15 As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). [sent-40, score-1.055]

16 In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. [sent-41, score-0.484]

17 We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. [sent-42, score-0.436]

18 In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. [sent-43, score-0.452]

19 This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. [sent-44, score-0.219]

20 In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. [sent-46, score-0.189]

21 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. [sent-47, score-0.209]

22 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. [sent-48, score-0.147]

23 Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). [sent-50, score-0.398]

24 This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i. [sent-51, score-0.154]

25 If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . [sent-63, score-0.369]

26 This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. [sent-65, score-0.25]

27 This is one way to build a neural network implementation of the VB algorithm. [sent-69, score-0.305]

28 In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. [sent-73, score-0.38]

29 coffee) in isolation results in a pattern of neural activity (top). [sent-79, score-0.299]

30 When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). [sent-80, score-0.58]

31 (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. [sent-81, score-0.345]

32 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. [sent-82, score-0.18]

33 A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. [sent-84, score-0.46]

34 Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. [sent-86, score-0.62]

35 Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). [sent-87, score-0.902]

36 In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. [sent-88, score-0.623]

37 The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. [sent-89, score-0.717]

38 Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. [sent-90, score-0.514]

39 Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. [sent-91, score-0.437]

40 Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. [sent-92, score-0.826]

41 LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. [sent-94, score-0.222]

42 The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. [sent-101, score-0.173]

43 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. [sent-104, score-0.287]

44 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. [sent-105, score-0.742]

45 For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. [sent-106, score-0.186]

46 , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. [sent-123, score-0.246]

47 , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . [sent-127, score-0.394]

48 Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. [sent-132, score-0.158]

49 (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. [sent-133, score-0.238]

50 Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. [sent-137, score-0.413]

51 These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. [sent-138, score-0.262]

52 If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). [sent-139, score-0.256]

53 Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. [sent-141, score-0.251]

54 Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . [sent-145, score-0.206]

55 Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. [sent-154, score-0.425]

56 1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. [sent-160, score-0.894]

57 A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. [sent-161, score-0.211]

58 Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. [sent-162, score-0.544]

59 Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. [sent-166, score-0.319]

60 The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . [sent-167, score-0.574]

61 The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. [sent-168, score-0.388]

62 This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. [sent-169, score-0.571]

63 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . [sent-172, score-0.255]

64 Regardless, this network has a variety of interesting properties that align well with biology. [sent-173, score-0.144]

65 It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. [sent-174, score-0.619]

66 Alternatively, this could be implemented via recurrent excitation at the population level. [sent-178, score-0.149]

67 In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. [sent-179, score-0.494]

68 In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. [sent-187, score-0.144]

69 Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. [sent-188, score-0.473]

70 Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. [sent-189, score-0.444]

71 intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. [sent-190, score-0.594]

72 This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. [sent-191, score-0.439]

73 We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). [sent-192, score-0.685]

74 , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. [sent-199, score-0.155]

75 If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). [sent-208, score-0.321]

76 A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. [sent-209, score-0.651]

77 In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. [sent-210, score-0.248]

78 Thus, the patterns of neural activity which linearly encode them must do so as well. [sent-211, score-0.293]

79 When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. [sent-212, score-0.325]

80 As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. [sent-213, score-0.146]

81 As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. [sent-214, score-0.247]

82 Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. [sent-215, score-0.498]

83 1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. [sent-258, score-0.315]

84 (Right) Three example traces for cause intensity in the DDM. [sent-261, score-0.206]

85 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. [sent-263, score-0.355]

86 This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. [sent-264, score-0.258]

87 For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. [sent-266, score-0.403]

88 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). [sent-268, score-0.365]

89 This agreement, however, is not perfect, especially when posterior predicted concentrations are low. [sent-269, score-0.241]

90 In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. [sent-270, score-0.379]

91 Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. [sent-272, score-0.259]

92 In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. [sent-273, score-0.185]

93 As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. [sent-277, score-0.184]

94 We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. [sent-282, score-0.379]

95 We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. [sent-283, score-0.522]

96 that of inferring the latent causes of complex mixtures of spike trains in the layer below. [sent-286, score-0.547]

97 Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. [sent-287, score-0.924]

98 Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. [sent-288, score-0.459]

99 Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. [sent-289, score-0.24]

100 Decoding the activity of neuronal populations in macaque primary visual cortex. [sent-341, score-0.204]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vbem', 0.376), ('ppc', 0.337), ('spike', 0.188), ('lda', 0.181), ('demixing', 0.18), ('odor', 0.18), ('activity', 0.172), ('divisive', 0.155), ('spikes', 0.155), ('olfactory', 0.155), ('network', 0.144), ('concentrations', 0.139), ('ddm', 0.139), ('causes', 0.126), ('topic', 0.124), ('cause', 0.124), ('neurons', 0.113), ('odors', 0.112), ('latent', 0.104), ('posterior', 0.102), ('neuron', 0.101), ('qz', 0.099), ('population', 0.096), ('probabilistic', 0.083), ('intensity', 0.082), ('vb', 0.08), ('dirichlet', 0.079), ('rj', 0.076), ('targeted', 0.075), ('neural', 0.074), ('neuroscience', 0.073), ('variational', 0.073), ('document', 0.071), ('biologically', 0.07), ('topics', 0.069), ('intensities', 0.067), ('normalization', 0.066), ('inference', 0.064), ('faced', 0.063), ('synapses', 0.063), ('word', 0.062), ('layer', 0.06), ('coffee', 0.059), ('stimulus', 0.059), ('beck', 0.053), ('pattern', 0.053), ('inhibition', 0.053), ('cortex', 0.053), ('recurrent', 0.053), ('codes', 0.052), ('soma', 0.052), ('rn', 0.05), ('bayesian', 0.05), ('generative', 0.05), ('exp', 0.049), ('latham', 0.049), ('documents', 0.049), ('capable', 0.048), ('patterns', 0.047), ('build', 0.046), ('nature', 0.046), ('plausible', 0.046), ('behaviourally', 0.045), ('dendritically', 0.045), ('hebb', 0.045), ('leak', 0.045), ('orn', 0.045), ('orns', 0.045), ('implementing', 0.041), ('implementation', 0.041), ('scene', 0.041), ('multinomial', 0.041), ('dt', 0.041), ('dynamics', 0.04), ('inhibitory', 0.04), ('curiously', 0.04), ('divisively', 0.04), ('operations', 0.039), ('alexandre', 0.038), ('assignment', 0.037), ('predicts', 0.037), ('foraging', 0.037), ('fz', 0.037), ('nij', 0.037), ('qc', 0.037), ('superlinear', 0.037), ('volatile', 0.037), ('dynamic', 0.037), ('complex', 0.036), ('circuits', 0.036), ('represents', 0.035), ('behavioural', 0.034), ('implements', 0.034), ('infer', 0.033), ('mixtures', 0.033), ('sensorimotor', 0.033), ('representable', 0.033), ('zm', 0.033), ('rule', 0.032), ('visual', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . , T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

2 0.23887701 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

Abstract: How can neural networks learn to represent information optimally? We answer this question by deriving spiking dynamics and learning dynamics directly from a measure of network performance. We find that a network of integrate-and-fire neurons undergoing Hebbian plasticity can learn an optimal spike-based representation for a linear decoder. The learning rule acts to minimise the membrane potential magnitude, which can be interpreted as a representation error after learning. In this way, learning reduces the representation error and drives the network into a robust, balanced regime. The network becomes balanced because small representation errors correspond to small membrane potentials, which in turn results from a balance of excitation and inhibition. The representation is robust because neurons become self-correcting, only spiking if the representation error exceeds a threshold. Altogether, these results suggest that several observed features of cortical dynamics, such as excitatory-inhibitory balance, integrate-and-fire dynamics and Hebbian plasticity, are signatures of a robust, optimal spike-based code. A central question in neuroscience is to understand how populations of neurons represent information and how they learn to do so. Usually, learning and information representation are treated as two different functions. From the outset, this separation seems like a good idea, as it reduces the problem into two smaller, more manageable chunks. Our approach, however, is to study these together. This allows us to treat learning and information representation as two sides of a single mechanism, operating at two different timescales. Experimental work has given us several clues about the regime in which real networks operate in the brain. Some of the most prominent observations are: (a) high trial-to-trial variability—a neuron responds differently to repeated, identical inputs [1, 2]; (b) asynchronous firing at the network level—spike trains of different neurons are at most very weakly correlated [3, 4, 5]; (c) tight balance of excitation and inhibition—every excitatory input is met by an inhibitory input of equal or greater size [6, 7, 8] and (4) spike-timing-dependent plasticity (STDP)—the strength of synapses change as a function of presynaptic and postsynaptic spike times [9]. Previously, it has been shown that observations (a)–(c) can be understood as signatures of an optimal, spike-based code [10, 11]. The essential idea is to derive spiking dynamics from the assumption that neurons only fire if their spike improves information representation. Information in a network may ∗ Authors contributed equally 1 originate from several possible sources: external sensory input, external neural network input, or alternatively, it may originate within the network itself as a memory, or as a computation. Whatever the source, this initial assumption leads directly to the conclusion that a network of integrate-and-fire neurons can optimally represent a signal while exhibiting properties (a)–(c). A major problem with this framework is that network connectivity must be completely specified a priori, and requires the tuning of N 2 parameters, where N is the number of neurons in the network. Although this is feasible mathematically, it is unclear how a real network could tune itself into this optimal regime. In this work, we solve this problem using a simple synaptic learning rule. The key insight is that the plasticity rule can be derived from the same basic principle as the spiking rule in the earlier work—namely, that any change should improve information representation. Surprisingly, this can be achieved with a local, Hebbian learning rule, where synaptic plasticity is proportional to the product of presynaptic firing rates with post-synaptic membrane potentials. Spiking and synaptic plasticity then work hand in hand towards the same goal: the spiking of a neuron decreases the representation error on a fast time scale, thereby giving rise to the actual population representation; synaptic plasticity decreases the representation error on a slower time scale, thereby improving or maintaining the population representation. For a large set of initial connectivities and spiking dynamics, neural networks are driven into a balanced regime, where excitation and inhibition cancel each other and where spike trains are asynchronous and irregular. Furthermore, the learning rule that we derive reproduces the main features of STDP (property (d) above). In this way, a network can learn to represent information optimally, with synaptic, neural and network dynamics consistent with those observed experimentally. 1 Derivation of the learning rule for a single neuron We begin by deriving a learning rule for a single neuron with an autapse (a self-connection) (Fig. 1A). Our approach is to derive synaptic dynamics for the autapse and spiking dynamics for the neuron such that the neuron learns to optimally represent a time-varying input signal. We will derive a learning rule for networks of neurons later, after we have developed the fundamental concepts for the single neuron case. Our first step is to derive optimal spiking dynamics for the neuron, so that we have a target for our learning rule. We do this by making two simple assumptions [11]. First, we assume that the neuron can provide an estimate or read-out x(t) of a time-dependent signal x(t) by filtering its spike train ˆ o(t) as follows: ˙ x(t) = −ˆ(t) + Γo(t), ˆ x (1) where Γ is a fixed read-out weight, which we will refer to as the neuron’s “output kernel” and the spike train can be written as o(t) = i δ(t − ti ), where {ti } are the spike times. Next, we assume that the neuron only produces a spike if that spike improves the read-out, where we measure the read-out performance through a simple squared-error loss function: 2 L(t) = x(t) − x(t) . ˆ (2) With these two assumptions, we can now derive optimal spiking dynamics. First, we observe that if the neuron produces an additional spike at time t, the read-out increases by Γ, and the loss function becomes L(t|spike) = (x(t) − (x(t) + Γ))2 . This allows us to restate our spiking rule as follows: ˆ the neuron should only produce a spike if L(t|no spike) > L(t|spike), or (x(t) − x(t))2 > (x(t) − ˆ (x(t) + Γ))2 . Now, squaring both sides of this inequality, defining V (t) ≡ Γ(x(t) − x(t)) and ˆ ˆ defining T ≡ Γ2 /2 we find that the neuron should only spike if: V (t) > T. (3) We interpret V (t) to be the membrane potential of the neuron, and we interpret T as the spike threshold. This interpretation allows us to understand the membrane potential functionally: the voltage is proportional to a prediction error—the difference between the read-out x(t) and the actual ˆ signal x(t). A spike is an error reduction mechanism—the neuron only spikes if the error exceeds the spike threshold. This is a greedy minimisation, in that the neuron fires a spike whenever that action decreases L(t) without considering the future impact of that spike. Importantly, the neuron does not require direct access to the loss function L(t). 2 To determine the membrane potential dynamics, we take the derivative of the voltage, which gives ˙ ˙ us V = Γ(x − x). (Here, and in the following, we will drop the time index for notational brevity.) ˙ ˆ ˙ Now, using Eqn. (1) we obtain V = Γx − Γ(−x + Γo) = −Γ(x − x) + Γ(x + x) − Γ2 o, so that: ˙ ˆ ˆ ˙ ˙ V = −V + Γc − Γ2 o, (4) where c = x + x is the neural input. This corresponds exactly to the dynamics of a leaky integrate˙ and-fire neuron with an inhibitory autapse1 of strength Γ2 , and a feedforward connection strength Γ. The dynamics and connectivity guarantee that a neuron spikes at just the right times to optimise the loss function (Fig. 1B). In addition, it is especially robust to noise of different forms, because of its error-correcting nature. If x is constant in time, the voltage will rise up to the threshold T at which point a spike is fired, adding a delta function to the spike train o at time t, thereby producing a read-out x that is closer to x and causing an instantaneous drop in the voltage through the autapse, ˆ by an amount Γ2 = 2T , effectively resetting the voltage to V = −T . We now have a target for learning—we know the connection strength that a neuron must have at the end of learning if it is to represent information optimally, for a linear read-out. We can use this target to derive synaptic dynamics that can learn an optimal representation from experience. Specifically, we consider an integrate-and-fire neuron with some arbitrary autapse strength ω. The dynamics of this neuron are given by ˙ V = −V + Γc − ωo. (5) This neuron will not produce the correct spike train for representing x through a linear read-out (Eqn. (1)) unless ω = Γ2 . Our goal is to derive a dynamical equation for the synapse ω so that the spike train becomes optimal. We do this by quantifying the loss that we are incurring by using the suboptimal strength, and then deriving a learning rule that minimises this loss with respect to ω. The loss function underlying the spiking dynamics determined by Eqn. (5) can be found by reversing the previous membrane potential analysis. First, we integrate the differential equation for V , assuming that ω changes on time scales much slower than the membrane potential. We obtain the following (formal) solution: V = Γx − ω¯, o (6) ˙ where o is determined by o = −¯ + o. The solution to this latter equation is o = h ∗ o, a convolution ¯ ¯ o ¯ of the spike train with the exponential kernel h(τ ) = θ(τ ) exp(−τ ). As such, it is analogous to the instantaneous firing rate of the neuron. Now, using Eqn. (6), and rewriting the read-out as x = Γ¯, we obtain the loss incurred by the ˆ o sub-optimal neuron, L = (x − x)2 = ˆ 1 V 2 + 2(ω − Γ2 )¯ + (ω − Γ2 )2 o2 . o ¯ Γ2 (7) We observe that the last two terms of Eqn. (7) will vanish whenever ω = Γ2 , i.e., when the optimal reset has been found. We can therefore simplify the problem by defining an alternative loss function, 1 2 V , (8) 2 which has the same minimum as the original loss (V = 0 or x = x, compare Eqn. (2)), but yields a ˆ simpler learning algorithm. We can now calculate how changes to ω affect LV : LV = ∂LV ∂V ∂o ¯ =V = −V o − V ω ¯ . (9) ∂ω ∂ω ∂ω We can ignore the last term in this equation (as we will show below). Finally, using simple gradient descent, we obtain a simple Hebbian-like synaptic plasticity rule: τω = − ˙ ∂LV = V o, ¯ ∂ω (10) where τ is the learning time constant. 1 This contribution of the autapse can also be interpreted as the reset of an integrate-and-fire neuron. Later, when we generalise to networks of neurons, we shall employ this interpretation. 3 This synaptic learning rule is capable of learning the synaptic weight ω that minimises the difference between x and x (Fig. 1B). During learning, the synaptic weight changes in proportion to the postˆ synaptic voltage V and the pre-synaptic firing rate o (Fig. 1C). As such, this is a Hebbian learning ¯ rule. Of course, in this single neuron case, the pre-synaptic neuron and post-synaptic neuron are the same neuron. The synaptic weight gradually approaches its optimal value Γ2 . However, it never completely stabilises, because learning never stops as long as neurons are spiking. Instead, the synapse oscillates closely about the optimal value (Fig. 1D). This is also a “greedy” learning rule, similar to the spiking rule, in that it seeks to minimise the error at each instant in time, without regard for the future impact of those changes. To demonstrate that the second term in Eqn. (5) can be neglected we note that the equations for V , o, and ω define a system ¯ of coupled differential equations that can be solved analytically by integrating between spikes. This results in a simple recurrence relation for changes in ω from the ith to the (i + 1)th spike, ωi+1 = ωi + ωi (ωi − 2T ) . τ (T − Γc − ωi ) (11) This iterative equation has a single stable fixed point at ω = 2T = Γ2 , proving that the neuron’s autaptic weight or reset will approach the optimal solution. 2 Learning in a homogeneous network We now generalise our learning rule derivation to a network of N identical, homogeneously connected neurons. This generalisation is reasonably straightforward because many characteristics of the single neuron case are shared by a network of identical neurons. We will return to the more general case of heterogeneously connected neurons in the next section. We begin by deriving optimal spiking dynamics, as in the single neuron case. This provides a target for learning, which we can then use to derive synaptic dynamics. As before, we want our network to produce spikes that optimally represent a variable x for a linear read-out. We assume that the read-out x is provided by summing and filtering the spike trains of all the neurons in the network: ˆ ˙ x = −ˆ + Γo, ˆ x (12) 2 where the row vector Γ = (Γ, . . . , Γ) contains the read-out weights of the neurons and the column vector o = (o1 , . . . , oN ) their spike trains. Here, we have used identical read-out weights for each neuron, because this indirectly leads to homogeneous connectivity, as we will demonstrate. Next, we assume that a neuron only spikes if that spike reduces a loss-function. This spiking rule is similar to the single neuron spiking rule except that this time there is some ambiguity about which neuron should spike to represent a signal. Indeed, there are many different spike patterns that provide exactly the same estimate x. For example, one neuron could fire regularly at a high rate (exactly like ˆ our previous single neuron example) while all others are silent. To avoid this firing rate ambiguity, we use a modified loss function, that selects amongst all equivalent solutions, those with the smallest neural firing rates. We do this by adding a ‘metabolic cost’ term to our loss function, so that high firing rates are penalised: ¯ L = (x − x)2 + µ o 2 , ˆ (13) where µ is a small positive constant that controls the cost-accuracy trade-off, akin to a regularisation parameter. Each neuron in the optimal network will seek to reduce this loss function by firing a spike. Specifically, the ith neuron will spike whenever L(no spike in i) > L(spike in i). This leads to the following spiking rule for the ith neuron: Vi > Ti (14) where Vi ≡ Γ(x − x) − µoi and Ti ≡ Γ2 /2 + µ/2. We can naturally interpret Vi as the membrane ˆ potential of the ith neuron and Ti as the spiking threshold of that neuron. As before, we can now derive membrane potential dynamics: ˙ V = −V + ΓT c − (ΓT Γ + µI)o, 2 (15) The read-out weights must scale as Γ ∼ 1/N so that firing rates are not unrealistically small in large networks. We can see this by calculating the average firing rate N oi /N ≈ x/(ΓN ) ∼ O(N/N ) ∼ O(1). i=1 ¯ 4 where I is the identity matrix and ΓT Γ + µI is the network connectivity. We can interpret the selfconnection terms {Γ2 +µ} as voltage resets that decrease the voltage of any neuron that spikes. This optimal network is equivalent to a network of identical integrate-and-fire neurons with homogeneous inhibitory connectivity. The network has some interesting dynamical properties. The voltages of all the neurons are largely synchronous, all increasing to the spiking threshold at about the same time3 (Fig. 1F). Nonetheless, neural spiking is asynchronous. The first neuron to spike will reset itself by Γ2 + µ, and it will inhibit all the other neurons in the network by Γ2 . This mechanism prevents neurons from spik- x 3 The first neuron to spike will be random if there is some membrane potential noise. V (A) (B) x x ˆ x 10 1 0.1 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 1 D 0.5 V V 0 ˆ x V ˆ x (C) 1 0 1 2 0 0.625 25 25.625 (D) start of learning 1 V 50 200.625 400 400.625 1 2.4 O 1.78 ω 1.77 25 neuron$ 0 1 2 !me$ 3 4 25 1 5 V 400.625 !me$ (F) 25 1 2.35 1.05 1.049 400 25.625 !me$ (E) neuron$ 100.625 200 end of learning 1.4 1.35 ω 100 !me$ 1 V 1 O 50.625 0 1 2 !me$ 3 4 5 V !me$ !me$ Figure 1: Learning in a single neuron and a homogeneous network. (A) A single neuron represents an input signal x by producing an output x. (B) During learning, the single neuron output x (solid red ˆ ˆ line, top panel) converges towards the input x (blue). Similarly, for a homogeneous network the output x (dashed red line, top panel) converges towards x. Connectivity also converges towards optimal ˆ connectivity in both the single neuron case (solid black line, middle panel) and the homogeneous net2 2 work case (dashed black line, middle panel), as quantified by D = maxi,j ( Ωij − Ωopt / Ωopt ) ij ij at each point in time. Consequently, the membrane potential reset (bottom panel) converges towards the optimal reset (green line, bottom panel). Spikes are indicated by blue vertical marks, and are produced when the membrane potential reaches threshold (bottom panel). Here, we have rescaled time, as indicated, for clarity. (C) Our learning rule dictates that the autapse ω in our single neuron (bottom panel) changes in proportion to the membrane potential (top panel) and the firing rate (middle panel). (D) At the end of learning, the reset ω fluctuates weakly about the optimal value. (E) For a homogeneous network, neurons spike regularly at the start of learning, as shown in this raster plot. Membrane potentials of different neurons are weakly correlated. (F) At the end of learning, spiking is very irregular and membrane potentials become more synchronous. 5 ing synchronously. The population as a whole acts similarly to the single neuron in our previous example. Each neuron fires regularly, even if a different neuron fires in every integration cycle. The design of this optimal network requires the tuning of N (N − 1) synaptic parameters. How can an arbitrary network of integrate-and-fire neurons learn this optimum? As before, we address this question by using the optimal network as a target for learning. We start with an arbitrarily connected network of integrate-and-fire neurons: ˙ V = −V + ΓT c − Ωo, (16) where Ω is a matrix of connectivity weights, which includes the resets of the individual neurons. Assuming that learning occurs on a slow time scale, we can rewrite this equation as V = ΓT x − Ω¯ . o (17) Now, repeating the arguments from the single neuron derivation, we modify the loss function to obtain an online learning rule. Specifically, we set LV = V 2 /2, and calculate the gradient: ∂LV = ∂Ωij Vk k ∂Vk =− ∂Ωij Vk δki oj − ¯ k Vk Ωkl kl ∂ ol ¯ . ∂Ωij (18) We can simplify this equation considerably by observing that the contribution of the second summation is largely averaged out under a wide variety of realistic conditions4 . Therefore, it can be neglected, and we obtain the following local learning rule: ∂LV ˙ = V i oj . ¯ τ Ωij = − ∂Ωij (19) This is a Hebbian plasticity rule, whereby connectivity changes in proportion to the presynaptic firing rate oj and post-synaptic membrane potential Vi . We assume that the neural thresholds are set ¯ to a constant T and that the neural resets are set to their optimal values −T . In the previous section we demonstrated that these resets can be obtained by a Hebbian plasticity rule (Eqn. (10)). This learning rule minimises the difference between the read-out and the signal, by approaching the optimal recurrent connection strengths for the network (Fig. 1B). As in the single neuron case, learning does not stop, so the connection strengths fluctuate close to their optimal value. During learning, network activity becomes progressively more asynchronous as it progresses towards optimal connectivity (Fig. 1E, F). 3 Learning in the general case Now that we have developed the fundamental concepts underlying our learning rule, we can derive a learning rule for the more general case of a network of N arbitrarily connected leaky integrateand-fire neurons. Our goal is to understand how such networks can learn to optimally represent a ˙ J-dimensional signal x = (x1 , . . . , xJ ), using the read-out equation x = −x + Γo. We consider a network with the following membrane potential dynamics: ˙ V = −V + ΓT c − Ωo, (20) where c is a J-dimensional input. We assume that this input is related to the signal according to ˙ c = x + x. This assumption can be relaxed by treating the input as the control for an arbitrary linear dynamical system, in which case the signal represented by the network is the output of such a computation [11]. However, this further generalisation is beyond the scope of this work. As before, we need to identify the optimal recurrent connectivity so that we have a target for learning. Most generally, the optimal recurrent connectivity is Ωopt ≡ ΓT Γ + µI. The output kernels of the individual neurons, Γi , are given by the rows of Γ, and their spiking thresholds by Ti ≡ Γi 2 /2 + 4 From the definition of the membrane potential we can see that Vk ∼ O(1/N ) because Γ ∼ 1/N . Therefore, the size of the first term in Eqn. (18) is k Vk δki oj = Vi oj ∼ O(1/N ). Therefore, the second term can ¯ ¯ be ignored if kl Vk Ωkl ∂ ol /∂Ωij ¯ O(1/N ). This happens if Ωkl O(1/N 2 ) as at the start of learning. It also happens towards the end of learning if the terms {Ωkl ∂ ol /∂Ωij } are weakly correlated with zero mean, ¯ or if the membrane potentials {Vi } are weakly correlated with zero mean. 6 µ/2. With these connections and thresholds, we find that a network of integrate-and-fire neurons ˆ ¯ will produce spike trains in such a way that the loss function L = x − x 2 + µ o 2 is minimised, ˆ where the read-out is given by x = Γ¯ . We can show this by prescribing a greedy5 spike rule: o a spike is fired by neuron i whenever L(no spike in i) > L(spike in i) [11]. The resulting spike generation rule is Vi > Ti , (21) ˆ where Vi ≡ ΓT (x − x) − µ¯i is interpreted as the membrane potential. o i 5 Despite being greedy, this spiking rule can generate firing rates that are practically identical to the optimal solutions: we checked this numerically in a large ensemble of networks with randomly chosen kernels. (A) x1 … x … 1 1 (B) xJJ x 10 L 10 T T 10 4 6 8 1 Viii V D ˆˆ ˆˆ x11 xJJ x x F 0.5 0 0.4 … … 0.2 0 0 2000 4000 !me   (C) x V V 1 x 10 x 3 ˆ x 8 0 x 10 1 2 3 !me   4 5 4 0 1 4 0 1 8 V (F) Ρ(Δt)   E-­‐I  input   0.4 ˆ x 0 3 0 1 x 10 1.3 0.95 x 10 ˆ x 4 V (E) 1 x 0 end of learning 50 neuron neuron 50 !me   2 0 ˆ x 0 0.5 ISI  Δt     1 2 !me   4 5 4 1.5 1.32 3 2 0.1 Ρ(Δt)   x E-­‐I  input   (D) start of learning 0 2 !me   0 0 0.5 ISI  Δt   1 Figure 2: Learning in a heterogeneous network. (A) A network of neurons represents an input ˆ signal x by producing an output x. (B) During learning, the loss L decreases (top panel). The difference between the connection strengths and the optimal strengths also decreases (middle panel), as 2 2 quantified by the mean difference (solid line), given by D = Ω − Ωopt / Ωopt and the maxi2 2 mum difference (dashed line), given by maxi,j ( Ωij − Ωopt / Ωopt ). The mean population firing ij ij rate (solid line, bottom panel) also converges towards the optimal firing rate (dashed line, bottom panel). (C, E) Before learning, a raster plot of population spiking shows that neurons produce bursts ˆ of spikes (upper panel). The network output x (red line, middle panel) fails to represent x (blue line, middle panel). The excitatory input (red, bottom left panel) and inhibitory input (green, bottom left panel) to a randomly selected neuron is not tightly balanced. Furthermore, a histogram of interspike intervals shows that spiking activity is not Poisson, as indicated by the red line that represents a best-fit exponential distribution. (D, F) At the end of learning, spiking activity is irregular and ˆ Poisson-like, excitatory and inhibitory input is tightly balanced and x matches x. 7 How can we learn this optimal connection matrix? As before, we can derive a learning rule by minimising the cost function LV = V 2 /2. This leads to a Hebbian learning rule with the same form as before: ˙ τ Ωij = Vi oj . ¯ (22) Again, we assume that the neural resets are given by −Ti . Furthermore, in order for this learning rule to work, we must assume that the network input explores all possible directions in the J-dimensional input space (since the kernels Γi can point in any of these directions). The learning performance does not critically depend on how the input variable space is sampled as long as the exploration is extensive. In our simulations, we randomly sample the input c from a Gaussian white noise distribution at every time step for the entire duration of the learning. We find that this learning rule decreases the loss function L, thereby approaching optimal network connectivity and producing optimal firing rates for our linear decoder (Fig. 2B). In this example, we have chosen connectivity that is initially much too weak at the start of learning. Consequently, the initial network behaviour is similar to a collection of unconnected single neurons that ignore each other. Spike trains are not Poisson-like, firing rates are excessively large, excitatory and inhibitory ˆ input is unbalanced and the decoded variable x is highly unreliable (Fig. 2C, E). As a result of learning, the network becomes tightly balanced and the spike trains become asynchronous, irregular and Poisson-like with much lower rates (Fig. 2D, F). However, despite this apparent variability, the population representation is extremely precise, only limited by the the metabolic cost and the discrete nature of a spike. This learnt representation is far more precise than a rate code with independent Poisson spike trains [11]. In particular, shuffling the spike trains in response to identical inputs drastically degrades this precision. 4 Conclusions and Discussion In population coding, large trial-to-trial spike train variability is usually interpreted as noise [2]. We show here that a deterministic network of leaky integrate-and-fire neurons with a simple Hebbian plasticity rule can self-organise into a regime where information is represented far more precisely than in noisy rate codes, while appearing to have noisy Poisson-like spiking dynamics. Our learning rule (Eqn. (22)) has the basic properties of STDP. Specifically, a presynaptic spike occurring immediately before a post-synaptic spike will potentiate a synapse, because membrane potentials are positive immediately before a postsynaptic spike. Furthermore, a presynaptic spike occurring immediately after a post-synaptic spike will depress a synapse, because membrane potentials are always negative immediately after a postsynaptic spike. This is similar in spirit to the STDP rule proposed in [12], but different to classical STDP, which depends on post-synaptic spike times [9]. This learning rule can also be understood as a mechanism for generating a tight balance between excitatory and inhibitory input. We can see this by observing that membrane potentials after learning can be interpreted as representation errors (projected onto the read-out kernels). Therefore, learning acts to minimise the magnitude of membrane potentials. Excitatory and inhibitory input must be balanced if membrane potentials are small, so we can equate balance with optimal information representation. Previous work has shown that the balanced regime produces (quasi-)chaotic network dynamics, thereby accounting for much observed cortical spike train variability [13, 14, 4]. Moreover, the STDP rule has been known to produce a balanced regime [16, 17]. Additionally, recent theoretical studies have suggested that the balanced regime plays an integral role in network computation [15, 13]. In this work, we have connected these mechanisms and functions, to conclude that learning this balance is equivalent to the development of an optimal spike-based population code, and that this learning can be achieved using a simple Hebbian learning rule. Acknowledgements We are grateful for generous funding from the Emmy-Noether grant of the Deutsche Forschungsgemeinschaft (CKM) and the Chaire d’excellence of the Agence National de la Recherche (CKM, DB), as well as a James Mcdonnell Foundation Award (SD) and EU grants BACS FP6-IST-027140, BIND MECT-CT-20095-024831, and ERC FP7-PREDSPIKE (SD). 8 References [1] Tolhurst D, Movshon J, and Dean A (1982) The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res 23: 775–785. [2] Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci 18(10): 3870–3896. [3] Zohary E, Newsome WT (1994) Correlated neuronal discharge rate and its implication for psychophysical performance. Nature 370: 140–143. [4] Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, & Harris, KD (2010) The asynchronous state in cortical circuits. Science 327, 587–590. [5] Ecker AS, Berens P, Keliris GA, Bethge M, Logothetis NK, Tolias AS (2010) Decorrelated neuronal firing in cortical microcircuits. Science 327: 584–587. [6] Okun M, Lampl I (2008) Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activities. Nat Neurosci 11, 535–537. [7] Shu Y, Hasenstaub A, McCormick DA (2003) Turning on and off recurrent balanced cortical activity. Nature 423, 288–293. [8] Gentet LJ, Avermann M, Matyas F, Staiger JF, Petersen CCH (2010) Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron 65: 422–435. [9] Caporale N, Dan Y (2008) Spike-timing-dependent plasticity: a Hebbian learning rule. Annu Rev Neurosci 31: 25–46. [10] Boerlin M, Deneve S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7, e1001080. [11] Boerlin M, Machens CK, Deneve S (2012) Predictive coding of dynamic variables in balanced spiking networks. under review. [12] Clopath C, B¨ sing L, Vasilaki E, Gerstner W (2010) Connectivity reflects coding: a model of u voltage-based STDP with homeostasis. Nat Neurosci 13(3): 344–352. [13] van Vreeswijk C, Sompolinsky H (1998) Chaotic balanced state in a model of cortical circuits. Neural Comput 10(6): 1321–1371. [14] Brunel N (2000) Dynamics of sparsely connected networks of excitatory and inhibitory neurons. J Comput Neurosci 8, 183–208. [15] Vogels TP, Rajan K, Abbott LF (2005) Neural network dynamics. Annu Rev Neurosci 28: 357–376. [16] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W. (2011) Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science 334(6062):1569– 73. [17] Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timingdependent synaptic plasticity. Nat Neurosci 3(9): 919–926. 9

3 0.19473962 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

4 0.16680025 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spikegeneration mechanism is an order of magnitude greater than the cut -off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter. The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with the cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11]. Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering or that the high frequency harmonics are so strong in the pre-synaptic spike that despite attenuation, their trace is still present. Although in some cases, these explanations could apply, for the majority of synaptic inputs to typical neurons there is a glaring mismatch. The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all- or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, socalled over-sampling. Although the spike generation mechanism operates in continuous time, the high temporal precision of the spikegeneration mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal. Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15]. Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synatpic currents (black), whose amplitude reflects synaptic weights, thus converting digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note one order of magnitude gap between the cut off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal. In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode incoming analog current waveform in the digital spike train [13]. However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic  modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising. 2 . Oversampling and noise-shaping in AD converters To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A. Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error a Poisson encoder can sample at frequencies, fs , higher than Nyquist, fN – hence, the term oversampling, Figure 2B. When combined with decoding by lowpass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows. The number of samples over a Nyquist half-period (1/2fN) is given by the oversampling ratio: . As the normalized signal amplitude, , stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, x. For a Poisson process the variance in the number of spikes is equal to the mean, . Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period: , (1) indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error. Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is encoding error. B. Digital output of oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in oversampled case. D. Basic  modulator. E. Signal at the output of the integrator. F. Digital output of the  modulator over one Nyquist period. G. Error power spectrum of the  modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for Poisson encoder. To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noiseshaping was put forward [17]. To illustrate noise-shaping consider a basic AD converter called  [18], Figure 2D. In the basic  modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time. Rather than quantizing the input signal, as would be done in the Poisson encoder,  modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic  modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most, , which is better than the Poisson encoder. To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range resulting in the lower error power below Nyquist. Second, a more detailed analysis of the basic  modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of the white quantization noise, such analysis yields: , (2) which for R >> 1 is significantly better performance than for the Poisson encoder, Eq.(1). As mentioned previously, the basic  modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation. In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the -IF analogy, we look for the signs of noise-shaping AD conversion in real neurons. 3 . Experimental evidence of noise-shaping AD conversion in real neurons In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron. Testing the noise-shaping AD conversion hypothesis is complicated by the fact that encoded and decoded signals are hard to measure accurately. First, as somatic current is rectified by the spikegeneration mechanism, only its super-threshold component can be encoded faithfully making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings. In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At the frequencies below 20Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter. Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron. 4 somatic current biological neuron error Poisson encoder error I&F; neuron error 10 1 10 0 Spectral power, a.u. Spectral power, a.u. 10 3 10 -1 10 -2 10 -3 10 2 10 -4 10 0 10 20 30 40 50 60 Frequency [Hz] 70 80 90 0 10 20 30 40 50 60 70 80 90 100 Frequency [Hz] Figure 3. Evidence of noise-shaping. Power spectra of the somatic current (blue), difference between the somatic current and the digital spike train of the biological neuron (black), of the Poisson encoder (green) and of the IF neuron (red). Left: datset 1, right: dataset 3. Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error. Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in suprathreshold regime, i.e. with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification for now we focused on neurons which were in suprathreshold regime by monitoring that the relationship between predicted and actual current is close to linear. 2 10 C D 1 10 somatic current biological neuron error Poisson encoder error Spectral power, a.u. Spectral power, a.u. I&F; neuron error 3 10 0 10 -1 10 -2 10 -3 10 2 10 -4 0 10 20 30 40 50 60 Frequency [Hz] 70 80 90 10 0 10 20 30 40 50 60 70 80 90 100 Frequency [Hz] Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of somatic current (blue), resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A, is representative of other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra of the somatic current (blue), the decdoding error of the biological neuron (black), the Poisson encoder (green), and IF neuron (red) for dataset 1 (C) dataset 3 (D). Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed information capacity of different encoders: where S(f) and N(f) are the power spectra of the somatic current and encoding error correspondingly and the sum is computed only over the frequencies for which S(f) > N(f). Because the plots in Figure 4C,D use semi-logrithmic scale, the information capacity can be estimated from the area between a somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has higher information capacity than the Poisson encoder and IF neurons. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons. We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27]. The predictive nature of the encoder can be captured by a  modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter with most of its weight in the time following a spike, Figure 5A right. Of course such prediction is only possible for correlated inputs implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters , thus avoiding overloading. A possible biological implementation for such integrating feedback could be Ca2+ 2+ concentration and Ca dependent potassium channels [25, 29]. Figure 5. Enhanced  modulators. A.  modulator combined with predictive coder. In such device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to experimental measurements, Figure 4B. B. Second-order  modulator possesses stronger noise-shaping properties. Because such circuit contains an internal state variable it generates a non-periodic spike train in response to a constant input. Bottom trace shows a typical result of a simulation. Black – spikes, blue – input current. 4 . Possible reasons for current rectification: energy efficiency and de-noising We have shown that at high firing rates biological neurons encode somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising. It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32] placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error in the decoded spike train relative to the input signal and the total number of spikes, as expressed by the following cost function over a time interval T: , (3) where x is the analog input signal, s is the binary spike sequence composed of zeros and ones, and is the linear filter. To demonstrate how solving Eq.(3) would lead to thresholding, let us consider a simplified version taken over a Nyquist period, during which the input signal stays constant: (4) where and normalized by w. Minimizing such a cost function reduces to choosing the lowest lying parabola for a given , Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq.(3). In addition to energy efficiency, there may be a computational reason for thresholding somatic current in neurons. To illustrate this point, we note that the cost function in Eq. (3) for continuous variables, st, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such cost function can be minimized by iteratively applying a gradient descent and a shrinkage steps [35], which is equivalent to thresholding (one-sided in case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal. Figure 6. Possible reasons for rectification in neurons. A. Cost function combining encoding error squared with metabolic expense vs. input signal for different values of the spike number N, Eq.(4). Note that the optimal number of spikes jumps from zero to one as a function of input. B. Estimating most probable “clean” signal value for continuous non-negative Laplacian signal and Gaussian noise, Eq.(3) (while setting w = 1). The parabolas (red) illustrate the quadratic loglikelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s , and grows linearly with s, if s >. 5 . Di scu ssi on In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or basic , we suggest that neurons should be modeled by more advanced  modulators, e.g. Figure 5B. Interestingly,  modulators can be also viewed as coders with error prediction feedback [19]. Many publications studied various aspects of spike generation in neurons yet we believe that the framework [13-15] we adopt is different and discuss its relationship to some of the studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with a more practical issue of how reconstruction error scales with the over-sampling ratio. Also, we consider linear decoding which sets our work apart from [39]. Finally, previous experiments addressing noiseshaping [40] studied the power spectrum of the spike train rather than that of the encoding error. Our work is aimed at understanding biological and computational principles of spike-generation and decoding and is not meant as a substitute for the existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42]. Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: a mean squared error between the injected current waveform and the filtered version of the spike train produced by a model provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43]. As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode is known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study. We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI). R e f e re n c e s 1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7. 2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20. 3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1. 4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96. 5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization. Curr Opin Neurobiol, 1994. 4(4): p. 569-79. 6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86. 7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14. 8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous timevarying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55. 9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97. 10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9. 11. Mainen, Z.F. and T.J. Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6. 12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5. 13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381. 14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175. 15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919. 16. Schreier, R. and G.C. Temes, Understanding delta-sigma data converters2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p. 17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. Inose, H., Y. Yasuda, and J. Murakami, A telemetring system code modulation -  modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE TRans. Commun. Sys., 1962: p. 373-380. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems2002. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networkds, 1993: p. 489-493. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90. Nagel, K.I. and R.I. Wilson, personal communication. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the timevarying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2. Tewksbury, S.K. and R.W. Hallock, Oversample, linear predictive and noiseshaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS25: p. 436-47. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. Siam Journal on Scientific Computing, 1998. 20(1): p. 33-61. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. P SOc Photo-Opt Ins, 2007. 6701: p. 70102. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91. Yu, A.J., Optimal Change-Detection and Spinking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, 2004. 51(10). Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.

5 0.16028675 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

6 0.15463275 195 nips-2012-Learning visual motion in recurrent neural networks

7 0.14677496 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.14653514 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

9 0.14316285 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

10 0.14182049 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

11 0.13867143 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

12 0.12791987 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

13 0.12477428 150 nips-2012-Hierarchical spike coding of sound

14 0.11974318 12 nips-2012-A Neural Autoregressive Topic Model

15 0.11794819 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

16 0.11714027 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

17 0.11707328 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

18 0.11549298 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

19 0.11492036 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

20 0.10712024 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.228), (1, 0.081), (2, -0.167), (3, 0.182), (4, -0.231), (5, 0.234), (6, -0.01), (7, 0.036), (8, 0.054), (9, 0.03), (10, 0.126), (11, 0.112), (12, 0.084), (13, 0.023), (14, 0.05), (15, 0.101), (16, 0.01), (17, 0.052), (18, -0.01), (19, 0.089), (20, -0.019), (21, 0.002), (22, 0.026), (23, -0.004), (24, 0.015), (25, -0.042), (26, 0.004), (27, 0.055), (28, -0.049), (29, 0.019), (30, 0.007), (31, -0.012), (32, -0.012), (33, 0.023), (34, 0.028), (35, 0.02), (36, -0.001), (37, 0.007), (38, -0.018), (39, -0.023), (40, 0.085), (41, 0.006), (42, 0.031), (43, 0.012), (44, 0.038), (45, 0.036), (46, 0.004), (47, -0.015), (48, -0.022), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94853187 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . , T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

2 0.72626269 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

Author: Stefan Habenschuss, Johannes Bill, Bernhard Nessler

Abstract: Recent spiking network models of Bayesian inference and unsupervised learning frequently assume either inputs to arrive in a special format or employ complex computations in neuronal activation functions and synaptic plasticity rules. Here we show in a rigorous mathematical treatment how homeostatic processes, which have previously received little attention in this context, can overcome common theoretical limitations and facilitate the neural implementation and performance of existing models. In particular, we show that homeostatic plasticity can be understood as the enforcement of a ’balancing’ posterior constraint during probabilistic inference and learning with Expectation Maximization. We link homeostatic dynamics to the theory of variational inference, and show that nontrivial terms, which typically appear during probabilistic inference in a large class of models, drop out. We demonstrate the feasibility of our approach in a spiking WinnerTake-All architecture of Bayesian inference and learning. Finally, we sketch how the mathematical framework can be extended to richer recurrent network architectures. Altogether, our theory provides a novel perspective on the interplay of homeostatic processes and synaptic plasticity in cortical microcircuits, and points to an essential role of homeostasis during inference and learning in spiking networks. 1

3 0.70834541 224 nips-2012-Multi-scale Hyper-time Hardware Emulation of Human Motor Nervous System Based on Spiking Neurons using FPGA

Author: C. M. Niu, Sirish Nandyala, Won J. Sohn, Terence Sanger

Abstract: Our central goal is to quantify the long-term progression of pediatric neurological diseases, such as a typical 10-15 years progression of child dystonia. To this purpose, quantitative models are convincing only if they can provide multi-scale details ranging from neuron spikes to limb biomechanics. The models also need to be evaluated in hyper-time, i.e. significantly faster than real-time, for producing useful predictions. We designed a platform with digital VLSI hardware for multiscale hyper-time emulations of human motor nervous systems. The platform is constructed on a scalable, distributed array of Field Programmable Gate Array (FPGA) devices. All devices operate asynchronously with 1 millisecond time granularity, and the overall system is accelerated to 365x real-time. Each physiological component is implemented using models from well documented studies and can be flexibly modified. Thus the validity of emulation can be easily advised by neurophysiologists and clinicians. For maximizing the speed of emulation, all calculations are implemented in combinational logic instead of clocked iterative circuits. This paper presents the methodology of building FPGA modules emulating a monosynaptic spinal loop. Emulated activities are qualitatively similar to real human data. Also discussed is the rationale of approximating neural circuitry by organizing neurons with sparse interconnections. In conclusion, our platform allows emulating pathological abnormalities such that motor symptoms will emerge and can be analyzed. It compels us to test the origins of childhood motor disorders and predict their long-term progressions. 1 Challenges of studying developmental motor disorders There is currently no quantitative model of how a neuropathological condition, which mainly affects the function of neurons, ends up causing the functional abnormalities identified in clinical examinations. The gap in knowledge is particularly evident for disorders in developing human nervous systems, i.e. childhood neurological diseases. In these cases, the ultimate clinical effect of cellu1 lar injury is compounded by a complex interplay among the child’s injury, development, behavior, experience, plasticity, etc. Qualitative insight has been provided by clinical experiences into the association between particular types of injury and particular types of outcome. Their quantitative linkages, nevertheless, have yet to be created – neither in clinic nor in cellular physiological tests. This discrepancy is significantly more prominent for individual child patients, which makes it very difficult to estimate the efficacy of treatment plans. In order to understand the consequence of injury and discover new treatments, it is necessary to create a modeling toolset with certain design guidelines, such that child neurological diseases can be quantitatively analyzed. Perhaps more than any other organ, the brain necessarily operates on multiple spatial and temporal scales. On the one hand, it is the neurons that perform fundamental computations, but neurons have to interact with large-scale organs (ears, eyes, skeletal muscles, etc.) to achieve global functions. This multi-scale nature worths more attention in injuries, where the overall deficits depend on both the cellular effects of injuries and the propagated consequences. On the other hand, neural processes in developmental diseases usually operate on drastically different time scales, e.g. spinal reflex in milliseconds versus learning in years. Thus when studying motor nervous systems, mathematical modeling is convincing only if it can provide multi-scale details, ranging from neuron spikes to limb biomechanics; also the models should be evaluated with time granularity as small as 1 millisecond, meanwhile the evaluation needs to continue trillions of cycles in order to cover years of life. It is particularly challenging to describe the multi-scale nature of human nervous system when modeling childhood movement disorders. Note that for a child who suffered brain injury at birth, the full development of all motor symptoms may easily take more than 10 years. Therefore the millisecondbased model needs to be evaluated significantly faster than real-time, otherwise the model will fail to produce any useful predictions in time. We have implemented realistic models for spiking motoneurons, sensory neurons, neural circuitry, muscle fibers and proprioceptors using VLSI and programmable logic technologies. All models are computed in Field Programmable Gate Array (FPGA) hardware in 365 times real-time. Therefore one year of disease progression can be assessed after one day of emulation. This paper presents the methodology of building the emulation platform. The results demonstrate that our platform is capable of producing physiologically realistic multi-scale signals, which are usually scarce in experiments. Successful emulations enabled by this platform will be used to verify theories of neuropathology. New treatment mechanisms and drug effects can also be emulated before animal experiments or clinical trials. 2 Methodology of multi-scale neural emulation A. Human arm B. Monosynaptic spinal loop C. Inner structure of muscle spindle Gamma Secondary dynamic Gamma output input static Primary input output Bag 1 αMN Bag 2 Chain Figure 1: Illustration of the multi-scale nature of motor nervous system. The motor part of human nervous system is responsible for maintaining body postures and generating voluntary movements. The multi-scale nature of motor nervous system is demonstrated in Fig.1. When the elbow (Fig.1A) is maintaining a posture or performing a movement, a force is established by the involved muscle based on how much spiking excitation the muscle receives from its αmotoneurons (Fig.1B). The α-motoneurons are regulated by a variety of sensory input, part of which comes directly from the proprioceptors in the muscle. As the primary proprioceptor found in skeletal muscles, a muscle spindle is another complex system that has its own microscopic Multiple-InputMultiple-Output structure (Fig.1C). Spindles continuously provide information about the length and lengthening speed of the muscle fiber. A muscle with its regulating motoneurons, sensory neurons and proprioceptors constitutes a monosynaptic spinal loop. This minimalist neurophysiological 2 structure is used as an example for explaining the multi-scale hyper-time emulation in hardware. Additional structures can be added to the backbone set-up using similar methodologies. 2.1 Modularized architecture for multi-scale models Decades of studies on neurophysiology provided an abundance of models characterizing different components of the human motor nervous system. The informational characteristics of physiological components allowed us to model them as functional structures, i.e. each of which converting input signals to certain outputs. In particular, within a monosynaptic spinal loop illustrated in Fig.1B, stretching the muscle will elicit a chain of physiological activities in: muscle stretch ⇒ spindle ⇒ sensory neuron ⇒ synapse ⇒ motoneuron ⇒ muscle contraction. The adjacent components must have compatible interfaces, and the interfacing variables must also be physiologically realistic. In our design, each component is mathematically described in Table 1: Table 1: Functional definition of neural models COMPONENT Neuron Synapse Muscle Spindle MATHEMATICAL DEFINITION S(t) = fneuron (I, t) I(t) = fsynapse (S, t) ˙ T (t) = fmuscle (S, L, L, t) ˙ Γdynamic , Γstatic , t) A(t) = fspindle (L, L, all components are modeled as black-box functions that map the inputs to the outputs. The meanings of these mathematical definitions are explained below. This design allows existing physiological models to be easily inserted and switched. In all models the input signals are time-varying, e.g. I = I(t), L = L(t) , etc. The argument of t in input signals are omitted throughout this paper. 2.2 Selection of models for emulation Models were selected in consideration of their computational cost, physiological verisimilitude, and whether it can be adapted to the mathematical form defined in Table 1. Model of Neuron The informational process for a neuron is to take post-synaptic current I as the input, and produce a binary spike train S in the output. The neuron model adopted in the emulation was developed by Izhikevich [1]: = 0.04v 2 + 5v + 140 − u + I = a(bv − u) v u (1) (2) if v = 30 mV, then v ← c, u ← u + d where a, b, c, d are free parameters tuned to achieve certain firing patterns. Membrane potential v directly determines a binary spike train S(t) that S(t) = 1 if v ≥ 30, otherwise S(t) = 0. Note that v in Izhikevich model is in millivolts and time t is in milliseconds. Therefore the coefficients in eq.1 need to be adjusted in correspondence to SI units. Model of Synapse When a pre-synaptic neuron spikes, i.e. S(0) = 1, an excitatory synapse subsequently issues an Excitatory Post-Synaptic Current (EPSC) that drives the post-synaptic neuron. Neural recording of hair cells in rats [2] provided evidence that the time profile of EPSC can be well characterized using the equations below: I(t) = Vm × e t d Vm −τ 0 t − e− τr Vm if t ≥ 0 (3) otherwise The key parameters in a synapse model is the time constants for rising (τr ) and decaying (τd ). In our emulation τr = 0.001 s and τr = 0.003 s. 3 Model of Muscle force and electromyograph (EMG) The primary effect of skeletal muscle is to convert α-motoneuron spikes S into force T , depending ˙ on the muscle’s instantaneous length L and lengthening speed L. We used Hill’s muscle model in the emulation with parameter tuning described in [3]. Another measurable output of muscle is electromyograph (EMG). EMG is the small skin current polarized by motor unit action potential (MUAP) when it travels along muscle fibers. Models exist to describe the typical waveform picked by surface EMG electrodes. In this project we chose to implement the one described in [4]. Model of Proprioceptor Spindle is a sensory organ that provides the main source of proprioceptive information. As can be seen in Fig.1C, a spindle typically produces two afferent outputs (primary Ia and secondary II) ˙ according to its gamma fusimotor drives (Γdynamic and Γstatic ) and muscle states (L and L). There is currently no closed-form models describing spindle functions due to spindle’s significant nonlinearity. On representative model that numerically approximates the spindle dynamics was developed by Mileusnic et al. [5]. The model used differential equations to characterize a typical cat soleus spindle. Eqs.4-10 present a subset of this model for one type of spindle fiber (bag1): Γdynamic − x0 /τ Γdynamic + Ω2 bag1 x0 ˙ = x1 ˙ = x2 1 = [TSR − TB − TP R − Γ1 x0 ] M x2 ˙ (4) (5) (6) where TSR TB TP R CSS = KSR (L − x1 − LSR0 ) (7) 0.3 = (B0 + B1 x0 ) · (x1 − R) · CSS · |x2 | = KP R (x1 − LP R0 ) 2 = −1 −1000x2 1+e (8) (9) (10) Eq.8 and 10 suggest that evaluating the spindle model requires multiplication, division as well as more complex arithmetics like polynomials and exponentials. The implementation details are described in Section 3. 2.3 Neuron connectivity with sparse interconnections Although the number of spinal neurons (~1 billion) is significantly less compared to that of cortical neurons (~100 billion), a fully connected spinal network still means approximately 2 trillion synaptic endings [6]. Implementing such a huge number of synapses imposes a major challenge, if not impossible, given limited hardware resource. In this platform we approximated the neural connectivity by sparsely connecting sensory neurons to motoneurons as parallel pathways. We do not attempt to introduce the full connectivity. The rationale is that in a neural control system, the effect of a single neuron can be considered as mapping current state x to change in state x through a band-limited channel. Therefore when a collection of ˙ neurons are firing stochastically, the probability of x depends on both x and the firing behavior s ˙ (s = 1 when spiking, otherwise s = 0) of each neuron, as such: p(x|x, s) = p(x|s = 1)p(s = 1|x) + p(x|s = 0)p(s = 0|x) ˙ ˙ ˙ (11) Eq.11 is a master equation that determines a probability flow on the state. From the Kramers-Moyal expansion we can associate this probability flow with a partial differential equation: ∂ p(x, t) ∂t ∞ − = i=1 ∂ ∂x i D(i) (x)p(x, t) (12) where D(i) (x) is a time-invariant term that modifies the change of probability density based on its i-th gradient. 4 Under certain conditions [7, 8], D(i) (x) for i > 2 all vanish and therefore the probability flow can be described deterministically using a linear operator L: ∂ ∂ ∂ 2 (2) D (x) p(x, t) = Lp(x, t) (13) p(x, t) = − D(1) (x) + ∂t ∂x ∂x2 This means that various Ls can be superimposed to achieve complex system dynamics (illustrated in Fig.2A). B. Equivalent network with sparse interconnections A. Neuron function as superimposed linear operators SN Sensory Input + SN SN SN αMN αMN αMN Motor Output αMN Figure 2: Functions of neuron population can be described as the combination of linear operators (A). Therefore the original neural function can be equivalently produced by sparsely connected neurons formalizing parallel pathways (B). As a consequence, the statistical effect of two fully connected neuron populations is equivalent to ones that are only sparsely connected, as long as the probability flow can be described by the same L. For a movement task, in particular, it is the statistical effect from the neuron ensemble onto skeletal muscles that determines the global behavior. Therefore we argue that it is feasible to approximate the spinal cord connectivity by sparsely interconnecting sensory and motor neurons (Fig.2B). Here a pool of homogenous sensory neurons projects to another pool of homogeneous α-motoneurons. Pseudorandom noise is added to the input of all homogeneous neurons within a population. It is worth noting that this approximation significantly reduces the number of synapses that need to be implemented in hardware. 3 Hardware implementation on FPGA We select FPGA as the implementation device due to its inherent parallelism that resembles the nervous system. FPGA is favored over GPU or clustered CPUs because it is relatively easy to network hundreds of nodes under flexible protocols. The platform is distributed on multiple nodes of Xilinx Spartan-6 devices. The interfacing among FPGAs and computers is created using OpalKelly development board XEM6010. The dynamic range of variables is tight in models of Izhikevich neuron, synapse and EMG. This helps maintaining the accuracy of models even when they are evaluated in 32-bit fixed-point arithmetics. The spindle model, in contrast, requires floating-point arithmetics due to its wide dynamic range and complex calculations (see eq.4-10). Hyper-time computations with floating-point numbers are resource consuming and therefore need to be implemented with special attentions. 3.1 Floating-point arithmetics in combinational logic Our arithmetic implementations are compatible with IEEE-754 standard. Typical floating-point arithmetic IP cores are either pipe-lined or based on iterative algorithms such as CORDIC, all of which require clocks to schedule the calculation. In our platform, no clock is provided for model evaluations thus all arithmetics need to be executed in pure combinational logic. Taking advantage of combinational logic allows all model evaluations to be 1) fast, the evaluation time depends entirely on the propagating and settling time of signals, which is on the order of microseconds, and 2) parallel, each model is evaluated on its own circuit without waiting for any other results. Our implementations of adder and multiplier are inspired by the open source project “Free FloatingPoint Madness”, available at http://www.hmc.edu/chips/. Please contact the authors of this paper if the modified code is needed. 5 Fast combinational floating-point division Floating-point division is even more resource demanding than multiplications. We avoided directly implementing the dividing algorithm by approximating it with additions and multiplications. Our approach is inspired by an algorithm described in [9], which provides a good approximation of the inverse square root for any positive number x within one Newton-Raphson iteration: 1 x Q(x) = √ ≈ x(1.5 − · x2 ) 2 x (x > 0) (14) Q(x) can be implemented only using floating-point adders and multipliers. Thereby any division with a positive divisor can be achieved if two blocks of Q(x) are concatenated: a a (15) = √ √ = a · Q(b) · Q(b) (b > 0) b b· b This algorithm has been adjusted to also work with negative divisors (b < 0). Numerical integrators for differential equations Evaluating the instantaneous states of differential equation models require a fixed-step numerical integrator. Backward Euler’s Method was chosen to balance the numerical error and FPGA usage: x ˙ xn+1 = f (x, t) = xn + T f (xn+1 , tn+1 ) (16) (17) where T is the sampling interval. f (x, t) is the derivative function for state variable x. 3.2 Asynchronous spike-based communication between FPGA chips Clock Spike clean count Counter 1 1 2 1 2 3 Figure 3: Timing diagram of asynchronous spike-based communication FPGA nodes are networked by transferring 1-bit binary spikes to each other. Our design allowed the sender and the receiver to operate on independent clocks without having to synchronize. The timing diagram of the spike-based communication is shown in Fig.3. The sender issues Spike with a pulse width of 1/(365 × Femu ) second. Each Spike then triggers a counting event on the receiver, meanwhile each Clock first reads the accumulated spike count and subsequently cleans the counter. Note that the phase difference between Spike and Clock is not predictable due to asynchronicity. 3.3 Serialize neuron evaluations within a homogeneous population Different neuron populations are instantiated as standalone circuits. Within in each population, however, homogeneous neurons mentioned in Section 2.3 are evaluated in series in order to optimize FPGA usage. Within each FPGA node all modules operate with a central clock, which is the only source allowed to trigger any updating event. Therefore the maximal number of neurons that can be serialized (Nserial ) is restrained by the following relationship: Ffpga = C × Nserial × 365 × Femu (18) Here Ffpga is the fastest clock rate that a FPGA can operate on; C = 4 is the minimal clock cycles needed for updating each state variable in the on-chip memory; Femu = 1 kHz is the time granularity of emulation (1 millisecond), and 365 × Femu represents 365x real-time. Consider that Xilinx 6 Spartan-6 FPGA devices peaks at 200MHz central clock frequency, the theoretical maximum of neurons that can be serialized is Nserial 200 MHz/(4 × 365 × 1 kHz) ≈ 137 (19) In the current design we choose Nserial = 128. 4 Results: emulated activities of motor nervous system Figure 4 shows the implemented monosynaptic spinal loop in schematics and in operation. Each FPGA node is able to emulate monosynaptic spinal loops consisting of 1,024 sensory and 1,024 motor neurons, i.e. 2,048 neurons in total. The spike-based asynchronous communication is successful between two FPGA nodes. Note that the emulation has to be significantly slowed down for on-line plotting. When the emulation is at full speed (365x real-time) the software front-end is not able to visualize the signals due to limited data throughput. 128 SNs 128 αMNs SN αMN 128 SNs 128 αMNs SN αMN ... 8 parallel pathways 2,048 neurons Figure 4: The neural emulation platform in operation. Left: Neural circuits implemented for each FPGA node including 2,048 neurons. SN = Sensory Neuron; αMN = α-motoneuron. Center: One working FPGA node. Right: Two FPGA nodes networked using asynchronous spiking protocol. The emulation platform successfully created multi-scale information when the muscle is externally stretched (Fig.5A). We also tested if our emulated motor system is able to produce the recruitment order and size principles observed in real physiological data. It has been well known that when a voluntary motor command is sent to the α-motoneuron pool, the motor units are recruited in an order that small ones get recruited first, followed by the big ones [10]. The comparison between our results and real data are shown in Fig.5B, where the top panel shows 20 motor unit activities emulated using our platform, and the bottom panel shows decoded motor unit activities from real human EMG [11]. No qualitative difference was found. 5 Discussion and future work We designed a hardware platform for emulating the multi-scale motor nervous activities in hypertime. We managed to use one node of single Xilinx Spartan-6 FPGA to emulate monosynaptic spinal loops consisting of 2,048 neurons, associated muscles and proprioceptors. The neurons are organized as parallel pathways with sparse interconnections. The emulation is successfully accelerated to 365x real-time. The platform can be scaled by networking multiple FPGA nodes, which is enabled by an asynchronous spike-based communication protocol. The emulated monosynaptic spinal loops are capable of producing reflex-like activities in response to muscle stretch. Our results of motor unit recruitment order are compatible with the physiological data collected in real human subjects. There is a question of whether this stochastic system turns out chaotic, especially with accumulated errors from Backward Euler’s integrator. Note that the firing property of a neuron population is usually stable even with explicit noise [8], and spindle inputs are measured from real robots so the integrator errors are corrected at every iteration. To our knowledge, the system is not critically sensitive to the initial conditions or integrator errors. This question, however, is both interesting and important for in-depth investigations in the future. 7 It has been shown [12] that replicating classic types of spinal interneurons (propriospinal, Iaexcitatory, Ia-inhibitory, Renshaw, etc.) is sufficient to produce stabilizing responses and rapid reaching movement in a wrist. Our platform will introduce those interneurons to describe the known spinal circuitry in further details. Physiological models will also be refined as needed. For the purpose of modeling movement behavior or diseases, Izhikevich model is a good balance between verisimilitude and computational cost. Nevertheless when testing drug effects along disease progression, neuron models are expected to cover sufficient molecular details including how neurotransmitters affect various ion channels. With the advancing of programmable semiconductor technology, it is expected to upgrade our neuron model to Hodgkin-Huxley’s. For the muscle models, Hill’s type of model does not fit the muscle properties accurately enough when the muscle is being shortened. Alternative models will be tested. Other studies showed that the functional dexterity of human limbs – especially in the hands – is critically enabled by the tendon configurations and joint geometry [13]. As a result, if our platform is used to understand whether known neurophysiology and biomechanics are sufficient to produce able and pathological movements, it will be necessary to use this platform to control human-like limbs. Since the emulation speed can be flexibly adjusted from arbitrarily slow to 365x real-time, when speeded to exactly 1x real-time the platform will function as a digital controller with 1kHz refresh rate. The main purpose of the emulation is to learn how certain motor disorders progress during childhood development. This first requires the platform to reproduce motor symptoms that are compatible with clinical observations. For example it has been suggested that muscle spasticity in rats is associated with decreased soma size of α-motoneurons [14], which presumably reduced the firing threshold of neurons. Thus when lower firing threshold is introduced to the emulated motoneuron pool, similar EMG patterns as in [15] should be observed. It is also necessary for the symptoms to evolve with neural plasticity. In the current version we presume that the structure of each component remains time invariant. In the future work Spike Timing Dependent Plasticity (STDP) will be introduced such that all components are subject to temporal modifications. B. Verify motor unit recruitment pattern A. Multi-scale activities from emulation Emulation 1s Stretch Spindle Ia Sensory post-synaptic current Real Data Motoneurons Muscle Force EMG Figure 5: A) Physiological activity emulated by each model when the muscle is sinusoidally stretched. B) Comparing the emulated motor unit recruitment order with real experimental data. Acknowledgments The authors thank Dr. Gerald Loeb for helping set up the emulation of spindle models. This project is supported by NIH NINDS grant R01NS069214-02. 8 References [1] Izhikevich, E. M. Simple model of spiking neurons. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council 14, 1569–1572 (2003). [2] Glowatzki, E. & Fuchs, P. A. Transmitter release at the hair cell ribbon synapse. Nature neuroscience 5, 147–154 (2002). [3] Shadmehr, R. & Wise, S. P. A Mathematical Muscle Model. In Supplementary documents for “Computational Neurobiology of Reaching and Pointing”, 1–18 (MIT Press, Cambridge, MA, 2005). [4] Fuglevand, A. J., Winter, D. A. & Patla, A. E. Models of recruitment and rate coding organization in motor-unit pools. Journal of neurophysiology 70, 2470–2488 (1993). [5] Mileusnic, M. P., Brown, I. E., Lan, N. & Loeb, G. E. Mathematical models of proprioceptors. I. Control and transduction in the muscle spindle. Journal of neurophysiology 96, 1772–1788 (2006). [6] Gelfan, S., Kao, G. & Ruchkin, D. S. The dendritic tree of spinal neurons. The Journal of comparative neurology 139, 385–411 (1970). [7] Sanger, T. D. Neuro-mechanical control using differential stochastic operators. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, 4494–4497 (2010). [8] Sanger, T. D. Distributed control of uncertain systems using superpositions of linear operators. Neural computation 23, 1911–1934 (2011). [9] Lomont, C. Fast inverse square root (2003). URL http://www.lomont.org/Math/Papers/ 2003/InvSqrt.pdf. [10] Henneman, E. Relation between size of neurons and their susceptibility to discharge. Science (New York, N.Y.) 126, 1345–1347 (1957). [11] De Luca, C. J. & Hostage, E. C. Relationship between firing rate and recruitment threshold of motoneurons in voluntary isometric contractions. Journal of neurophysiology 104, 1034–1046 (2010). [12] Raphael, G., Tsianos, G. A. & Loeb, G. E. Spinal-like regulator facilitates control of a two-degree-offreedom wrist. The Journal of neuroscience : the official journal of the Society for Neuroscience 30, 9431–9444 (2010). [13] Valero-Cuevas, F. J. et al. The tendon network of the fingers performs anatomical computation at a macroscopic scale. IEEE transactions on bio-medical engineering 54, 1161–1166 (2007). [14] Brashear, A. & Elovic, E. Spasticity: Diagnosis and Management (Demos Medical, 2010), 1 edn. [15] Levin, M. F. & Feldman, A. G. The role of stretch reflex threshold regulation in normal and impaired motor control. Brain research 657, 23–30 (1994). 9

4 0.70513189 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

Abstract: How can neural networks learn to represent information optimally? We answer this question by deriving spiking dynamics and learning dynamics directly from a measure of network performance. We find that a network of integrate-and-fire neurons undergoing Hebbian plasticity can learn an optimal spike-based representation for a linear decoder. The learning rule acts to minimise the membrane potential magnitude, which can be interpreted as a representation error after learning. In this way, learning reduces the representation error and drives the network into a robust, balanced regime. The network becomes balanced because small representation errors correspond to small membrane potentials, which in turn results from a balance of excitation and inhibition. The representation is robust because neurons become self-correcting, only spiking if the representation error exceeds a threshold. Altogether, these results suggest that several observed features of cortical dynamics, such as excitatory-inhibitory balance, integrate-and-fire dynamics and Hebbian plasticity, are signatures of a robust, optimal spike-based code. A central question in neuroscience is to understand how populations of neurons represent information and how they learn to do so. Usually, learning and information representation are treated as two different functions. From the outset, this separation seems like a good idea, as it reduces the problem into two smaller, more manageable chunks. Our approach, however, is to study these together. This allows us to treat learning and information representation as two sides of a single mechanism, operating at two different timescales. Experimental work has given us several clues about the regime in which real networks operate in the brain. Some of the most prominent observations are: (a) high trial-to-trial variability—a neuron responds differently to repeated, identical inputs [1, 2]; (b) asynchronous firing at the network level—spike trains of different neurons are at most very weakly correlated [3, 4, 5]; (c) tight balance of excitation and inhibition—every excitatory input is met by an inhibitory input of equal or greater size [6, 7, 8] and (4) spike-timing-dependent plasticity (STDP)—the strength of synapses change as a function of presynaptic and postsynaptic spike times [9]. Previously, it has been shown that observations (a)–(c) can be understood as signatures of an optimal, spike-based code [10, 11]. The essential idea is to derive spiking dynamics from the assumption that neurons only fire if their spike improves information representation. Information in a network may ∗ Authors contributed equally 1 originate from several possible sources: external sensory input, external neural network input, or alternatively, it may originate within the network itself as a memory, or as a computation. Whatever the source, this initial assumption leads directly to the conclusion that a network of integrate-and-fire neurons can optimally represent a signal while exhibiting properties (a)–(c). A major problem with this framework is that network connectivity must be completely specified a priori, and requires the tuning of N 2 parameters, where N is the number of neurons in the network. Although this is feasible mathematically, it is unclear how a real network could tune itself into this optimal regime. In this work, we solve this problem using a simple synaptic learning rule. The key insight is that the plasticity rule can be derived from the same basic principle as the spiking rule in the earlier work—namely, that any change should improve information representation. Surprisingly, this can be achieved with a local, Hebbian learning rule, where synaptic plasticity is proportional to the product of presynaptic firing rates with post-synaptic membrane potentials. Spiking and synaptic plasticity then work hand in hand towards the same goal: the spiking of a neuron decreases the representation error on a fast time scale, thereby giving rise to the actual population representation; synaptic plasticity decreases the representation error on a slower time scale, thereby improving or maintaining the population representation. For a large set of initial connectivities and spiking dynamics, neural networks are driven into a balanced regime, where excitation and inhibition cancel each other and where spike trains are asynchronous and irregular. Furthermore, the learning rule that we derive reproduces the main features of STDP (property (d) above). In this way, a network can learn to represent information optimally, with synaptic, neural and network dynamics consistent with those observed experimentally. 1 Derivation of the learning rule for a single neuron We begin by deriving a learning rule for a single neuron with an autapse (a self-connection) (Fig. 1A). Our approach is to derive synaptic dynamics for the autapse and spiking dynamics for the neuron such that the neuron learns to optimally represent a time-varying input signal. We will derive a learning rule for networks of neurons later, after we have developed the fundamental concepts for the single neuron case. Our first step is to derive optimal spiking dynamics for the neuron, so that we have a target for our learning rule. We do this by making two simple assumptions [11]. First, we assume that the neuron can provide an estimate or read-out x(t) of a time-dependent signal x(t) by filtering its spike train ˆ o(t) as follows: ˙ x(t) = −ˆ(t) + Γo(t), ˆ x (1) where Γ is a fixed read-out weight, which we will refer to as the neuron’s “output kernel” and the spike train can be written as o(t) = i δ(t − ti ), where {ti } are the spike times. Next, we assume that the neuron only produces a spike if that spike improves the read-out, where we measure the read-out performance through a simple squared-error loss function: 2 L(t) = x(t) − x(t) . ˆ (2) With these two assumptions, we can now derive optimal spiking dynamics. First, we observe that if the neuron produces an additional spike at time t, the read-out increases by Γ, and the loss function becomes L(t|spike) = (x(t) − (x(t) + Γ))2 . This allows us to restate our spiking rule as follows: ˆ the neuron should only produce a spike if L(t|no spike) > L(t|spike), or (x(t) − x(t))2 > (x(t) − ˆ (x(t) + Γ))2 . Now, squaring both sides of this inequality, defining V (t) ≡ Γ(x(t) − x(t)) and ˆ ˆ defining T ≡ Γ2 /2 we find that the neuron should only spike if: V (t) > T. (3) We interpret V (t) to be the membrane potential of the neuron, and we interpret T as the spike threshold. This interpretation allows us to understand the membrane potential functionally: the voltage is proportional to a prediction error—the difference between the read-out x(t) and the actual ˆ signal x(t). A spike is an error reduction mechanism—the neuron only spikes if the error exceeds the spike threshold. This is a greedy minimisation, in that the neuron fires a spike whenever that action decreases L(t) without considering the future impact of that spike. Importantly, the neuron does not require direct access to the loss function L(t). 2 To determine the membrane potential dynamics, we take the derivative of the voltage, which gives ˙ ˙ us V = Γ(x − x). (Here, and in the following, we will drop the time index for notational brevity.) ˙ ˆ ˙ Now, using Eqn. (1) we obtain V = Γx − Γ(−x + Γo) = −Γ(x − x) + Γ(x + x) − Γ2 o, so that: ˙ ˆ ˆ ˙ ˙ V = −V + Γc − Γ2 o, (4) where c = x + x is the neural input. This corresponds exactly to the dynamics of a leaky integrate˙ and-fire neuron with an inhibitory autapse1 of strength Γ2 , and a feedforward connection strength Γ. The dynamics and connectivity guarantee that a neuron spikes at just the right times to optimise the loss function (Fig. 1B). In addition, it is especially robust to noise of different forms, because of its error-correcting nature. If x is constant in time, the voltage will rise up to the threshold T at which point a spike is fired, adding a delta function to the spike train o at time t, thereby producing a read-out x that is closer to x and causing an instantaneous drop in the voltage through the autapse, ˆ by an amount Γ2 = 2T , effectively resetting the voltage to V = −T . We now have a target for learning—we know the connection strength that a neuron must have at the end of learning if it is to represent information optimally, for a linear read-out. We can use this target to derive synaptic dynamics that can learn an optimal representation from experience. Specifically, we consider an integrate-and-fire neuron with some arbitrary autapse strength ω. The dynamics of this neuron are given by ˙ V = −V + Γc − ωo. (5) This neuron will not produce the correct spike train for representing x through a linear read-out (Eqn. (1)) unless ω = Γ2 . Our goal is to derive a dynamical equation for the synapse ω so that the spike train becomes optimal. We do this by quantifying the loss that we are incurring by using the suboptimal strength, and then deriving a learning rule that minimises this loss with respect to ω. The loss function underlying the spiking dynamics determined by Eqn. (5) can be found by reversing the previous membrane potential analysis. First, we integrate the differential equation for V , assuming that ω changes on time scales much slower than the membrane potential. We obtain the following (formal) solution: V = Γx − ω¯, o (6) ˙ where o is determined by o = −¯ + o. The solution to this latter equation is o = h ∗ o, a convolution ¯ ¯ o ¯ of the spike train with the exponential kernel h(τ ) = θ(τ ) exp(−τ ). As such, it is analogous to the instantaneous firing rate of the neuron. Now, using Eqn. (6), and rewriting the read-out as x = Γ¯, we obtain the loss incurred by the ˆ o sub-optimal neuron, L = (x − x)2 = ˆ 1 V 2 + 2(ω − Γ2 )¯ + (ω − Γ2 )2 o2 . o ¯ Γ2 (7) We observe that the last two terms of Eqn. (7) will vanish whenever ω = Γ2 , i.e., when the optimal reset has been found. We can therefore simplify the problem by defining an alternative loss function, 1 2 V , (8) 2 which has the same minimum as the original loss (V = 0 or x = x, compare Eqn. (2)), but yields a ˆ simpler learning algorithm. We can now calculate how changes to ω affect LV : LV = ∂LV ∂V ∂o ¯ =V = −V o − V ω ¯ . (9) ∂ω ∂ω ∂ω We can ignore the last term in this equation (as we will show below). Finally, using simple gradient descent, we obtain a simple Hebbian-like synaptic plasticity rule: τω = − ˙ ∂LV = V o, ¯ ∂ω (10) where τ is the learning time constant. 1 This contribution of the autapse can also be interpreted as the reset of an integrate-and-fire neuron. Later, when we generalise to networks of neurons, we shall employ this interpretation. 3 This synaptic learning rule is capable of learning the synaptic weight ω that minimises the difference between x and x (Fig. 1B). During learning, the synaptic weight changes in proportion to the postˆ synaptic voltage V and the pre-synaptic firing rate o (Fig. 1C). As such, this is a Hebbian learning ¯ rule. Of course, in this single neuron case, the pre-synaptic neuron and post-synaptic neuron are the same neuron. The synaptic weight gradually approaches its optimal value Γ2 . However, it never completely stabilises, because learning never stops as long as neurons are spiking. Instead, the synapse oscillates closely about the optimal value (Fig. 1D). This is also a “greedy” learning rule, similar to the spiking rule, in that it seeks to minimise the error at each instant in time, without regard for the future impact of those changes. To demonstrate that the second term in Eqn. (5) can be neglected we note that the equations for V , o, and ω define a system ¯ of coupled differential equations that can be solved analytically by integrating between spikes. This results in a simple recurrence relation for changes in ω from the ith to the (i + 1)th spike, ωi+1 = ωi + ωi (ωi − 2T ) . τ (T − Γc − ωi ) (11) This iterative equation has a single stable fixed point at ω = 2T = Γ2 , proving that the neuron’s autaptic weight or reset will approach the optimal solution. 2 Learning in a homogeneous network We now generalise our learning rule derivation to a network of N identical, homogeneously connected neurons. This generalisation is reasonably straightforward because many characteristics of the single neuron case are shared by a network of identical neurons. We will return to the more general case of heterogeneously connected neurons in the next section. We begin by deriving optimal spiking dynamics, as in the single neuron case. This provides a target for learning, which we can then use to derive synaptic dynamics. As before, we want our network to produce spikes that optimally represent a variable x for a linear read-out. We assume that the read-out x is provided by summing and filtering the spike trains of all the neurons in the network: ˆ ˙ x = −ˆ + Γo, ˆ x (12) 2 where the row vector Γ = (Γ, . . . , Γ) contains the read-out weights of the neurons and the column vector o = (o1 , . . . , oN ) their spike trains. Here, we have used identical read-out weights for each neuron, because this indirectly leads to homogeneous connectivity, as we will demonstrate. Next, we assume that a neuron only spikes if that spike reduces a loss-function. This spiking rule is similar to the single neuron spiking rule except that this time there is some ambiguity about which neuron should spike to represent a signal. Indeed, there are many different spike patterns that provide exactly the same estimate x. For example, one neuron could fire regularly at a high rate (exactly like ˆ our previous single neuron example) while all others are silent. To avoid this firing rate ambiguity, we use a modified loss function, that selects amongst all equivalent solutions, those with the smallest neural firing rates. We do this by adding a ‘metabolic cost’ term to our loss function, so that high firing rates are penalised: ¯ L = (x − x)2 + µ o 2 , ˆ (13) where µ is a small positive constant that controls the cost-accuracy trade-off, akin to a regularisation parameter. Each neuron in the optimal network will seek to reduce this loss function by firing a spike. Specifically, the ith neuron will spike whenever L(no spike in i) > L(spike in i). This leads to the following spiking rule for the ith neuron: Vi > Ti (14) where Vi ≡ Γ(x − x) − µoi and Ti ≡ Γ2 /2 + µ/2. We can naturally interpret Vi as the membrane ˆ potential of the ith neuron and Ti as the spiking threshold of that neuron. As before, we can now derive membrane potential dynamics: ˙ V = −V + ΓT c − (ΓT Γ + µI)o, 2 (15) The read-out weights must scale as Γ ∼ 1/N so that firing rates are not unrealistically small in large networks. We can see this by calculating the average firing rate N oi /N ≈ x/(ΓN ) ∼ O(N/N ) ∼ O(1). i=1 ¯ 4 where I is the identity matrix and ΓT Γ + µI is the network connectivity. We can interpret the selfconnection terms {Γ2 +µ} as voltage resets that decrease the voltage of any neuron that spikes. This optimal network is equivalent to a network of identical integrate-and-fire neurons with homogeneous inhibitory connectivity. The network has some interesting dynamical properties. The voltages of all the neurons are largely synchronous, all increasing to the spiking threshold at about the same time3 (Fig. 1F). Nonetheless, neural spiking is asynchronous. The first neuron to spike will reset itself by Γ2 + µ, and it will inhibit all the other neurons in the network by Γ2 . This mechanism prevents neurons from spik- x 3 The first neuron to spike will be random if there is some membrane potential noise. V (A) (B) x x ˆ x 10 1 0.1 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 1 D 0.5 V V 0 ˆ x V ˆ x (C) 1 0 1 2 0 0.625 25 25.625 (D) start of learning 1 V 50 200.625 400 400.625 1 2.4 O 1.78 ω 1.77 25 neuron$ 0 1 2 !me$ 3 4 25 1 5 V 400.625 !me$ (F) 25 1 2.35 1.05 1.049 400 25.625 !me$ (E) neuron$ 100.625 200 end of learning 1.4 1.35 ω 100 !me$ 1 V 1 O 50.625 0 1 2 !me$ 3 4 5 V !me$ !me$ Figure 1: Learning in a single neuron and a homogeneous network. (A) A single neuron represents an input signal x by producing an output x. (B) During learning, the single neuron output x (solid red ˆ ˆ line, top panel) converges towards the input x (blue). Similarly, for a homogeneous network the output x (dashed red line, top panel) converges towards x. Connectivity also converges towards optimal ˆ connectivity in both the single neuron case (solid black line, middle panel) and the homogeneous net2 2 work case (dashed black line, middle panel), as quantified by D = maxi,j ( Ωij − Ωopt / Ωopt ) ij ij at each point in time. Consequently, the membrane potential reset (bottom panel) converges towards the optimal reset (green line, bottom panel). Spikes are indicated by blue vertical marks, and are produced when the membrane potential reaches threshold (bottom panel). Here, we have rescaled time, as indicated, for clarity. (C) Our learning rule dictates that the autapse ω in our single neuron (bottom panel) changes in proportion to the membrane potential (top panel) and the firing rate (middle panel). (D) At the end of learning, the reset ω fluctuates weakly about the optimal value. (E) For a homogeneous network, neurons spike regularly at the start of learning, as shown in this raster plot. Membrane potentials of different neurons are weakly correlated. (F) At the end of learning, spiking is very irregular and membrane potentials become more synchronous. 5 ing synchronously. The population as a whole acts similarly to the single neuron in our previous example. Each neuron fires regularly, even if a different neuron fires in every integration cycle. The design of this optimal network requires the tuning of N (N − 1) synaptic parameters. How can an arbitrary network of integrate-and-fire neurons learn this optimum? As before, we address this question by using the optimal network as a target for learning. We start with an arbitrarily connected network of integrate-and-fire neurons: ˙ V = −V + ΓT c − Ωo, (16) where Ω is a matrix of connectivity weights, which includes the resets of the individual neurons. Assuming that learning occurs on a slow time scale, we can rewrite this equation as V = ΓT x − Ω¯ . o (17) Now, repeating the arguments from the single neuron derivation, we modify the loss function to obtain an online learning rule. Specifically, we set LV = V 2 /2, and calculate the gradient: ∂LV = ∂Ωij Vk k ∂Vk =− ∂Ωij Vk δki oj − ¯ k Vk Ωkl kl ∂ ol ¯ . ∂Ωij (18) We can simplify this equation considerably by observing that the contribution of the second summation is largely averaged out under a wide variety of realistic conditions4 . Therefore, it can be neglected, and we obtain the following local learning rule: ∂LV ˙ = V i oj . ¯ τ Ωij = − ∂Ωij (19) This is a Hebbian plasticity rule, whereby connectivity changes in proportion to the presynaptic firing rate oj and post-synaptic membrane potential Vi . We assume that the neural thresholds are set ¯ to a constant T and that the neural resets are set to their optimal values −T . In the previous section we demonstrated that these resets can be obtained by a Hebbian plasticity rule (Eqn. (10)). This learning rule minimises the difference between the read-out and the signal, by approaching the optimal recurrent connection strengths for the network (Fig. 1B). As in the single neuron case, learning does not stop, so the connection strengths fluctuate close to their optimal value. During learning, network activity becomes progressively more asynchronous as it progresses towards optimal connectivity (Fig. 1E, F). 3 Learning in the general case Now that we have developed the fundamental concepts underlying our learning rule, we can derive a learning rule for the more general case of a network of N arbitrarily connected leaky integrateand-fire neurons. Our goal is to understand how such networks can learn to optimally represent a ˙ J-dimensional signal x = (x1 , . . . , xJ ), using the read-out equation x = −x + Γo. We consider a network with the following membrane potential dynamics: ˙ V = −V + ΓT c − Ωo, (20) where c is a J-dimensional input. We assume that this input is related to the signal according to ˙ c = x + x. This assumption can be relaxed by treating the input as the control for an arbitrary linear dynamical system, in which case the signal represented by the network is the output of such a computation [11]. However, this further generalisation is beyond the scope of this work. As before, we need to identify the optimal recurrent connectivity so that we have a target for learning. Most generally, the optimal recurrent connectivity is Ωopt ≡ ΓT Γ + µI. The output kernels of the individual neurons, Γi , are given by the rows of Γ, and their spiking thresholds by Ti ≡ Γi 2 /2 + 4 From the definition of the membrane potential we can see that Vk ∼ O(1/N ) because Γ ∼ 1/N . Therefore, the size of the first term in Eqn. (18) is k Vk δki oj = Vi oj ∼ O(1/N ). Therefore, the second term can ¯ ¯ be ignored if kl Vk Ωkl ∂ ol /∂Ωij ¯ O(1/N ). This happens if Ωkl O(1/N 2 ) as at the start of learning. It also happens towards the end of learning if the terms {Ωkl ∂ ol /∂Ωij } are weakly correlated with zero mean, ¯ or if the membrane potentials {Vi } are weakly correlated with zero mean. 6 µ/2. With these connections and thresholds, we find that a network of integrate-and-fire neurons ˆ ¯ will produce spike trains in such a way that the loss function L = x − x 2 + µ o 2 is minimised, ˆ where the read-out is given by x = Γ¯ . We can show this by prescribing a greedy5 spike rule: o a spike is fired by neuron i whenever L(no spike in i) > L(spike in i) [11]. The resulting spike generation rule is Vi > Ti , (21) ˆ where Vi ≡ ΓT (x − x) − µ¯i is interpreted as the membrane potential. o i 5 Despite being greedy, this spiking rule can generate firing rates that are practically identical to the optimal solutions: we checked this numerically in a large ensemble of networks with randomly chosen kernels. (A) x1 … x … 1 1 (B) xJJ x 10 L 10 T T 10 4 6 8 1 Viii V D ˆˆ ˆˆ x11 xJJ x x F 0.5 0 0.4 … … 0.2 0 0 2000 4000 !me   (C) x V V 1 x 10 x 3 ˆ x 8 0 x 10 1 2 3 !me   4 5 4 0 1 4 0 1 8 V (F) Ρ(Δt)   E-­‐I  input   0.4 ˆ x 0 3 0 1 x 10 1.3 0.95 x 10 ˆ x 4 V (E) 1 x 0 end of learning 50 neuron neuron 50 !me   2 0 ˆ x 0 0.5 ISI  Δt     1 2 !me   4 5 4 1.5 1.32 3 2 0.1 Ρ(Δt)   x E-­‐I  input   (D) start of learning 0 2 !me   0 0 0.5 ISI  Δt   1 Figure 2: Learning in a heterogeneous network. (A) A network of neurons represents an input ˆ signal x by producing an output x. (B) During learning, the loss L decreases (top panel). The difference between the connection strengths and the optimal strengths also decreases (middle panel), as 2 2 quantified by the mean difference (solid line), given by D = Ω − Ωopt / Ωopt and the maxi2 2 mum difference (dashed line), given by maxi,j ( Ωij − Ωopt / Ωopt ). The mean population firing ij ij rate (solid line, bottom panel) also converges towards the optimal firing rate (dashed line, bottom panel). (C, E) Before learning, a raster plot of population spiking shows that neurons produce bursts ˆ of spikes (upper panel). The network output x (red line, middle panel) fails to represent x (blue line, middle panel). The excitatory input (red, bottom left panel) and inhibitory input (green, bottom left panel) to a randomly selected neuron is not tightly balanced. Furthermore, a histogram of interspike intervals shows that spiking activity is not Poisson, as indicated by the red line that represents a best-fit exponential distribution. (D, F) At the end of learning, spiking activity is irregular and ˆ Poisson-like, excitatory and inhibitory input is tightly balanced and x matches x. 7 How can we learn this optimal connection matrix? As before, we can derive a learning rule by minimising the cost function LV = V 2 /2. This leads to a Hebbian learning rule with the same form as before: ˙ τ Ωij = Vi oj . ¯ (22) Again, we assume that the neural resets are given by −Ti . Furthermore, in order for this learning rule to work, we must assume that the network input explores all possible directions in the J-dimensional input space (since the kernels Γi can point in any of these directions). The learning performance does not critically depend on how the input variable space is sampled as long as the exploration is extensive. In our simulations, we randomly sample the input c from a Gaussian white noise distribution at every time step for the entire duration of the learning. We find that this learning rule decreases the loss function L, thereby approaching optimal network connectivity and producing optimal firing rates for our linear decoder (Fig. 2B). In this example, we have chosen connectivity that is initially much too weak at the start of learning. Consequently, the initial network behaviour is similar to a collection of unconnected single neurons that ignore each other. Spike trains are not Poisson-like, firing rates are excessively large, excitatory and inhibitory ˆ input is unbalanced and the decoded variable x is highly unreliable (Fig. 2C, E). As a result of learning, the network becomes tightly balanced and the spike trains become asynchronous, irregular and Poisson-like with much lower rates (Fig. 2D, F). However, despite this apparent variability, the population representation is extremely precise, only limited by the the metabolic cost and the discrete nature of a spike. This learnt representation is far more precise than a rate code with independent Poisson spike trains [11]. In particular, shuffling the spike trains in response to identical inputs drastically degrades this precision. 4 Conclusions and Discussion In population coding, large trial-to-trial spike train variability is usually interpreted as noise [2]. We show here that a deterministic network of leaky integrate-and-fire neurons with a simple Hebbian plasticity rule can self-organise into a regime where information is represented far more precisely than in noisy rate codes, while appearing to have noisy Poisson-like spiking dynamics. Our learning rule (Eqn. (22)) has the basic properties of STDP. Specifically, a presynaptic spike occurring immediately before a post-synaptic spike will potentiate a synapse, because membrane potentials are positive immediately before a postsynaptic spike. Furthermore, a presynaptic spike occurring immediately after a post-synaptic spike will depress a synapse, because membrane potentials are always negative immediately after a postsynaptic spike. This is similar in spirit to the STDP rule proposed in [12], but different to classical STDP, which depends on post-synaptic spike times [9]. This learning rule can also be understood as a mechanism for generating a tight balance between excitatory and inhibitory input. We can see this by observing that membrane potentials after learning can be interpreted as representation errors (projected onto the read-out kernels). Therefore, learning acts to minimise the magnitude of membrane potentials. Excitatory and inhibitory input must be balanced if membrane potentials are small, so we can equate balance with optimal information representation. Previous work has shown that the balanced regime produces (quasi-)chaotic network dynamics, thereby accounting for much observed cortical spike train variability [13, 14, 4]. Moreover, the STDP rule has been known to produce a balanced regime [16, 17]. Additionally, recent theoretical studies have suggested that the balanced regime plays an integral role in network computation [15, 13]. In this work, we have connected these mechanisms and functions, to conclude that learning this balance is equivalent to the development of an optimal spike-based population code, and that this learning can be achieved using a simple Hebbian learning rule. Acknowledgements We are grateful for generous funding from the Emmy-Noether grant of the Deutsche Forschungsgemeinschaft (CKM) and the Chaire d’excellence of the Agence National de la Recherche (CKM, DB), as well as a James Mcdonnell Foundation Award (SD) and EU grants BACS FP6-IST-027140, BIND MECT-CT-20095-024831, and ERC FP7-PREDSPIKE (SD). 8 References [1] Tolhurst D, Movshon J, and Dean A (1982) The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res 23: 775–785. [2] Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci 18(10): 3870–3896. [3] Zohary E, Newsome WT (1994) Correlated neuronal discharge rate and its implication for psychophysical performance. Nature 370: 140–143. [4] Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, & Harris, KD (2010) The asynchronous state in cortical circuits. Science 327, 587–590. [5] Ecker AS, Berens P, Keliris GA, Bethge M, Logothetis NK, Tolias AS (2010) Decorrelated neuronal firing in cortical microcircuits. Science 327: 584–587. [6] Okun M, Lampl I (2008) Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activities. Nat Neurosci 11, 535–537. [7] Shu Y, Hasenstaub A, McCormick DA (2003) Turning on and off recurrent balanced cortical activity. Nature 423, 288–293. [8] Gentet LJ, Avermann M, Matyas F, Staiger JF, Petersen CCH (2010) Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron 65: 422–435. [9] Caporale N, Dan Y (2008) Spike-timing-dependent plasticity: a Hebbian learning rule. Annu Rev Neurosci 31: 25–46. [10] Boerlin M, Deneve S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7, e1001080. [11] Boerlin M, Machens CK, Deneve S (2012) Predictive coding of dynamic variables in balanced spiking networks. under review. [12] Clopath C, B¨ sing L, Vasilaki E, Gerstner W (2010) Connectivity reflects coding: a model of u voltage-based STDP with homeostasis. Nat Neurosci 13(3): 344–352. [13] van Vreeswijk C, Sompolinsky H (1998) Chaotic balanced state in a model of cortical circuits. Neural Comput 10(6): 1321–1371. [14] Brunel N (2000) Dynamics of sparsely connected networks of excitatory and inhibitory neurons. J Comput Neurosci 8, 183–208. [15] Vogels TP, Rajan K, Abbott LF (2005) Neural network dynamics. Annu Rev Neurosci 28: 357–376. [16] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W. (2011) Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science 334(6062):1569– 73. [17] Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timingdependent synaptic plasticity. Nat Neurosci 3(9): 919–926. 9

5 0.68525678 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spikegeneration mechanism is an order of magnitude greater than the cut -off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter. The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with the cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11]. Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering or that the high frequency harmonics are so strong in the pre-synaptic spike that despite attenuation, their trace is still present. Although in some cases, these explanations could apply, for the majority of synaptic inputs to typical neurons there is a glaring mismatch. The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all- or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, socalled over-sampling. Although the spike generation mechanism operates in continuous time, the high temporal precision of the spikegeneration mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal. Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15]. Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synatpic currents (black), whose amplitude reflects synaptic weights, thus converting digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note one order of magnitude gap between the cut off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal. In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode incoming analog current waveform in the digital spike train [13]. However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic  modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising. 2 . Oversampling and noise-shaping in AD converters To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A. Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error a Poisson encoder can sample at frequencies, fs , higher than Nyquist, fN – hence, the term oversampling, Figure 2B. When combined with decoding by lowpass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows. The number of samples over a Nyquist half-period (1/2fN) is given by the oversampling ratio: . As the normalized signal amplitude, , stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, x. For a Poisson process the variance in the number of spikes is equal to the mean, . Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period: , (1) indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error. Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is encoding error. B. Digital output of oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in oversampled case. D. Basic  modulator. E. Signal at the output of the integrator. F. Digital output of the  modulator over one Nyquist period. G. Error power spectrum of the  modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for Poisson encoder. To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noiseshaping was put forward [17]. To illustrate noise-shaping consider a basic AD converter called  [18], Figure 2D. In the basic  modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time. Rather than quantizing the input signal, as would be done in the Poisson encoder,  modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic  modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most, , which is better than the Poisson encoder. To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range resulting in the lower error power below Nyquist. Second, a more detailed analysis of the basic  modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of the white quantization noise, such analysis yields: , (2) which for R >> 1 is significantly better performance than for the Poisson encoder, Eq.(1). As mentioned previously, the basic  modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation. In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the -IF analogy, we look for the signs of noise-shaping AD conversion in real neurons. 3 . Experimental evidence of noise-shaping AD conversion in real neurons In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron. Testing the noise-shaping AD conversion hypothesis is complicated by the fact that encoded and decoded signals are hard to measure accurately. First, as somatic current is rectified by the spikegeneration mechanism, only its super-threshold component can be encoded faithfully making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings. In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At the frequencies below 20Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter. Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron. 4 somatic current biological neuron error Poisson encoder error I&F; neuron error 10 1 10 0 Spectral power, a.u. Spectral power, a.u. 10 3 10 -1 10 -2 10 -3 10 2 10 -4 10 0 10 20 30 40 50 60 Frequency [Hz] 70 80 90 0 10 20 30 40 50 60 70 80 90 100 Frequency [Hz] Figure 3. Evidence of noise-shaping. Power spectra of the somatic current (blue), difference between the somatic current and the digital spike train of the biological neuron (black), of the Poisson encoder (green) and of the IF neuron (red). Left: datset 1, right: dataset 3. Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error. Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in suprathreshold regime, i.e. with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification for now we focused on neurons which were in suprathreshold regime by monitoring that the relationship between predicted and actual current is close to linear. 2 10 C D 1 10 somatic current biological neuron error Poisson encoder error Spectral power, a.u. Spectral power, a.u. I&F; neuron error 3 10 0 10 -1 10 -2 10 -3 10 2 10 -4 0 10 20 30 40 50 60 Frequency [Hz] 70 80 90 10 0 10 20 30 40 50 60 70 80 90 100 Frequency [Hz] Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of somatic current (blue), resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A, is representative of other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra of the somatic current (blue), the decdoding error of the biological neuron (black), the Poisson encoder (green), and IF neuron (red) for dataset 1 (C) dataset 3 (D). Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed information capacity of different encoders: where S(f) and N(f) are the power spectra of the somatic current and encoding error correspondingly and the sum is computed only over the frequencies for which S(f) > N(f). Because the plots in Figure 4C,D use semi-logrithmic scale, the information capacity can be estimated from the area between a somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has higher information capacity than the Poisson encoder and IF neurons. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons. We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27]. The predictive nature of the encoder can be captured by a  modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter with most of its weight in the time following a spike, Figure 5A right. Of course such prediction is only possible for correlated inputs implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters , thus avoiding overloading. A possible biological implementation for such integrating feedback could be Ca2+ 2+ concentration and Ca dependent potassium channels [25, 29]. Figure 5. Enhanced  modulators. A.  modulator combined with predictive coder. In such device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to experimental measurements, Figure 4B. B. Second-order  modulator possesses stronger noise-shaping properties. Because such circuit contains an internal state variable it generates a non-periodic spike train in response to a constant input. Bottom trace shows a typical result of a simulation. Black – spikes, blue – input current. 4 . Possible reasons for current rectification: energy efficiency and de-noising We have shown that at high firing rates biological neurons encode somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising. It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32] placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error in the decoded spike train relative to the input signal and the total number of spikes, as expressed by the following cost function over a time interval T: , (3) where x is the analog input signal, s is the binary spike sequence composed of zeros and ones, and is the linear filter. To demonstrate how solving Eq.(3) would lead to thresholding, let us consider a simplified version taken over a Nyquist period, during which the input signal stays constant: (4) where and normalized by w. Minimizing such a cost function reduces to choosing the lowest lying parabola for a given , Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq.(3). In addition to energy efficiency, there may be a computational reason for thresholding somatic current in neurons. To illustrate this point, we note that the cost function in Eq. (3) for continuous variables, st, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such cost function can be minimized by iteratively applying a gradient descent and a shrinkage steps [35], which is equivalent to thresholding (one-sided in case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal. Figure 6. Possible reasons for rectification in neurons. A. Cost function combining encoding error squared with metabolic expense vs. input signal for different values of the spike number N, Eq.(4). Note that the optimal number of spikes jumps from zero to one as a function of input. B. Estimating most probable “clean” signal value for continuous non-negative Laplacian signal and Gaussian noise, Eq.(3) (while setting w = 1). The parabolas (red) illustrate the quadratic loglikelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s , and grows linearly with s, if s >. 5 . Di scu ssi on In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or basic , we suggest that neurons should be modeled by more advanced  modulators, e.g. Figure 5B. Interestingly,  modulators can be also viewed as coders with error prediction feedback [19]. Many publications studied various aspects of spike generation in neurons yet we believe that the framework [13-15] we adopt is different and discuss its relationship to some of the studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with a more practical issue of how reconstruction error scales with the over-sampling ratio. Also, we consider linear decoding which sets our work apart from [39]. Finally, previous experiments addressing noiseshaping [40] studied the power spectrum of the spike train rather than that of the encoding error. Our work is aimed at understanding biological and computational principles of spike-generation and decoding and is not meant as a substitute for the existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42]. Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: a mean squared error between the injected current waveform and the filtered version of the spike train produced by a model provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43]. As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode is known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study. We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI). R e f e re n c e s 1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7. 2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20. 3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1. 4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96. 5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization. Curr Opin Neurobiol, 1994. 4(4): p. 569-79. 6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86. 7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14. 8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous timevarying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55. 9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97. 10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9. 11. Mainen, Z.F. and T.J. Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6. 12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5. 13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381. 14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175. 15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919. 16. Schreier, R. and G.C. Temes, Understanding delta-sigma data converters2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p. 17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. Inose, H., Y. Yasuda, and J. Murakami, A telemetring system code modulation -  modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE TRans. Commun. Sys., 1962: p. 373-380. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems2002. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networkds, 1993: p. 489-493. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90. Nagel, K.I. and R.I. Wilson, personal communication. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the timevarying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2. Tewksbury, S.K. and R.W. Hallock, Oversample, linear predictive and noiseshaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS25: p. 436-47. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. Siam Journal on Scientific Computing, 1998. 20(1): p. 33-61. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. P SOc Photo-Opt Ins, 2007. 6701: p. 70102. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91. Yu, A.J., Optimal Change-Detection and Spinking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, 2004. 51(10). Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.

6 0.67010468 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

7 0.65137625 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

8 0.64526135 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

9 0.63016105 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation

10 0.62836283 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

11 0.61159611 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

12 0.60587603 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

13 0.59754014 12 nips-2012-A Neural Autoregressive Topic Model

14 0.57434273 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

15 0.57276791 195 nips-2012-Learning visual motion in recurrent neural networks

16 0.55876791 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

17 0.55793613 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity

18 0.55595958 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

19 0.55557978 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

20 0.54641098 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.066), (17, 0.033), (21, 0.054), (38, 0.108), (42, 0.033), (54, 0.026), (55, 0.039), (61, 0.013), (74, 0.053), (76, 0.1), (77, 0.237), (80, 0.106), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84313202 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

same-paper 2 0.79545915 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . , T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

3 0.78493428 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

4 0.73370278 65 nips-2012-Cardinality Restricted Boltzmann Machines

Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams

Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1

5 0.70494354 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

6 0.63902074 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

7 0.63754237 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

8 0.6367327 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

9 0.63600546 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.63404655 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.63288206 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

12 0.63193113 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

13 0.63161075 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

14 0.63037217 190 nips-2012-Learning optimal spike-based representations

15 0.62933517 23 nips-2012-A lattice filter model of the visual pathway

16 0.62913954 197 nips-2012-Learning with Recursive Perceptual Representations

17 0.62909585 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.62819755 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

19 0.62716079 168 nips-2012-Kernel Latent SVM for Visual Recognition

20 0.62696511 200 nips-2012-Local Supervised Learning through Space Partitioning