Author: Anthony J. Bell, Lucas C. Parra
Abstract: We use unsupervised probabilistic machine learning ideas to try to explain the kinds of learning observed in real neurons, the goal being to connect abstract principles of self-organisation to known biophysical processes. For example, we would like to explain Spike Timing-Dependent Plasticity (see [5,6] and Figure 3A) in terms of information theory. Starting out, we explore the optimisation of a network sensitivity measure related to maximising the mutual information between input spike timings and output spike timings. Our derivations are analogous to those in ICA, except that the sensitivity of output timings to input timings is maximised, rather than the sensitivity of output 'firing rates' to inputs. ICA and related approaches have been successful in explaining the learning of many properties of early visual receptive fields in rate-coding models, and we are hoping for similar gains in understanding of spike coding in networks, and of how this is supported, in principled probabilistic ways, by cellular biophysical processes. For now, in our initial simulations, we show that our derived rule can learn synaptic weights which can unmix, or demultiplex, mixed spike trains. That is, it can recover independent point processes embedded in distributed, correlated input spike trains, using an adaptive single-layer feedforward spiking network.

1 Maximising Sensitivity.

In this section, we will follow the structure of the ICA derivation [4] in developing the spiking theory. We cannot claim, as before, that this gives us an information maximisation algorithm, for reasons that we will delay addressing until Section 3. But for now, to first develop our approach, we will explore an interim objective function called sensitivity, which we define as the log Jacobian of how input spike timings affect output spike timings.

1.1 How to maximise the effect of one spike timing on another.

Consider a spike in neuron j at time t_l that has an effect on the timing of another spike in neuron i at time t_k. The neurons are connected by a weight w_ij. We use i and j to index neurons, and k and l to index spikes, but sometimes for convenience we will use spike indices in place of neuron indices. For example, w_kl, the weight between an input spike l and an output spike k, is naturally understood to be just the corresponding w_ij.

[Figure 1: Firing time t_k is determined by the time of threshold crossing. A change of an input spike time dt_l affects, via a change du of the membrane potential u(t), the time of the output spike by dt_k. (The sketch shows u(t) rising from the resting potential to the threshold potential, with input spikes at times t_l and the output spike at t_k, and the response R(t) of a single input.)]

In the simplest version of the Spike Response Model [7], spike l has an effect on spike k that depends on the time-course of the evoked EPSP or IPSP, which we write as R_kl(t_k − t_l). In general, this R_kl models both synaptic and dendritic linear responses to an input spike, and thus models synapse type and location. For learning, we need only consider the value of this function when an output spike, k, occurs. In this model, depicted in Figure 1, a neuron adds up its spiking inputs until its membrane potential, u_i(t), reaches threshold at time t_k. This threshold value we will often, again for convenience, write as u_k ≡ u_i(t_k, {t_l}), and it is given by a sum over input spikes l:

u_k = \sum_l w_{kl} R_{kl}(t_k - t_l) .    (1)
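To make equation (1) and Figure 1 concrete, here is a minimal numerical sketch (ours, not part of the paper's simulations): the membrane potential is a weighted sum of spike responses R_kl, taken here, purely as an assumption, to be alpha functions, and the output spike time t_k is found as the first threshold crossing. The kernel shape, threshold and all constants below are illustrative choices, not values from the paper.

```python
import numpy as np

def epsp(s, tau=5.0):
    """Illustrative alpha-function spike response R(s) for s > 0 (ms); zero before the spike."""
    return np.where(s > 0, (s / tau) * np.exp(1.0 - s / tau), 0.0)

def output_spike_time(input_times, weights, threshold=1.0, dt=0.01, t_max=50.0):
    """Membrane potential u(t) = sum_l w_l R(t - t_l), as in Eq. (1);
    the output spike time t_k is the first time u(t) reaches threshold."""
    t = np.arange(0.0, t_max, dt)
    u = np.zeros_like(t)
    for t_l, w_l in zip(input_times, weights):
        u += w_l * epsp(t - t_l)
    crossed = np.nonzero(u >= threshold)[0]
    return (t[crossed[0]] if crossed.size else None), u

# Shifting any input time t_l shifts the threshold crossing, i.e. the output time t_k.
t_k, _ = output_spike_time([2.0, 4.0, 7.0], [0.5, 0.6, 0.4])
print("t_k =", t_k, "ms")
```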
To maximise timing sensitivity, we need to determine the effect of a small change in the input firing time t_l on the output firing time t_k. (A related problem is tackled in [2].) When t_l is changed by a small amount dt_l, the membrane potential will change as a result. This change in the membrane potential leads to a change in the time of threshold crossing, dt_k. The contribution to the membrane potential, du, due to dt_l is (∂u_k/∂t_l) dt_l, and the change in du corresponding to a change dt_k is (∂u_k/∂t_k) dt_k. We can relate these two effects by noting that the total change of the membrane potential, du, has to vanish, because u_k is defined as the potential at threshold, ie:

du = \frac{\partial u_k}{\partial t_k} dt_k + \frac{\partial u_k}{\partial t_l} dt_l = 0 .    (2)

This is the total differential of the function u_k = u(t_k, {t_l}), and is a special case of the implicit function theorem. Rearranging this:

\frac{dt_k}{dt_l} = - \frac{\partial u_k}{\partial t_l} \Big/ \frac{\partial u_k}{\partial t_k} = - \frac{w_{kl} \dot{R}_{kl}}{\dot{u}_k} .    (3)

Now, to connect with the standard ICA derivation [4], recall the 'rate' (or sigmoidal) neuron, for which y_i = g_i(u_i) and u_i = Σ_j w_ij x_j. For this neuron, the output dependence on input is ∂y_i/∂x_j = w_ij g_i'(u_i), while the learning gradient is:

\frac{\partial}{\partial w_{ij}} \log \left| \frac{\partial y_i}{\partial x_j} \right| = \frac{1}{w_{ij}} - f_i(u_i)\, x_j    (4)

where the 'score functions', f_i, are defined in terms of a density estimate on the summed inputs: f_i(u_i) = -\frac{\partial}{\partial u_i} \log g_i' = -\frac{\partial}{\partial u_i} \log \hat{p}(u_i).

The analogous learning gradient for the spiking case, from (3), is:

\frac{\partial}{\partial w_{ij}} \log \left| \frac{dt_k}{dt_l} \right| = \frac{1}{w_{ij}} - \frac{\sum_a j(a)\, \dot{R}_{ka}}{\dot{u}_k}    (5)

where j(a) = 1 if spike a came from neuron j, and 0 otherwise.

Comparing the two cases in (4) and (5), we see that the input variable x_j has become the temporal derivative of the sum of the EPSPs coming from synapse j, and the output variable (or score function) f_i(u_i) has become u̇_k^{-1}, the inverse of the temporal derivative of the membrane potential at threshold. It is intriguing (A) to see this quantity appear as analogous to the score function in the ICA likelihood model, and (B) to speculate that experiments could show that this 'voltage slope at threshold' is a hidden factor in STDP data, explaining some of the scatter in Figure 3A. In other words, an STDP datapoint should lie on a 2-surface in a 3D space of {∆w, ∆t, u̇_k}. Incidentally, u̇_k shows up in any learning rule optimising an objective function involving output spike timings.

1.2 How to maximise the effect of N spike timings on N other ones.

Now we deal with the case of a 'square' single-layer feedforward mapping between spike timings. There can be several input and output neurons, but here we ignore which neurons are spiking, and just look at how the input timings affect the output timings. This is captured in a Jacobian matrix of all timing dependencies, which we call T. The entries of this matrix are T_kl ≡ ∂t_k/∂t_l. A multivariate version of the sensitivity measure introduced in the previous section is the log of the absolute determinant of the timing matrix, ie: log |T|.

The full derivation of the gradient ∇_W log |T| is in the Appendix. Here, we again draw out the analogy between square ICA [4] and this gradient, as follows. Square ICA with a network y = g(Wx) is:

\Delta W \propto \nabla_W \log |J| = W^{-T} - f(u)\, x^T    (6)

where the Jacobian J has entries ∂y_i/∂x_j, and the score functions are now f_i(u) = -\frac{\partial}{\partial u_i} \log \hat{p}(u) for the general likelihood case, with \hat{p}(u) = \prod_i g_i' being the special case of ICA. We will now split the gradient in (6) according to the chain rule:

\nabla_W \log |J| = [\nabla_J \log |J|] \otimes [\nabla_W J]    (7)
                 = J^{-T} \otimes J_{kl} \left[ \frac{i(k)\, j(l)}{w_{kl}} - f_k(u)\, x_j \right]    (8)

In this equation, i(k) = δ_ik and j(l) = δ_jl. The right-hand term is a 4-tensor with entries ∂J_kl/∂w_ij, and ⊗ is defined as (A ⊗ B)_{ij} = Σ_kl A_kl B_klij. We write the gradient this way to preserve, in the second term, the independent structure of the 1 → 1 gradient term in (4), and to separate a difficult derivation into two easy parts.
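As a small numerical illustration of the sensitivity objective just defined (our own sketch, not the paper's code), the entries T_kl = ∂t_k/∂t_l of a toy 'square' timing matrix can be filled in from (3), using an assumed alpha-function response kernel, and the sensitivity is then log |det T|. All weights, slopes and spike times below are arbitrary illustrative values.

```python
import numpy as np

def epsp_dot(s, tau=5.0):
    """Temporal derivative dR/ds of an illustrative alpha-function response R(s) = (s/tau) exp(1 - s/tau)."""
    return np.where(s > 0, (1.0 / tau) * np.exp(1.0 - s / tau) * (1.0 - s / tau), 0.0)

def timing_jacobian(t_out, t_in, W, udot):
    """Timing Jacobian with entries T_kl = -w_kl * Rdot_kl / udot_k, following Eq. (3).
    W[k, l]: coupling from input spike l to output spike k (the corresponding w_ij);
    udot[k]: slope of the membrane potential at the k-th output spike time."""
    T = np.zeros((len(t_out), len(t_in)))
    for k, tk in enumerate(t_out):
        for l, tl in enumerate(t_in):
            T[k, l] = -W[k, l] * epsp_dot(tk - tl) / udot[k]
    return T

# 'Square' toy case: 3 input and 3 output spike times, arbitrary couplings and slopes.
t_in, t_out = [2.0, 4.0, 7.0], [5.5, 8.0, 11.0]
W = np.array([[0.5, 0.6, 0.4], [0.3, 0.7, 0.5], [0.6, 0.2, 0.8]])
udot = np.array([0.20, 0.30, 0.25])
T = timing_jacobian(t_out, t_in, W, udot)
print(np.log(abs(np.linalg.det(T))))   # the sensitivity log |T| of Section 1.2
```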
The structure of (8) holds up when we move to the spiking case, giving:

\nabla_W \log |T| = [\nabla_T \log |T|] \otimes [\nabla_W T]    (9)
                 = T^{-T} \otimes T_{kl} \left[ \frac{i(k)\, j(l)}{w_{kl}} - \frac{\sum_a j(a)\, \dot{R}_{ka}}{\dot{u}_k} \right]    (10)

where i(k) is now defined as being 1 if spike k occurred in neuron i, and 0 otherwise; j(l) and j(a) are analogously defined. Because the T matrix is much bigger than the J matrix, and because its entries are more complex, here the similarity ends. When (10) is evaluated for a single weight influencing a single spike coupling (see the Appendix for the full derivation), it yields:

\Delta w_{kl} \propto \frac{\partial \log |T|}{\partial w_{kl}} = \frac{T_{kl}}{w_{kl}} \left( [T^{-1}]_{lk} - 1 \right) .    (11)

This is a non-local update involving a matrix inverse at each step. In the ICA case of (6), such an inverse was removed by the Natural Gradient transform (see [1]), but in the spike-timing case this has turned out not to be possible, because of the additional asymmetry introduced into the T matrix (as opposed to the J matrix) by the Ṙ_kl term in (3).

2 Results.

Nonetheless, this learning rule can be simulated. It requires running the network for a while to generate spikes (and a corresponding T matrix), and then, for each input/output spike coupling, the corresponding synapse is updated according to (11). When this is done, and the weights learn, it is clear that something has been sacrificed by ignoring the issue of which neurons are producing the spikes. Specifically, the network will often put all the output spikes on one output neuron, with the rates of the others falling to zero. It is happy to do this if a large log |T| can thereby be achieved, because we have not included this 'which neuron' information in the objective. We will address these and other problems in Section 3, but now we report on our simulation results on demultiplexing.

2.1 Demultiplexing spike trains.

An interesting possibility in the brain is that 'patterns' are embedded in spatially distributed spike timings that are input to neurons. Several patterns could be embedded in single input trains. This is called multiplexing. To extract and propagate these patterns, the neurons must demultiplex these inputs using their threshold nonlinearity. Demultiplexing is the 'point process' analog of the unmixing of independent inputs in ICA. We have been able to robustly achieve demultiplexing, as we now report.

We simulated a feedforward network with 3 integrate-and-fire neurons and inputs from 3 presynaptic neurons. Learning followed (11), where we replace the inverse by the pseudoinverse computed on the spikes generated during 0.5 s. The pseudoinverse is necessary because, even though on average the learning matches the number of output spikes to the number of input spikes, the matrix T is still not usually square, and so its actual inverse cannot be taken. In addition, in these simulations, an additional term is introduced in the learning to make sure all the output neurons fire with equal probability. This partially counters our ignoring of the 'which neuron' information, as explained above. Assuming a Poisson spike count n_i for the i-th output neuron, with equal target firing rates n̄_i, it is easy to derive an approximate term that will control the spike count, Σ_i (n̄_i − n_i). The target firing rates n̄_i were set to match the "source" spike trains in this example. The network learns to demultiplex mixed spike trains, as shown in Figure 2. This demultiplexing is a robust property of learning using (11) with this new spike-controlling term.

[Figure 2: Unmixed spike trains. The input (top left) are 3 spike trains which are a mixture of three independent Poisson processes (bottom left). The network unmixes the spike trains to approximately recover the original (center left). In this example 19 spikes correspond to the original, with 4 deletions and 2 insertions. The two panels at the right show the mixing (top) and the synaptic weight matrix after training (bottom). Time axis in ms.]
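Before turning to the timing dependence of this learning, here is a schematic sketch of one such update (ours; the paper's actual simulation code and constants are not given): the pseudoinverse stands in for T^{-1} in (11), per-spike-pair gradients are accumulated onto the corresponding synapse w_ij, and a simple proportional nudge toward the target spike counts n̄_i stands in for the spike-count-controlling term, whose exact form here is our assumption.

```python
import numpy as np

def learning_step(W, T, out_neuron_of_spike, in_neuron_of_spike,
                  n_out, n_target, lr=0.01, lr_rate=0.001):
    """One weight update for a 0.5 s window of spikes, following Eq. (11).
    W[i, j]:                weight from input neuron j to output neuron i
    T[k, l]:                timing Jacobian for the spikes of this window (usually not square)
    out_neuron_of_spike[k]: neuron index i that produced output spike k
    in_neuron_of_spike[l]:  neuron index j that produced input spike l
    n_out, n_target:        observed and target spike counts per output neuron."""
    T_pinv = np.linalg.pinv(T)                      # pseudoinverse replaces T^{-1}
    dW = np.zeros_like(W)
    for k, i in enumerate(out_neuron_of_spike):
        for l, j in enumerate(in_neuron_of_spike):
            if W[i, j] != 0.0:
                # Delta w_kl ~ (T_kl / w_kl) * ([T^+]_lk - 1), accumulated per synapse (i, j)
                dW[i, j] += (T[k, l] / W[i, j]) * (T_pinv[l, k] - 1.0)
    # Assumed spike-count-controlling term: push each output neuron toward its target count.
    dW += lr_rate * (np.asarray(n_target) - np.asarray(n_out))[:, None]
    return W + lr * dW
```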
Finally, what about the spike-timing dependence of the observed learning? Does it match experimental results? The comparison is made in Figure 3, and the answer is no. There is a timing-dependent transition between depression and potentiation in our result in Figure 3B, but it is not a sharp transition like the experimental result in Figure 3A. In addition, it does not transition at zero (ie: when t_k − t_l = 0), but at a time offset by the rise time of the EPSPs. In earlier experiments, in which we transformed the gradient in (11) by an approximate inverse Hessian to get an approximate Natural Gradient method, a sharp transition did emerge in simulations. However, the approximate inverse Hessian was singular, and we had to de-emphasise this result. It does suggest, however, that if the Natural Gradient transform can be usefully done on some variant of this learning rule, it may well be what accounts for the sharp transition effect of STDP.

[Figure 3: Dependence of synaptic modification on pre/post inter-spike interval. Left (A), STDP: from Froemke & Dan, Nature (2002); dependence of synaptic modification on pre/post inter-spike interval in cat L2/3 visual cortical pyramidal cells in slice, with naturalistic spike trains; each point represents one experiment. Right (B), Gradient: according to Equation (11); each point corresponds to a spike pair among approximately 100 input and 100 output spikes. Horizontal axes: ∆t (ms); vertical axes: ∆w/w (%) and ∆w (a.u.).]

3 Discussion

Although these derivations started out smoothly, the reader possibly shares the authors' frustration at the approximations involved here. Why isn't this simple, like ICA? Why don't we just have a nice maximum spikelihood model, ie: a density estimation algorithm for multivariate point processes, as ICA was a model in continuous space? We are going to be explicit about the problems now, and will propose a direction where the solution may lie. The over-riding problem is that we are unable to claim that, in maximising log |T|, we are maximising the mutual information between inputs and outputs, because:

1. The Invertibility Problem. Algorithms such as ICA which maximise log Jacobians can only be called Infomax algorithms if the network transformation is both deterministic and invertible. The Spike Response Model is deterministic, but it is not invertible in general. When not invertible, the key formula (considering here vectors of input and output timings, t_in and t_out) is transformed from simple to complex, ie:

p(\mathbf{t}_{out}) = \frac{p(\mathbf{t}_{in})}{|T|} \quad \text{becomes} \quad p(\mathbf{t}_{out}) = \int_{\text{solns } \mathbf{t}_{in}} \frac{p(\mathbf{t}_{in})}{|T|} \, d\mathbf{t}_{in}    (12)

Thus, when not invertible, we need to know the Jacobians of all the inputs that could have caused an output (called here 'solns'), something we simply don't know.

2. The 'Which Neuron' Problem. Instead of maximising the mutual information I(t_out, t_in), we should be maximising I(ti_out, ti_in), where the vector ti is the timing
vector, t, with the vector, i, of corresponding neuron indices, concatenated. Thus, 'who spiked?' should be included in the analysis, as it is part of the information.

3. The Predictive Information Problem. In ICA, since there was no time involved, we did not have to worry about mutual information over time between inputs and outputs. But in the spiking model, output spikes may well have (predictive) mutual information with future input spikes, as well as the usual (causal) mutual information with past input spikes. The former has been entirely missing from our analysis so far.

These temporal and spatial information dependencies, missing in our analysis so far, are thrown into a different light by a single empirical observation, which is that Spike Timing-Dependent Plasticity is not just a feedforward computation like the Spike Response Model. Specifically, there must be at least a statistical, if not a causal, relation between a real synapse's plasticity and its neuron's output spike timings, for Figure 3B to look like it does. It seems we have to confront the need for both a 'memory' (or reconstruction) model, such as the T we have thus far dealt with, in which output spikes talk about past inputs, and a 'prediction' model, in which they talk about future inputs.

This is most easily understood from the point of view of Barber & Agakov's variational Infomax algorithm [3]. They argue for optimising a lower bound on mutual information, which, for our neurons, would be expressed using an inverse model p̂, as follows:

\tilde{I}(\mathbf{ti}_{in}, \mathbf{ti}_{out}) = H(\mathbf{ti}_{in}) - \left\langle -\log \hat{p}(\mathbf{ti}_{in} \mid \mathbf{ti}_{out}) \right\rangle_{p(\mathbf{ti}_{in}, \mathbf{ti}_{out})} \leq I(\mathbf{ti}_{in}, \mathbf{ti}_{out})    (13)

In a feedforward model, H(ti_in) may be disregarded in taking gradients, leading us to the optimisation of a 'memory-prediction' model p̂(ti_in | ti_out), related to something supposedly happening in dendrites, somas and at synapses.

In trying to guess what this might be, it would be nice if the math worked out. We need a square Jacobian matrix, T, so that |T| = p̂(ti_in | ti_out) can be our memory/prediction model. Now let's rename our feedforward timing Jacobian T ('up the dendritic trees') as →T, and let's fantasise that there is some, as yet unspecified, feedback Jacobian ←T ('down the dendritic trees'), which covers electrotonic influences as they spread from soma to synapse, and with which →T can be combined by some operation '⊗' to make things square. Imagine further that doing this yields a memory/prediction model on the inputs. Then the T we are looking for is →T ⊗ ←T, and the memory-prediction model is: p̂(ti_in | ti_out) = |→T ⊗ ←T|.

Ideally, the entries of →T should be as before, ie: →T_kl = ∂t_k/∂t_l. What should the entries of ←T be? Becoming just one step more concrete, suppose ←T had entries ←T_lk = ∂c_l/∂t_k, where c_l is some, as yet unspecified, value, or process, occurring at an input synapse when spike l comes in. What seems clear is that ⊗ should combine the correctly tensorised forms of →T and ←T (giving them each 4 indices ijkl), so that T = →T ⊗ ←T sums over the spikes k and l to give an I × J matrix, where I is the number of output neurons, and J the number of input neurons. Then our quantity, T, would represent all dependencies of input neuronal activity on output activity, summed over spikes.
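Since the '⊗' operation is deliberately left unspecified, the following sketch (ours, purely illustrative) shows only one possible reading of the index bookkeeping: the defined contraction of (7), and a speculative version in which both tensorised Jacobians carry the four indices ijkl and the spike indices k and l are summed out to leave an I × J neuron-to-neuron matrix.

```python
import numpy as np

# Defined use of 'circle-times' in Eq. (7): contract a (spikes x spikes) matrix A
# against a 4-tensor B with entries B[k, l, i, j] = dJ_kl / dw_ij.
def otimes_eq7(A, B):
    return np.einsum('kl,klij->ij', A, B)           # (A (x) B)_ij = sum_kl A_kl B_klij

# Speculative combination of Section 3 (one possible reading): tensorised forward and
# backward timing Jacobians, each carrying indices (i, j, k, l), summed over the spike
# indices k and l to leave an I x J matrix of neuron-to-neuron dependencies.
def otimes_speculative(T_fwd, T_bwd):
    return np.einsum('ijkl,ijkl->ij', T_fwd, T_bwd)

# Shapes only, with random placeholders: I output neurons, J input neurons,
# K output spikes, L input spikes.
I, J, K, L = 3, 3, 10, 12
rng = np.random.default_rng(1)
print(otimes_eq7(rng.normal(size=(K, L)), rng.normal(size=(K, L, I, J))).shape)               # (3, 3)
print(otimes_speculative(rng.normal(size=(I, J, K, L)), rng.normal(size=(I, J, K, L))).shape)  # (3, 3)
```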
Further, we imagine that ←T contains reverse (feedback) electrotonic transforms from soma to synapse, ←R_lk, that are somehow symmetrically related to the feedforward Spike Responses from synapse to soma, which we now rename →R_kl. Thinking for a moment in terms of somatic k and synaptic l, voltages V, currents I and linear cable theory, the synapse-to-soma transform →R_kl would be related to an impedance in V_k = I_l →Z_kl, while the soma-to-synapse transform ←R_lk would be related to an admittance in I_l = V_k ←Y_lk [8]. The symmetry in these equations is that →Z_kl is just the inverse conjugate of ←Y_lk.

Finally, then, what is c_l? And what is its relation to the calcium concentration, [Ca2+]_l, at a synapse, when spike l comes in? These questions naturally follow from considering the experimental data, since it is known that the calcium level at synapses is the critical integrating factor in determining whether potentiation or depression occurs [5].

4 Appendix: Gradient of log |T| for the full Spike Response Model.

Here we give full details of the gradient for Gerstner's Spike Response Model [7]. This is a general model for which Integrate-and-Fire is a special case. In this model the effect of a presynaptic spike at time t_l on the membrane potential at time t is described by a postsynaptic potential or spike response, which may also depend on the time that has passed since the last output spike t_{k−1}; hence the spike response is written as R(t − t_{k−1}, t − t_l). This response is weighted by the synaptic strength w_l. Excitatory or inhibitory synapses are determined by the sign of w_l. Refractoriness is incorporated by adding a hyperpolarizing contribution (spike-afterpotential) to the membrane potential in response to the last preceding spike, η(t − t_{k−1}). The membrane potential as a function of time is therefore given by

u(t) = \eta(t - t_{k-1}) + \sum_l w_l R(t - t_{k-1}, t - t_l) .    (14)

We have ignored here potential contributions from external currents, which can easily be included without modifying the following derivations. The output firing times t_k are defined as the times for which u(t) reaches firing threshold from below. We consider a dynamic threshold, ϑ(t − t_{k−1}), which may depend on the time since the last spike t_{k−1}; together, the output spike times are then defined implicitly by:

t = t_k :\quad u(t) = \vartheta(t - t_{k-1}) \quad \text{and} \quad \frac{du(t)}{dt} > 0 .    (15)

For this more general model, T_kl is given by

T_{kl} = \frac{dt_k}{dt_l} = -\left( \frac{\partial u}{\partial t_k} - \frac{\partial \vartheta}{\partial t_k} \right)^{-1} \frac{\partial u}{\partial t_l} = \frac{w_{kl}\, \dot{R}(t_k - t_{k-1}, t_k - t_l)}{\dot{u}(t_k) - \dot{\vartheta}(t_k - t_{k-1})} ,    (16)

where Ṙ(s, t), u̇(t) and ϑ̇(t) are derivatives with respect to t. The dependence of T_kl on t_{k−1} should be implicitly assumed; it has been omitted to simplify the notation.

Now we compute the derivative of log |T| with respect to w_kl. For any matrix T we have ∂ log |T| / ∂T_ab = [T^{-1}]_ba. Therefore:

\frac{\partial \log |T|}{\partial w_{kl}} = \sum_{ab} \frac{\partial \log |T|}{\partial T_{ab}} \frac{\partial T_{ab}}{\partial w_{kl}} = \sum_{ab} [T^{-1}]_{ba} \frac{\partial T_{ab}}{\partial w_{kl}} .    (17)

Utilising the Kronecker delta δ_ab (= 1 if a = b, else 0), the derivative of (16) with respect to w_kl gives:

\frac{\partial T_{ab}}{\partial w_{kl}} = \frac{\partial}{\partial w_{kl}} \left[ \frac{w_{ab}\, \dot{R}(t_a - t_{a-1}, t_a - t_b)}{\dot{\eta}(t_a - t_{a-1}) + \sum_c w_{ac}\, \dot{R}(t_a - t_{a-1}, t_a - t_c) - \dot{\vartheta}(t_a - t_{a-1})} \right]
                                       = \delta_{ak} \delta_{bl} \frac{\dot{R}(t_a - t_{a-1}, t_a - t_b)}{\dot{u}(t_a) - \dot{\vartheta}(t_a - t_{a-1})} - \delta_{ak} \frac{w_{ab}\, \dot{R}(t_a - t_{a-1}, t_a - t_b)\, \dot{R}(t_a - t_{a-1}, t_a - t_l)}{\left( \dot{u}(t_a) - \dot{\vartheta}(t_a - t_{a-1}) \right)^2}
                                       = \delta_{ak}\, T_{ab} \left( \frac{\delta_{bl}}{w_{ab}} - \frac{T_{al}}{w_{al}} \right) .    (18)

Therefore:

\frac{\partial \log |T|}{\partial w_{kl}} = \sum_{ab} [T^{-1}]_{ba}\, \delta_{ak}\, T_{ab} \left( \frac{\delta_{bl}}{w_{ab}} - \frac{T_{al}}{w_{al}} \right) = \frac{T_{kl}}{w_{kl}} [T^{-1}]_{lk} - \frac{T_{kl}}{w_{kl}} \sum_b [T^{-1}]_{bk}\, T_{kb}    (19)
                                          = \frac{T_{kl}}{w_{kl}} \left( [T^{-1}]_{lk} - 1 \right) .    (20)
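As a quick numerical sanity check of (20) (our own sketch, not from the paper), one can parametrise a small square T exactly as in (16), with random positive numbers standing in for the slopes Ṙ, η̇ and ϑ̇, and compare the analytic gradient with a finite-difference estimate of log |det T|:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                    # toy square case: N input and N output spikes
W = rng.uniform(0.5, 1.5, (N, N))        # synaptic couplings w_ab
Rdot = rng.uniform(0.5, 1.5, (N, N))     # stand-ins for Rdot(t_a - t_{a-1}, t_a - t_b)
eta_dot = rng.uniform(0.1, 0.3, N)       # stand-in for the spike-afterpotential slope at t_a
theta_dot = rng.uniform(0.0, 0.1, N)     # stand-in for the dynamic-threshold slope at t_a

def T_of(W):
    """T_ab = w_ab Rdot_ab / (udot_a - thetadot_a), with udot_a = etadot_a + sum_c w_ac Rdot_ac, as in Eq. (16)."""
    D = eta_dot + (W * Rdot).sum(axis=1) - theta_dot
    return W * Rdot / D[:, None]

T = T_of(W)
analytic = (T / W) * (np.linalg.inv(T).T - 1.0)   # Eq. (20): (T_kl / w_kl)([T^-1]_lk - 1)

k, l, eps = 1, 2, 1e-6                   # finite-difference check of one entry
W_eps = W.copy()
W_eps[k, l] += eps
numeric = (np.log(abs(np.linalg.det(T_of(W_eps)))) -
           np.log(abs(np.linalg.det(T)))) / eps
print(analytic[k, l], numeric)           # the two should agree closely
```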
Acknowledgments

We are grateful for inspirational discussions with Nihat Ay, Michael Eisele, Hong Hui Yu, Jim Crutchfield, Jeff Beck, Surya Ganguli, Sophie Deneve, David Barber, Fabian Theis, Tony Zador and Arunava Banerjee. AJB thanks all RNI colleagues for many such discussions.

References

[1] Amari S-I. 1998. Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.
[2] Banerjee A. 2001. On the phase-space dynamics of systems of spiking neurons. Neural Computation, 13, 161-225.
[3] Barber D. & Agakov F. 2003. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems 16, MIT Press.
[4] Bell A.J. & Sejnowski T.J. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.
[5] Dan Y. & Poo M-m. 2004. Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23-30.
[6] Froemke R.C. & Dan Y. 2002. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433-438.
[7] Gerstner W. & Kistler W.M. 2002. Spiking Neuron Models. Cambridge Univ. Press.
[8] Zador A.M., Agmon-Snir H. & Segev I. 1995. The morphoelectrotonic transform: a graphical approach to dendritic function. J. Neurosci., 15(3), 1669-1682.