nips nips2004 nips2004-174 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. [sent-10, score-0.78]
2 We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. [sent-12, score-0.281]
3 At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. [sent-13, score-0.635]
4 At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. [sent-14, score-0.523]
5 1 Introduction Neural spike activity is recorded with a micro-electrode which normally picks up the activity of multiple neurons. [sent-18, score-0.329]
6 Spike sorting seeks the segmentation of the spike data such that each cluster contains all the spikes generated by a different neuron. [sent-19, score-0.646]
7 It is a tedious mission, requiring many hours of human labor for each recording session. [sent-21, score-0.204]
8 Several algorithms were proposed in order to help automate this process (see [7] for a review, and [9], [10]) and some tools were implemented to assist in manual sorting [8]. [sent-22, score-0.418]
9 Other sources of non-stationarity include variable background noise and changes in the characteristic spike generated by a certain neuron. [sent-27, score-0.319]
10 Using the first 2 PCA coefficients to represent the data (which preserves up to 93% of the variance in the original recordings [1]), a human can cluster spikes by visual inspection. [sent-29, score-0.248]
11 2 and include: (1) movements and considerable shape changes of the clusters over time; (2) two clusters which are reasonably well-separated may move until they converge and become indistinguishable. [sent-32, score-0.27]
12 Most spike sorting algorithms do not address the presented difficulties at all, as they assume full stationarity of the data. [sent-34, score-0.564]
13 Some methods [4, 11] try to cope with the lack of stationarity by grouping data into many small clusters and identifying the clusters that can be combined to represent the activity of a single unit. [sent-35, score-0.36]
14 In [3] a semi-automated method is suggested, in which each time frame is clustered manually, and then the correspondence between clusters in consecutive time frames is established automatically. [sent-37, score-0.877]
15 The correspondence is determined by a heuristic score, and the algorithm doesn’t handle merge or split scenarios. [sent-38, score-0.243]
16 In this paper we suggest a new fully automated technique to solve the clustering problem for non-stationary Gaussian sources in a Bayesian framework. [sent-39, score-0.281]
17 We divide the data into short time frames in which stationarity is a reasonable assumption. [sent-40, score-0.366]
18 We then look for good mixture of Gaussians descriptions of the data in each time frame independently. [sent-41, score-0.541]
19 Transition probabilities between local mixture solutions are introduced, and a globally optimal clustering solution is computed by finding the Maximum-A-Posteriori (MAP) solution of the resulting probabilistic model. [sent-42, score-0.467]
20 The global optimization allows the algorithm to successfully disambiguate problematic time frames and exhibit close to human performance. [sent-43, score-0.315]
21 2 Clustering using a chain of Gaussian mixtures Denote the observable spike data by D = {d}, where each spike d ∈ R^n is described by the vector of its PCA coefficients. [sent-47, score-0.667]
22 We assume that in each frame, the data can be well approximated by a mixture of Gaussians, where each Gaussian corresponds to a single neuron. [sent-49, score-0.212]
23 Each Gaussian in the mixture may have a different covariance matrix. [sent-50, score-0.253]
24 The number of components in the mixture is not known a priori, but is assumed to be within a certain range (we used 1-6). [sent-51, score-0.266]
25 In the search stage, we use a standard EM (Expectation-Maximization) algorithm to find a set of M^t candidate mixture descriptions for each time frame t. [sent-52, score-0.607]
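A minimal sketch of this per-frame candidate search, assuming each frame's spikes are available as a NumPy array of PCA coefficients; scikit-learn's GaussianMixture is used as a stand-in for the paper's own EM implementation, and the component range and number of restarts are illustrative choices rather than the authors' settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_candidates(frame_spikes, k_range=range(1, 7), restarts=3, seed=0):
    """Fit several mixture-of-Gaussians descriptions of a single time frame.

    frame_spikes : (n_spikes, n_pca) array of PCA coefficients for frame t.
    Returns a list of fitted GaussianMixture objects (the frame's candidate models).
    """
    candidates = []
    for k in k_range:                          # assumed range of neuron counts (1-6 in the text)
        for r in range(restarts):              # several EM restarts per component count
            gmm = GaussianMixture(n_components=k, covariance_type='full',
                                  random_state=seed + r)
            candidates.append(gmm.fit(frame_spikes))
    return candidates

# usage: one candidate list per short time frame
# all_candidates = [frame_candidates(x) for x in frames]
```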
26 In a second step, we import to each time frame t the best mixture solutions found in the neighboring time frames [t − k, . . . , t + k]. [sent-55, score-0.81]
27 This mixing of solutions between time frames is repeated several times. [sent-59, score-0.35]
28 Finally, the solution list in each time frame is pruned to remove similar solutions. [sent-60, score-0.248]
29 In order to handle outliers, which are usually background spikes or non-spike events, each mixture candidate contains an additional ’background model’ Gaussian. [sent-62, score-0.362]
30 This model's parameters are set to (0, K · Σ^t), where Σ^t is the covariance matrix of the data in frame t and K > 1 is a constant. [sent-63, score-0.236]
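A small illustration of this background component, assuming frame_spikes holds the frame's PCA coefficients; the value of K is a placeholder (the text only requires K > 1).

```python
import numpy as np

def background_component(frame_spikes, K=4.0):
    """Outlier ('background model') Gaussian for frame t: zero mean, inflated covariance K * Sigma_t."""
    sigma_t = np.cov(frame_spikes, rowvar=False)     # covariance of the data in frame t
    return np.zeros(frame_spikes.shape[1]), K * sigma_t
```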
31 After the search stage, each time frame t has a list of M^t models {Θ_j^t}_{j=1}^{M^t}, for t = 1, . . . , T. [sent-65, score-0.248]
32 "z^t = j" has the semantics of "at time frame t the data is distributed according to the candidate mixture Θ_j^t". [sent-70, score-0.526]
33 In addition we define for each spike d_i^t a hidden discrete 'label' random variable l_i^t. [sent-71, score-0.575]
34 This label indicates which Gaussian in the local mixture hypothesis is the source of the spike. [sent-72, score-0.4]
35 Denote by L^t = {l_i^t}_{i=1}^{N^t} the vector of labels of time frame t, and by L the vector of all the labels. [sent-73, score-0.3]
36 (B) A Bayesian network representation of the relations between the data D and the hidden labels H (see Section 3.1). [sent-77, score-0.233]
37 The visible labels L and the sampled data points are independent given the hidden labels. [sent-79, score-0.335]
38 Under the assumption of i.i.d. samples, the joint log probability decomposes into log P(z^1) + Σ_{t=2}^{T} log P(z^t | z^{t−1}) + Σ_{t=1}^{T} Σ_{i=1}^{N^t} [ log P(l_i^t | z^t) + log P(d_i^t | l_i^t, z^t) ]  (1). We wish to maximize this log-likelihood over all possible choices of L, Z. [sent-83, score-0.225]
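Because Z forms a chain, maximizing (1) over the choice of candidate mixtures can be done with Viterbi-style dynamic programming once the per-frame data terms and the transition terms are available. The sketch below assumes those log-quantities have already been computed and stored as lists of arrays; the names and the exact packaging are illustrative, not the authors' implementation.

```python
import numpy as np

def map_candidate_sequence(log_data, log_trans, log_prior):
    """Viterbi-style MAP choice of one candidate mixture per frame.

    log_data  : list of length T; log_data[t][j] is the (label-maximized) data log-likelihood
                of frame t under its j-th candidate mixture.
    log_trans : list of length T-1; log_trans[t][i, j] = log P(z^{t+1} = j | z^t = i).
    log_prior : array; log_prior[j] = log P(z^1 = j).
    Returns the list of chosen candidate indices z^1, ..., z^T.
    """
    T = len(log_data)
    delta = log_prior + log_data[0]
    back = []
    for t in range(1, T):
        scores = delta[:, None] + log_trans[t - 1]     # shape (M^{t-1}, M^t)
        back.append(np.argmax(scores, axis=0))         # best predecessor for each candidate
        delta = np.max(scores, axis=0) + log_data[t]
    z = [int(np.argmax(delta))]
    for b in reversed(back):                           # backtrack through the chain
        z.append(int(b[z[-1]]))
    return z[::-1]
```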
39 Labels are then unified using the correspondences established between the chosen mixtures in consecutive time frames. [sent-88, score-0.304]
40 3 A statistical distance between mixtures The transition CPDs of the form P(z^t | z^{t−1}) are based on the assumption that the Gaussian sources' distributions are approximately stationary in pairs of consecutive time frames. [sent-91, score-0.44]
41 Under this assumption, two mixture candidates estimated at consecutive time frames are viewed as two samples from a single unknown Gaussian mixture. [sent-92, score-0.617]
42 We assume that each Gaussian component from either of the two mixtures arises from a single Gaussian component in the joint hidden mixture, and so the hidden mixture induces a partition of the set of visible components into clusters. [sent-93, score-1.079]
43 Gaussian components in the same cluster are assumed to arise from the same hidden source. [sent-94, score-0.287]
44 In Section 3.1, the estimation of the transition probability is formalized as an optimization of a Jensen-Shannon based score over the possible partitions of the set of Gaussian components. [sent-97, score-0.349]
45 If the family of allowed hidden mixture models is not further constrained, the optimization problem derived in Section 3.1 is trivially solved by choosing the most detailed partition (each visible Gaussian component is a singleton). [sent-98, score-0.475] [sent-99, score-0.323]
47 In Section 3.2 we suggest natural constraints on the family of allowed partitions in the two cases of a constant and a variable number of clusters through time, and present algorithms for both cases. [sent-102, score-0.318]
48 3.1 A Jensen-Shannon based transition score Assume that in two consecutive time frames we observed two labeled samples (X^1, L^1), (X^2, L^2) of sizes N^1, N^2 with empirical distributions Θ^1, Θ^2 respectively. [sent-104, score-0.748]
49 By ’empirical distribution’, or ’type’ in the notation of [2], we denote the ML parameters of the sample, for both the multinomial distribution of the mixture weights and the Gaussian distributions of the components. [sent-105, score-0.284]
50 As stated above, we assume that the joint sample of size N = N^1 + N^2 is generated by a hidden Gaussian mixture Θ^H with K^H components, and its components are determined by a partition of the set of all components in Θ^1, Θ^2. [sent-106, score-0.645]
51 A mapping R : {1, . . . , K^1 + K^2} → {1, . . . , K^H} matches each visible Gaussian component in Θ^1 or Θ^2 to its hidden source component in Θ^H. [sent-112, score-0.43]
52 Denote the labels of the sample points under the hidden mixture by H^j = {h_i^j}_{i=1}^{N^j}, j = 1, 2. [sent-113, score-0.409]
53 The values of these variables are given by h_i^j = R(l_i^j), where l_i^j is the label index in the set of all components. [sent-114, score-0.236]
54 The probabilistic dependence between a data point, its visible label, and its hidden label is explained by the Bayesian network model in Figure 1B. [sent-115, score-0.44]
55 We assume a data point is obtained by choosing a hidden label and then sampling the point from the relevant hidden component. [sent-116, score-0.411]
56 The visible label is then sampled based on the hidden label using a multinomial distribution with parameters Ψ = {Ψ_q}_{q=1}^{K^1+K^2}, where Ψ_q = P(l = q | h = R(q)), i.e., the probability of the visible label q given the hidden label R(q) (since H is deterministic given L, P(l = q | h) = 0 for h ≠ R(q)). [sent-117, score-0.561] [sent-119, score-0.525]
58 Denote by w_q (q ∈ {1, . . . , K^1 + K^2}) the weight of model q in a naive joint mixture of Θ^1, Θ^2. [sent-126, score-0.247]
59 α_m^H = Σ_{q:R(q)=m} w_q ,  Ψ_q = w_q / α_{R(q)}^H ,  µ_m^H = Σ_{q:R(q)=m} Ψ_q µ_q ,  Σ_m^H = Σ_{q:R(q)=m} Ψ_q (Σ_q + (µ_q − µ_m^H)(µ_q − µ_m^H)^t)  (4). Notice that the parameters of a hidden Gaussian, µ_m^H and Σ_m^H, are just the mean and covariance of the mixture Σ_{q:R(q)=m} Ψ_q G(x|µ_q, Σ_q). [sent-129, score-0.551]
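A sketch of equation (4): given the weights, means and covariances of all visible components and a partition map R, each hidden Gaussian is obtained by moment matching. Variable names are illustrative.

```python
import numpy as np

def hidden_mixture(w, mu, sigma, R, n_hidden):
    """Moment-match the hidden Gaussians implied by the partition R (equation 4).

    w : (Q,) naive joint weights of the visible components; mu : (Q, d); sigma : (Q, d, d);
    R : (Q,) integer array with R[q] = hidden source of visible component q.
    """
    d = mu.shape[1]
    alpha = np.zeros(n_hidden)
    mu_h = np.zeros((n_hidden, d))
    sigma_h = np.zeros((n_hidden, d, d))
    for m in range(n_hidden):
        members = np.where(R == m)[0]
        alpha[m] = w[members].sum()                     # alpha_m^H
        psi = w[members] / alpha[m]                     # Psi_q = w_q / alpha_{R(q)}^H
        mu_h[m] = psi @ mu[members]                     # mean of the member mixture
        diff = mu[members] - mu_h[m]
        sigma_h[m] = (np.einsum('q,qij->ij', psi, sigma[members])
                      + np.einsum('q,qi,qj->ij', psi, diff, diff))
    return alpha, mu_h, sigma_h
```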
60 The summation over q in expression (3) can be interpreted as the Jensen-Shannon (JS) divergence between the components assigned to the hidden source m, under Gaussian assumptions. [sent-130, score-0.33]
61 For a given parametric family, the JS divergence is a non-negative measurement which can be used to test whether several samples are derived from a single distribution from the family or from a mixture of different ones [6]. [sent-131, score-0.369]
62 The JS divergence is computed for a mixture of n empirical distributions P_1, . . . , P_n. [sent-132, score-0.346]
63 The mean and covariance of the mixture distribution, µ* and Σ*, are a function of the means and covariances of the components, with the formulae given in (4) for µ^H, Σ^H. [sent-138, score-0.29]
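Under these Gaussian assumptions the JS divergence reduces to the entropy of the moment-matched merged Gaussian minus the weighted entropies of its members, which only involves log-determinants; a minimal sketch (the constant entropy terms cancel because the weights sum to one):

```python
import numpy as np

def gaussian_js(pi, mu, sigma):
    """Jensen-Shannon divergence of weighted Gaussian components, with their mixture
    approximated by a single moment-matched Gaussian.

    pi : (n,) weights summing to one; mu : (n, d) means; sigma : (n, d, d) covariances.
    """
    mu_star = pi @ mu                                            # mean of the merged distribution
    diff = mu - mu_star
    sigma_star = (np.einsum('q,qij->ij', pi, sigma)
                  + np.einsum('q,qi,qj->ij', pi, diff, diff))    # covariance of the merged distribution
    logdet_star = np.linalg.slogdet(sigma_star)[1]
    logdets = np.array([np.linalg.slogdet(s)[1] for s in sigma])
    return 0.5 * (logdet_star - pi @ logdets)
```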
64 3.2 Constrained optimization and algorithms Consider first the case in which a one-to-one correspondence is assumed between clusters in two consecutive frames, and hence the number of Gaussian components K is constant over all time frames. [sent-144, score-0.46]
65 In this case, a mapping R is allowed iff it maps to each hidden source i a single Gaussian from mixture Θ^1 and a single Gaussian from Θ^2. [sent-145, score-0.495]
66 Denoting the Gaussians matched to hidden i by R_1^{−1}(i), R_2^{−1}(i), the transition score (6) takes the form of −N · max_R Σ_{i=1}^{K} S(R_1^{−1}(i), R_2^{−1}(i)). [sent-146, score-0.375]
67 Such an optimization of a pairwise matching score can be seen as a search for a maximal perfect matching in a weighted bipartite graph. [sent-147, score-0.283]
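For the constant-K case this maximal perfect matching can be computed with the Hungarian algorithm; a minimal sketch using SciPy, assuming a precomputed matrix S of pairwise scores (higher is better) between the K Gaussians of the two frames:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_one_to_one(S):
    """Maximal-score perfect matching between the Gaussians of two consecutive frames.

    S : (K, K) array; S[i, j] is the pairwise score for matching component i of frame t-1
        with component j of frame t.
    Returns the matched column for each row and the total score of the matching.
    """
    rows, cols = linear_sum_assignment(-S)      # Hungarian algorithm; negate to maximize
    return cols, S[rows, cols].sum()
```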
68 The one-to-one correspondence assumption is too strong for many data sets in the spike sorting application, as it ignores the phenomena of splits and merges of clusters. [sent-150, score-0.759]
69 We wish to allow such phenomena, but nevertheless enforce strong (though not perfect) demands of correspondence between the Gaussians in two consecutive frames. [sent-151, score-0.218]
70 Hence assignment of different Gaussians from the same mixture to the same hidden source is limited to cases of a split or a merge. [sent-154, score-0.5]
71 The label entropy of the partition R should satisfy H(α_1^H, . . . , α_{K^H}^H) ≤ (N^1/N)·H(α_1^1, . . . , α_{K^1}^1) + (N^2/N)·H(α_1^2, . . . , α_{K^2}^2)  (7). [sent-156, score-0.266]
72 Intuitively, the second constraint limits the allowed partitions to ones which are not richer than the visible partition. [sent-162, score-0.274]
73 Note that the most detailed partition (the partition into singletons) has a label entropy given by the r.h.s. of (7) plus H(N^1/N, N^2/N). [sent-165, score-0.411]
74 We start from the most detailed partition (each Gaussian is a singleton) and merge two clusters of the partition at each round. [sent-171, score-0.538]
75 At each round we look for a merge which incurs a minimal loss to the accumulated Jensen-Shannon score (6) and a maximal loss to the mixture label entropy. [sent-173, score-0.666]
76 The algorithm terminates when the accumulated label entropy reduction is bigger than H(N^1/N, N^2/N) or when no allowed merges exist anymore. [sent-176, score-0.343]
77 We nevertheless compute the score based on the R found, since this partition obeys the first constraint and usually is not far from satisfying the second. [sent-178, score-0.275]
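A schematic version of this greedy agglomeration for the variable-K case: starting from singletons, repeatedly merge the pair of partition clusters that loses the least Jensen-Shannon score per unit of label-entropy reduction, and stop once the accumulated reduction reaches H(N^1/N, N^2/N). The ratio criterion and the simplified stopping rule are illustrative choices, and the split/merge admissibility checks described in the text are omitted; gaussian_js repeats the helper sketched earlier.

```python
import numpy as np
from itertools import combinations

def gaussian_js(pi, mu, sigma):
    """Gaussian-approximated JS divergence (same formula as the earlier sketch)."""
    m = pi @ mu
    d = mu - m
    S = np.einsum('q,qij->ij', pi, sigma) + np.einsum('q,qi,qj->ij', pi, d, d)
    logdets = np.array([np.linalg.slogdet(s)[1] for s in sigma])
    return 0.5 * (np.linalg.slogdet(S)[1] - pi @ logdets)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def merge_two(Wa, ma, Ca, Wb, mb, Cb):
    """Moment-matched Gaussian of the merge of two weighted Gaussians."""
    W = Wa + Wb
    m = (Wa * ma + Wb * mb) / W
    C = (Wa * (Ca + np.outer(ma - m, ma - m)) + Wb * (Cb + np.outer(mb - m, mb - m))) / W
    return W, m, C

def greedy_partition(w, mu, sigma, n1, n2):
    """Greedily agglomerate the K^1 + K^2 visible Gaussians into hidden sources.

    w, mu, sigma : naive joint weights, means and covariances of all visible components;
    n1, n2       : sizes of the two consecutive-frame samples.
    Returns a list of clusters, each a list of visible-component indices.
    """
    N = float(n1 + n2)
    budget = entropy([n1 / N, n2 / N])            # allowed total label-entropy reduction
    clusters = [[q] for q in range(len(w))]       # start from the most detailed partition
    params = [(w[q], mu[q], sigma[q]) for q in range(len(w))]
    used = 0.0
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            (Wa, ma, Ca), (Wb, mb, Cb) = params[a], params[b]
            W = Wa + Wb
            dH = W * entropy([Wa / W, Wb / W])    # label entropy lost by this merge
            if used + dH > budget:
                continue                          # merge would make the partition too coarse
            dJS = W * gaussian_js(np.array([Wa / W, Wb / W]),
                                  np.stack([ma, mb]), np.stack([Ca, Cb]))
            cost = dJS / max(dH, 1e-12)           # JS loss per unit of entropy reduction
            if best is None or cost < best[0]:
                best = (cost, a, b, dH)
        if best is None:
            break                                 # no admissible merge remains
        _, a, b, dH = best
        params[a] = merge_two(*params[a], *params[b])
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b], params[b]
        used += dH
    return clusters
```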
78 A recording session typically lasted 2 hours during which monkeys completed 600 trials. [sent-183, score-0.217]
79 Table 1: Match scores between manual and automatic clustering. Number of frames (%): 3386 (75%), 860 (19%), 243 (5%), 55 (1%); number of electrodes (%): 13 (30%), 10 (23%), 10 (23%), 11 (25%). [sent-192, score-0.735]
80 In this data one cluster moves constantly, another splits into distinguished clusters, and at the end two clusters are merged. [sent-215, score-0.275]
81 The top and bottom rows show manual and automatic clustering solutions respectively. [sent-216, score-0.582]
82 Notice that during the split process of the bottom left area some ambiguous time frames exist in which 1, 2, or 3 cluster descriptions are reasonable. [sent-217, score-0.483]
83 The numbers below the images show the f_{1/2} score of the local match between the manual and the automatic clustering solutions (see text). [sent-220, score-0.762]
84 The manual and automatic clustering results were compared using a combined measure of precision P and recall R scores, f_{1/2} = 2PR/(R + P). [sent-225, score-0.612]
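A small sketch of such a combined measure over two label vectors for the same spikes; here precision and recall are computed over pairs of spikes assigned to the same cluster, which is one common way to compare clusterings. The paper does not spell out this detail, so treat it as an illustrative choice.

```python
import numpy as np

def f_half(manual, automatic):
    """Combined precision/recall score f = 2PR / (R + P) between two clusterings.

    Pairs of spikes co-clustered in `manual` are taken as the reference set and pairs
    co-clustered in `automatic` as the retrieved set.
    """
    manual, automatic = np.asarray(manual), np.asarray(automatic)
    iu = np.triu_indices(len(manual), k=1)                  # each unordered spike pair once
    same_m = np.equal.outer(manual, manual)[iu]
    same_a = np.equal.outer(automatic, automatic)[iu]
    tp = np.sum(same_m & same_a)
    precision = tp / max(np.sum(same_a), 1)
    recall = tp / max(np.sum(same_m), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```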
85 Statistics on the match between manual and automated clustering are described in Table 1. [sent-227, score-0.498]
86 In order to understand the score's scale we note that random clustering (with the same label distribution as the manual clustering) gets an f_{1/2} score of 0. [sent-228, score-0.692]
87 The trivial clustering which assigns all the points to the same label gets mean scores of 0. [sent-230, score-0.451]
88 67 for single frame matching and whole electrode matching respectively. [sent-232, score-0.416]
89 The scores of single frames are much higher than the full electrode scores, since the problem is much harder in the latter case. [sent-233, score-0.44]
90 A single wrong correspondence between two consecutive frames may reduce the electrode’s score dramatically, while being unnoticed by the single frame score. [sent-234, score-0.766]
91 In most cases the algorithm gives reasonably evolving clustering, even when it disagrees with the manual solution. [sent-235, score-0.215]
92 Low matching scores between the manual and the automatic clustering may result from inherent ambiguity in the data. [sent-237, score-0.706]
93 As a preliminary assessment of this hypothesis we obtained a second, independent, manual clustering for the data set for which we got the lowest match scores. [sent-238, score-0.446]
94 The matching scores between manual and automatic clustering are presented in Figure 3A. [sent-239, score-0.666]
95 Figure 3: (A) Comparison of our automatic clustering with 2 independent manual clustering solutions for our worst matched data points. [sent-243, score-0.763]
96 (B) Functional validation of clustering results: (1) At the beginning of a recording session, three clusters were identified. [sent-245, score-0.44]
97 In some cases, validity of the automatic clustering can be assessed by checking functional properties associated with the underlying neurons. [sent-250, score-0.293]
98 A review of methods for spike sorting: the detection and classification of neural action potentials. [sent-288, score-0.271]
99 Characterization of and compensation for the non-stationarity of spike shapes during physiological recordings. [sent-298, score-0.271]
100 Robust, automatic spike sorting using mixtures of multivariate t-distributions. [sent-305, score-0.711]
wordName wordTfidf (topN-words)
[('spike', 0.271), ('frames', 0.223), ('manual', 0.215), ('js', 0.212), ('mixture', 0.212), ('sorting', 0.203), ('frame', 0.195), ('clustering', 0.181), ('hidden', 0.145), ('partition', 0.145), ('visible', 0.138), ('clusters', 0.135), ('score', 0.13), ('gaussians', 0.128), ('consecutive', 0.126), ('gaussian', 0.126), ('mixtures', 0.125), ('cpds', 0.122), ('label', 0.121), ('electrode', 0.113), ('merge', 0.113), ('automatic', 0.112), ('merges', 0.106), ('scores', 0.104), ('transition', 0.1), ('wq', 0.097), ('correspondence', 0.092), ('stationarity', 0.09), ('cluster', 0.088), ('spikes', 0.084), ('dt', 0.083), ('electrodes', 0.081), ('descriptions', 0.081), ('recording', 0.081), ('li', 0.076), ('solutions', 0.074), ('allowed', 0.071), ('source', 0.067), ('candidate', 0.066), ('partitions', 0.065), ('divergence', 0.064), ('adams', 0.061), ('aharonbh', 0.061), ('comply', 0.061), ('hungarian', 0.061), ('singleton', 0.061), ('session', 0.061), ('recorded', 0.058), ('bayesian', 0.056), ('hebrew', 0.056), ('jerusalem', 0.056), ('matching', 0.054), ('components', 0.054), ('kh', 0.053), ('time', 0.053), ('labels', 0.052), ('automated', 0.052), ('splits', 0.052), ('stage', 0.051), ('match', 0.05), ('labor', 0.048), ('sources', 0.048), ('log', 0.048), ('family', 0.047), ('samples', 0.046), ('em', 0.045), ('dkl', 0.045), ('accumulated', 0.045), ('waveforms', 0.045), ('gets', 0.045), ('maximal', 0.045), ('candidates', 0.044), ('validation', 0.043), ('lewicki', 0.042), ('neuroscience', 0.042), ('covariance', 0.041), ('ambiguity', 0.04), ('scenarios', 0.04), ('map', 0.04), ('component', 0.04), ('human', 0.039), ('nt', 0.039), ('monkeys', 0.039), ('hj', 0.039), ('assignment', 0.038), ('split', 0.038), ('recordings', 0.037), ('tracked', 0.037), ('covariances', 0.037), ('network', 0.036), ('pca', 0.036), ('multinomial', 0.036), ('hours', 0.036), ('distributions', 0.036), ('joint', 0.035), ('hmm', 0.035), ('miller', 0.035), ('phenomena', 0.035), ('empirical', 0.034), ('movements', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
2 0.1760253 148 nips-2004-Probabilistic Computation in Spiking Populations
Author: Richard S. Zemel, Rama Natarajan, Peter Dayan, Quentin J. Huys
Abstract: As animals interact with their environments, they must constantly update estimates about their states. Bayesian models combine prior probabilities, a dynamical model and sensory evidence to update estimates optimally. These models are consistent with the results of many diverse psychophysical studies. However, little is known about the neural representation and manipulation of such Bayesian information, particularly in populations of spiking neurons. We consider this issue, suggesting a model based on standard neural architecture and activations. We illustrate the approach on a simple random walk example, and apply it to a sensorimotor integration task that provides a particularly compelling example of dynamic probabilistic computation.
3 0.16418907 112 nips-2004-Maximising Sensitivity in a Spiking Network
Author: Anthony J. Bell, Lucas C. Parra
Abstract: We use unsupervised probabilistic machine learning ideas to try to explain the kinds of learning observed in real neurons, the goal being to connect abstract principles of self-organisation to known biophysical processes. For example, we would like to explain Spike Timing-Dependent Plasticity (see [5,6] and Figure 3A), in terms of information theory. Starting out, we explore the optimisation of a network sensitivity measure related to maximising the mutual information between input spike timings and output spike timings. Our derivations are analogous to those in ICA, except that the sensitivity of output timings to input timings is maximised, rather than the sensitivity of output ‘firing rates’ to inputs. ICA and related approaches have been successful in explaining the learning of many properties of early visual receptive fields in rate coding models, and we are hoping for similar gains in understanding of spike coding in networks, and how this is supported, in principled probabilistic ways, by cellular biophysical processes. For now, in our initial simulations, we show that our derived rule can learn synaptic weights which can unmix, or demultiplex, mixed spike trains. That is, it can recover independent point processes embedded in distributed correlated input spike trains, using an adaptive single-layer feedforward spiking network.
4 0.16286133 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern
Author: Kosuke Hamaguchi, Masato Okada, Kazuyuki Aihara
Abstract: Repeated spike patterns have often been taken as evidence for the synfire chain, a phenomenon that a stable spike synchrony propagates through a feedforward network. Inter-spike intervals which represent a repeated spike pattern are influenced by the propagation speed of a spike packet. However, the relation between the propagation speed and network structure is not well understood. While it is apparent that the propagation speed depends on the excitatory synapse strength, it might also be related to spike patterns. We analyze a feedforward network with Mexican-Hattype connectivity (FMH) using the Fokker-Planck equation. We show that both a uniform and a localized spike packet are stable in the FMH in a certain parameter region. We also demonstrate that the propagation speed depends on the distinct firing patterns in the same network.
5 0.15688469 28 nips-2004-Bayesian inference in spiking neurons
Author: Sophie Deneve
Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation.
This means that the membrane potential is purely driven to its threshold by input fluctuations, or a random walk in membrane potential. As a consequence, the neuron’s firing will be memoryless, and close to a Poisson process. In particular, we found Fano factor close to 1 and quasi-exponential ISI distribution (figure 1E) on the entire range of parameters tested. Indeed, LIF neurons with balanced inputs have been proposed as a model to reproduce the statistics of real cortical neurons [8]. This balance is implemented in our model by the neuron’s effective self-inhibition, even when the synaptic input itself is not balanced. Decoding As we previously said, downstream elements could predict the log odds ratio Lt by computing Gt from the output spikes (Eq 1, fig 1-B). Of course, this requires an estimate of the transition probabilities ron , roff , that could be learned from the observed spike trains. However, we show next that explicit decoding is not necessary to perform bayesian inference in spiking networks. Intuitively, this is because the quantity that our model neurons receive and transmit, eg new information, is exactly what probabilistic inference algorithm propagate between connected statistical elements. 1 ¯ Even if go is not chosen optimally, the influence of the drift J is usually negligible compared to the large fluctuations in membrane potential. 2 Bayesian inference in cortical networks The model neurons, having the same input and output semantics, can be used as building blocks to implement more complex generative models consisting of coupled Markov chains. Consider, for example, the example in figure 2-A. Here, a ”parent” variable x1 t (the presence of a tiger) can cause the state of n other ”children” variables ([xk ]k=2...n ), t of whom two are represented (the presence of stripes,x2 , and motion, x3 ). The ”chilt t dren” variables are Bayesian neurons identical to those described previously. The resulting bayesian network consist of n + 1 coupled hidden Markov chains. Inference in this architecture corresponds to computing the log posterior odds ratio for the tiger, x1 , and the log t posterior of observing stripes or motion, ([xk ]k=2...n ), given the synaptic inputs received t by the entire network so far, i.e. s2 , . . . , sk . 0→t 0→t Unfortunately, inference and learning in this network (and in general in coupled Markov chains) requires very expensive computations, and cannot be performed by simply propagating messages over time and among the variable nodes. In particular, the state of a child k variable xt depends on xk , sk , x1 and the state of all other children at the previous t t t−dt time step, [xj ]2
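The integrator dynamics described in this entry can be sketched numerically. Below is a minimal Euler-step simulation of the log-odds variable L, its prediction G, and the firing rule L > G + go/2; all rate constants, synaptic statistics and the step size are illustrative assumptions rather than values taken from the entry, and the drift used for G is one reading of its stated dynamics.

```python
import numpy as np

def simulate_bayesian_neuron(T=3.0, dt=1e-3, ron=1.0, roff=1.0,
                             q_on=50.0, q_off=10.0, n_syn=20, go=2.0, seed=0):
    """Minimal sketch of the log-odds integrator neuron described above.
    All parameter values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    w = np.full(n_syn, np.log(q_on / q_off))       # synaptic weights w_i = log(q_on / q_off)
    theta = n_syn * (q_on - q_off)                 # bias per unit time (cost of silence)
    x = 0                                          # hidden binary state
    L = G = np.log(ron / roff)                     # start both at the stable level
    spikes = []
    for step in range(steps):
        # hidden state switches with rates ron / roff
        if x == 0 and rng.random() < ron * dt:
            x = 1
        elif x == 1 and rng.random() < roff * dt:
            x = 0
        # conditionally independent Poisson synaptic inputs
        s = (rng.random(n_syn) < (q_on if x else q_off) * dt).astype(float)
        L += (ron * (1 + np.exp(-L)) - roff * (1 + np.exp(L)) - theta) * dt + w @ s
        # G follows the same form of drift (one reading of the entry's Eq. 1),
        # plus a jump of go whenever an output spike is fired
        G += (ron * (1 + np.exp(-G)) - roff * (1 + np.exp(G))) * dt
        if L > G + go / 2.0:                       # fire when the prediction error exceeds go/2
            G += go
            spikes.append(step * dt)
    return np.array(spikes)
```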
6 0.15507647 77 nips-2004-Hierarchical Clustering of a Mixture Model
7 0.15484004 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
8 0.13063225 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
9 0.12225169 163 nips-2004-Semi-parametric Exponential Family PCA
10 0.1220004 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
11 0.11852661 161 nips-2004-Self-Tuning Spectral Clustering
12 0.11766431 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
13 0.11229253 115 nips-2004-Maximum Margin Clustering
14 0.10938415 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
15 0.1093149 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
16 0.10757827 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
17 0.10397956 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity
18 0.099056959 173 nips-2004-Spike-timing Dependent Plasticity and Mutual Information Maximization for a Spiking Neuron Model
19 0.093613401 44 nips-2004-Conditional Random Fields for Object Recognition
20 0.090879157 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits
topicId topicWeight
[(0, -0.293), (1, -0.162), (2, -0.115), (3, -0.132), (4, -0.043), (5, -0.082), (6, -0.098), (7, 0.114), (8, -0.052), (9, 0.061), (10, -0.027), (11, 0.238), (12, -0.012), (13, -0.007), (14, 0.02), (15, -0.056), (16, 0.134), (17, 0.021), (18, -0.068), (19, 0.001), (20, -0.069), (21, -0.051), (22, -0.099), (23, 0.133), (24, 0.055), (25, 0.024), (26, -0.127), (27, 0.036), (28, 0.045), (29, -0.097), (30, 0.061), (31, -0.021), (32, 0.044), (33, -0.012), (34, 0.021), (35, 0.034), (36, -0.025), (37, -0.046), (38, 0.047), (39, -0.007), (40, -0.106), (41, -0.042), (42, -0.066), (43, -0.035), (44, 0.036), (45, -0.037), (46, -0.013), (47, 0.167), (48, -0.023), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.96231198 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
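A minimal sketch of the second stage described in this abstract is given below: given per-frame candidate mixtures with log-scores and log transition probabilities between candidates of consecutive frames, the globally optimal chain is found by Viterbi-style dynamic programming. The interface (lists of score arrays and transition matrices) is assumed for illustration; computing the transition probabilities themselves, via the local-stationarity assumptions and the Gaussian Jensen-Shannon divergence, is not reproduced here.

```python
import numpy as np

def map_candidate_chain(frame_log_scores, transition_log_probs):
    """Pick one candidate mixture per time frame so that the total
    (candidate score + transition) log-probability is maximal.

    frame_log_scores:      list of length T; element t is an array of
                           log-scores for the candidates of frame t.
    transition_log_probs:  list of length T-1; element t is a matrix
                           [n_candidates_t x n_candidates_{t+1}] of log
                           transition probabilities (assumed given).
    Returns the MAP sequence of candidate indices, one per frame.
    """
    T = len(frame_log_scores)
    best = frame_log_scores[0].astype(float)
    back = []
    for t in range(1, T):
        # score[i, j] = best chain ending in candidate i at frame t-1,
        # extended to candidate j at frame t
        score = best[:, None] + transition_log_probs[t - 1] + frame_log_scores[t][None, :]
        back.append(np.argmax(score, axis=0))
        best = np.max(score, axis=0)
    # backtrack from the best final candidate
    path = [int(np.argmax(best))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```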
2 0.60947508 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
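As a rough illustration of the idea in this abstract, the sketch below greedily groups the components of a Gaussian mixture using only the model parameters. The merge cost used here (the increase in weighted log-determinant, a standard KL-based criterion) is a stand-in assumption; the paper defines its own distance between mixtures.

```python
import numpy as np

def merge_pair(w1, m1, C1, w2, m2, C2):
    """Moment-preserving merge of two weighted Gaussian components."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    d1, d2 = m1 - m, m2 - m
    C = (w1 * (C1 + np.outer(d1, d1)) + w2 * (C2 + np.outer(d2, d2))) / w
    return w, m, C

def reduce_mixture(weights, means, covs, k_target):
    """Greedy sketch of reducing a Gaussian mixture by grouping components.
    The merge cost is an illustrative KL-based stand-in, not the paper's distance."""
    comps = [(w, m, C) for w, m, C in zip(weights, means, covs)]

    def cost(a, b):
        (wa, ma, Ca), (wb, mb, Cb) = a, b
        w, _, C = merge_pair(wa, ma, Ca, wb, mb, Cb)
        logdet = np.linalg.slogdet
        return 0.5 * (w * logdet(C)[1] - wa * logdet(Ca)[1] - wb * logdet(Cb)[1])

    while len(comps) > k_target:
        # pick the cheapest pair to merge
        i, j = min(((i, j) for i in range(len(comps)) for j in range(i + 1, len(comps))),
                   key=lambda ij: cost(comps[ij[0]], comps[ij[1]]))
        comps[i] = merge_pair(*comps[i], *comps[j])
        comps.pop(j)
    return comps
```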
3 0.58271945 112 nips-2004-Maximising Sensitivity in a Spiking Network
Author: Anthony J. Bell, Lucas C. Parra
Abstract: We use unsupervised probabilistic machine learning ideas to try to explain the kinds of learning observed in real neurons, the goal being to connect abstract principles of self-organisation to known biophysical processes. For example, we would like to explain Spike TimingDependent Plasticity (see [5,6] and Figure 3A), in terms of information theory. Starting out, we explore the optimisation of a network sensitivity measure related to maximising the mutual information between input spike timings and output spike timings. Our derivations are analogous to those in ICA, except that the sensitivity of output timings to input timings is maximised, rather than the sensitivity of output ‘firing rates’ to inputs. ICA and related approaches have been successful in explaining the learning of many properties of early visual receptive fields in rate coding models, and we are hoping for similar gains in understanding of spike coding in networks, and how this is supported, in principled probabilistic ways, by cellular biophysical processes. For now, in our initial simulations, we show that our derived rule can learn synaptic weights which can unmix, or demultiplex, mixed spike trains. That is, it can recover independent point processes embedded in distributed correlated input spike trains, using an adaptive single-layer feedforward spiking network. 1 Maximising Sensitivity. In this section, we will follow the structure of the ICA derivation [4] in developing the spiking theory. We cannot claim, as before, that this gives us an information maximisation algorithm, for reasons that we will delay addressing until Section 3. But for now, to first develop our approach, we will explore an interim objective function called sensitivity which we define as the log Jacobian of how input spike timings affect output spike timings. 1.1 How to maximise the effect of one spike timing on another. Consider a spike in neuron j at time tl that has an effect on the timing of another spike in neuron i at time tk . The neurons are connected by a weight wij . We use i and j to index neurons, and k and l to index spikes, but sometimes for convenience we will use spike indices in place of neuron indices. For example, wkl , the weight between an input spike l and an output spike k, is naturally understood to be just the corresponding wij . dtk dtl threshold potential du u(t) R(t) resting potential tk output spikes tl input spikes Figure 1: Firing time tk is determined by the time of threshold crossing. A change of an input spike time dtl affects, via a change of the membrane potential du the time of the output spike by dtk . In the simplest version of the Spike Response Model [7], spike l has an effect on spike k that depends on the time-course of the evoked EPSP or IPSP, which we write as R kl (tk − tl ). In general, this Rkl models both synaptic and dendritic linear responses to an input spike, and thus models synapse type and location. For learning, we need only consider the value of this function when an output spike, k, occurs. In this model, depicted in Figure 1, a neuron adds up its spiking inputs until its membrane potential, ui (t), reaches threshold at time tk . This threshold we will often, again for convenience, write as uk ≡ ui (tk , {tl }), and it is given by a sum over spikes l: uk = wkl Rkl (tk − tl ) . (1) l To maximise timing sensitivity, we need to determine the effect of a small change in the input firing time tl on the output firing time tk . (A related problem is tackled in [2].) 
When tl is changed by a small amount dtl the membrane potential will change as a result. This change in the membrane potential leads to a change in the time of threshold crossing dt k . The contribution to the membrane potential, du, due to dtl is (∂uk /∂tl )dtl , and the change in du corresponding to a change dtk is (∂uk /∂tk )dtk . We can relate these two effects by noting that the total change of the membrane potential du has to vanish because u k is defined as the potential at threshold. ie: du = ∂uk ∂uk dtk + dtl = 0 . ∂tk ∂tl (2) This is the total differential of the function uk = u(tk , {tl }), and is a special case of the implicit function theorem. Rearranging this: dtk ∂uk =− dtl ∂tl ∂uk ˙ = −wkl Rkl /uk . ˙ ∂tk (3) Now, to connect with the standard ICA derivation [4], recall the ‘rate’ (or sigmoidal) neuron, for which yi = gi (ui ) and ui = j wij xj . For this neuron, the output dependence on input is ∂yi /∂xj = wij gi while the learning gradient is: ∂yi ∂ 1 log − fi (ui )xj = ∂wij ∂xj wij (4) where the ‘score functions’, fi , are defined in terms of a density estimate on the summed ∂ ∂ inputs: fi (ui ) = ∂ui log gi = ∂ui log p(ui ). ˆ The analogous learning gradient for the spiking case, from (3), is: ˙ j(a)Rka ∂ dtk 1 log − a . = ∂wij dtl wij uk ˙ (5) where j(a) = 1 if spike a came from neuron j, and 0 otherwise. Comparing the two cases in (4) and (5), we see that the input variable xj has become the temporal derivative of the sum of the EPSPs coming from synapse j, and the output variable (or score function) fi (ui ) has become u−1 , the inverse of the temporal derivative ˙k of the membrane potential at threshold. It is intriguing (A) to see this quantity appear as analogous to the score function in the ICA likelihood model, and, (B) to speculate that experiments could show that this‘ voltage slope at threshold’ is a hidden factor in STDP data, explaining some of the scatter in Figure 3A. In other words, an STDP datapoint should lie on a 2-surface in a 3D space of {∆w, ∆t, uk }. Incidentally, uk shows up in any ˙ ˙ learning rule optimising an objective function involving output spike timings. 1.2 How to maximise the effect of N spike timings on N other ones. Now we deal with the case of a ‘square’ single-layer feedforward mapping between spike timings. There can be several input and output neurons, but here we ignore which neurons are spiking, and just look at how the input timings affect the output timings. This is captured in a Jacobian matrix of all timing dependencies we call T. The entries of this matrix are Tkl ≡ ∂tk /∂tl . A multivariate version of the sensitivity measure introduced in the previous section is the log of the absolute determinant of the timing matrix, ie: log |T|. The full derivation for the gradient W log |T| is in the Appendix. Here, we again draw out the analogy between Square ICA [4] and this gradient, as follows. Square ICA with a network y = g(Wx) is: ∆W ∝ W log |J| = W−1 − f (u)xT (6) where the Jacobian J has entries ∂yi /∂xj and the score functions are now, fi (u) = ∂ − ∂ui log p(u) for the general likelihood case, with p(u) = i gi being the special case of ˆ ˆ ICA. We will now split the gradient in (6) according to the chain rule: W log |J| = [ J log |J|] ⊗ [ W J] j(l) − fk (u)xj wkl J−T ⊗ Jkl i(k) = (7) . (8) In this equation, i(k) = δik and j(l) = δjl . The righthand term is a 4-tensor with entries ∂Jkl /∂wij , and ⊗ is defined as A ⊗ Bij = kl Akl Bklij . 
We write the gradient this way to preserve, in the second term, the independent structure of the 1 → 1 gradient term in (4), and to separate a difficult derivation into two easy parts. The structure of (8) holds up when we move to the spiking case, giving: W log |T| = = [ T log |T|] ⊗ [ W T] T−T ⊗ Tkl i(k) j(l) − wkl (9) a ˙ j(a)Rka uk ˙ (10) where i(k) is now defined as being 1 if spike k occured in neuron i, and 0 otherwise. j(l) and j(a) are analogously defined. Because the T matrix is much bigger than the J matrix, and because it’s entries are more complex, here the similarity ends. When (10) is evaluated for a single weight influencing a single spike coupling (see the Appendix for the full derivation), it yields: ∆wkl ∝ ∂ log |T| Tkl = ∂wkl wkl T−1 lk −1 , (11) This is a non-local update involving a matrix inverse at each step. In the ICA case of (6), such an inverse was removed by the Natural Gradient transform (see [1]), but in the spike timing case, this has turned out not to be possible, because of the additional asymmetry ˙ introduced into the T matrix (as opposed to the J matrix) by the Rkl term in (3). 2 Results. Nonetheless, this learning rule can be simulated. It requires running the network for a while to generate spikes (and a corresponding T matrix), and then for each input/output spike coupling, the corresponding synapse is updated according to (11). When this is done, and the weights learn, it is clear that something has been sacrificed by ignoring the issue of which neurons are producing the spikes. Specifically, the network will often put all the output spikes on one output neuron, with the rates of the others falling to zero. It is happy to do this, if a large log |T| can thereby be achieved, because we have not included this ‘which neuron’ information in the objective. We will address these and other problems in Section 3, but now we report on our simulation results on demultiplexing. 2.1 Demultiplexing spike trains. An interesting possibility in the brain is that ‘patterns’ are embedded in spatially distributed spike timings that are input to neurons. Several patterns could be embedded in single input trains. This is called multiplexing. To extract and propagate these patterns, the neurons must demultiplex these inputs using its threshold nonlinearity. Demultiplexing is the ‘point process’ analog of the unmixing of independent inputs in ICA. We have been able to robustly achieve demultiplexing, as we now report. We simulated a feed-forward network with 3 integrate-and-fire neurons and inputs from 3 presynaptic neurons. Learning followed (11) where we replace the inverse by the pseudoinverse computed on the spikes generated during 0.5 s. The pseudo-inverse is necessary because even though on average, the learning matches number of output spikes to number of input spikes, the matrix T is still not usually square and so its actual inverse cannot be taken. In addition, in these simulations, an additional term is introduced in the learning to make sure all the output neurons fire with equal probability. This partially counters the ignoral of the ‘which neuron’ information, which we explained above. Assuming Poisson spike count ni for the ith output neuron with equal firing rate ni it is easy to derive in an approximate ¯ term that will control the spike count, i (¯ i − ni ). The target firing rates ni were set to n ¯ match the “source” spike train in this example. The network learns to demultiplex mixed spike trains, as shown in Figure 2. 
This demultiplexing is a robust property of learning using (11) with this new spike-controlling term. Finally, what about the spike-timing dependendence of the observed learning? Does it match experimental results? The comparison is made in Figure 3, and the answer is no. There is a timing-dependent transition between depression and potentiation in our result Spike Trains mixing mixed input trains 1 1 0.8 2 0.6 3 0 50 100 150 200 250 300 350 400 450 0.4 500 0.2 output 1 0 2 3 synaptic weights 0 50 100 150 200 250 300 350 400 450 500 original spike train 1 1 0.5 2 0 3 0 50 100 150 200 250 time in ms 300 350 400 450 500 −0.5 Figure 2: Unmixed spike trains. The input (top lef) are 3 spike trains which are a mixture of three independent Poison processes (bottom left). The network unmixes the spike train to approximately recover the original (center left). In this example 19 spikes correspond to the original with 4 deletion and 2 insertions. The two panels at the right show the mixing (top) and synaptic weight matrix after training (bottom). in Figure 3B, but it is not a sharp transition like the experimental result in Figure 3A. In addition, it does not transition at zero (ie: when tk − tl = 0), but at a time offset by the rise time of the EPSPs. In earlier experiments, in which we tranformed the gradient in (11) by an approximate inverse Hessian, to get an approximate Natural Gradient method, a sharp transition did emerge in simulations. However, the approximate inverse Hessian was singular, and we had to de-emphasise this result. It does suggest, however, that if the Natural Gradient transform can be usefully done on some variant of this learning rule, it may well be what accounts for the sharp transition effect of STDP. 3 Discussion Although these derivations started out smoothly, the reader possibly shares the authors’ frustration at the approximations involved here. Why isn’t this simple, like ICA? Why don’t we just have a nice maximum spikelihood model, ie: a density estimation algorithm for multivariate point processes, as ICA was a model in continuous space? We are going to be explicit about the problems now, and will propose a direction where the solution may lie. The over-riding problem is: we are unable to claim that in maximising log |T|, we are maximising the mutual information between inputs and outputs because: 1. The Invertability Problem. Algorithms such as ICA which maximise log Jacobians can only be called Infomax algorithms if the network transformation is both deterministic and invertable. The Spike Response Model is deterministic, but it is not invertable in general. When not invertable, the key formula (considering here vectors of input and output timings, tin and tout )is transformed from simple to complex. ie: p(tout ) = p(tin ) becomes p(tout ) = |T| solns tin p(tin ) d tin |T| (12) Thus when not invertable, we need to know the Jacobians of all the inputs that could have caused an output (called here ‘solns’), something we simply don’t know. 2. The ‘Which Neuron’ Problem. Instead of maximising the mutual information I(tout , tin ), we should be maximising I(tiout , tiin ), where the vector ti is the timing (A) STDP (B) Gradient 100 ∆ w (a.u.) 150 100 ∆ w / w (%) 150 50 0 −50 −100 −100 50 0 −50 −50 0 ∆ t (ms) 50 100 −100 −20 0 20 40 60 ∆ t (ms) 80 100 Figure 3: Dependence of synaptic modification on pre/post inter-spike interval. Left (A): From Froemke & Dan, Nature (2002)]. 
Dependence of synaptic modification on pre/post inter-spike interval in cat L2/3 visual cortical pyramidal cells in slice. Naturalistic spike trains. Each point represents one experiment. Right (B): According to Equation (11). Each point corresponds to an spike pair between approximately 100 input and 100 output spikes. vector, t, with the vector, i, of corresponding neuron indices, concatenated. Thus, ‘who spiked?’ should be included in the analysis as it is part of the information. 3. The Predictive Information Problem. In ICA, since there was no time involved, we did not have to worry about mutual informations over time between inputs and outputs. But in the spiking model, output spikes may well have (predictive) mutual information with future input spikes, as well as the usual (causal) mutual information with past input spikes. The former has been entirely missing from our analysis so far. These temporal and spatial infomation dependencies missing in our analysis so far, are thrown into a different light by a single empirical observation, which is that Spike TimingDependent Plasticity is not just a feedforward computation like the Spike Response Model. Specifically, there must be at least a statistical, if not a causal, relation between a real synapse’s plasticity and its neuron’s output spike timings, for Figure 3B to look like it does. It seems we have to confront the need for both a ‘memory’ (or reconstruction) model, such as the T we have thus far dealt with, in which output spikes talk about past inputs, and a ‘prediction’ model, in which they talk about future inputs. This is most easily understood from the point of view of Barber & Agakov’s variational Infomax algorithm [3]. They argue for optimising a lower bound on mutual information, which, for our neurons’, would be expressed using an inverse model p, as follows: ˆ I(tiin , tiout ) = H(tiin ) − log p(tiin |tiout ) ˆ p(tiin ,tiout ) ≤ I(tiin , tiout ) (13) In a feedforward model, H(tiin ) may be disregarded in taking gradients, leading us to the optimisation of a ‘memory-prediction’ model p(tiin |tiout ) related to something supposˆ edly happening in dendrites, somas and at synapses. In trying to guess what this might be, it would be nice if the math worked out. We need a square Jacobian matrix, T, so that |T| = p(tiin |tiout ) can be our memory/prediction model. Now let’s rename our feedforˆ → − ward timing Jacobian T (‘up the dendritic trees’), as T, and let’s fantasise that there is ← − some, as yet unspecified, feedback Jacobian T (‘down the dendritic trees’), which covers → − electrotonic influences as they spread from soma to synapse, and which T can be combined with by some operation ‘⊗’ to make things square. Imagine further, that doing this → ← − − yields a memory/prediction model on the inputs. Then the T we are looking for is T ⊗ T, → ← − − and the memory-prediction model is: p(tiin |tiout ) = T ⊗ T ˆ → − → − Ideally, the entries of T should be as before, ie: T kl = ∂tk /∂tl . What should the entries ← − ← − ← − of T be? Becoming just one step more concrete, suppose T had entries T lk = ∂cl /∂tk , where cl is some, as yet unspecified, value, or process, occuring at an input synapse when spike l comes in. What seems clear is that ⊗ should combine the correctly tensorised forms → − ← − → ← − − of T and T (giving them each 4 indices ijkl), so that T = T ⊗ T sums over the spikes k and l to give a I × J matrix, where I is the number of output neurons, and J the number of input neurons. 
Then our quantity, T, would represent all dependencies of input neuronal activity on output activity, summed over spikes. ← − Further, we imagine that T contains reverse (feedback) electrotonic transforms from soma ← − to synapse R lk that are somehow symmetrically related to the feedforward Spike Re→ − sponses from synapse to soma, which we now rename R kl . Thinking for a moment in terms of somatic k and synaptic l, voltages V , currents I and linear cable theory, the synapse to → − → − soma transform, R kl would be related to an impedance in Vk = Il Z kl , while the soma ← − ← − to synapse transform, R lk would be related to an admittance in Il = Vk Y lk [8]. The → − ← − symmetry in these equations is that Z kl is just the inverse conjugate of Y lk . Finally, then, what is cl ? And what is its relation to the calcium concentration, [Ca2+ ]l , at a synapse, when spike l comes in? These questions naturally follow from considering the experimental data, since it is known that the calcium level at synapses is the critical integrating factor in determining whether potentiation or depression occurs [5]. 4 Appendix: Gradient of log |T| for the full Spike Response Model. Here we give full details of the gradient for Gerstner’s Spike Response Model [7]. This is a general model for which Integrate-and-Fire is a special case. In this model the effect of a presynaptic spike at time tl on the membrane potential at time t is described by a post synaptic potential or spike response, which may also depend on the time that has passed since the last output spike tk−1 , hence the spike response is written as R(t − tk−1 , t − tl ). This response is weighted by the synaptic strength wl . Excitatory or inhibitory synapses are determined by the sign of wl . Refractoriness is incorporated by adding a hyper-polarizing contribution (spike-afterpotential) to the membrane potential in response to the last preceding spike η(t − tk−1 ). The membrane potential as a function of time is therefore given by u(t) = η(t − tk−1 ) + wl R(t − tk−1 , t − tl ) . (14) l We have ignored here potential contributions from external currents which can easily be included without modifying the following derivations. The output firing times t k are defined as the times for which u(t) reaches firing threshold from below. We consider a dynamic threshold, ϑ(t − tk−1 ), which may depend on the time since that last spike tk−1 , together then output spike times are defined implicitly by: t = tk : u(t) = ϑ(t − tk−1 ) and du(t) > 0. dt (15) For this more general model Tkl is given by Tkl = dtk =− dtl ∂u ∂ϑ − ∂tk ∂tk −1 ˙ ∂u wkl R(tk − tk−1 , tk − tl , ) = , ˙ ∂tl u(tk ) − ϑ(tk − tk−1 ) ˙ (16) ˙ ˙ where R(s, t), u(t), and ϑ(t) are derivatives with respect to t. The dependence of Tkl on ˙ tk−1 should be implicitly assumed. It has been omitted to simplify the notation. Now we compute the derivative of log |T| with respect to wkl . For any matrix T we have ∂ log |T|/∂Tab = [T−1 ]ba . Therefore: ∂ log |T| ∂Tab ∂ log |T| ∂Tab = [T−1 ]ba . (17) ∂wkl ∂Tab ∂wkl ∂wkl ab ab Utilising the Kronecker delta δab = (1 if a = b, else 0), the derivative of (16) with respect to wkl gives: ˙ ∂Tab ∂ wab R(ta − ta−1 , ta − tb ) = ˙ ˙ ∂wkl ∂wkl η(ta − ta−1 ) + wac R(ta − ta−1 , ta − tc ) − ϑ(ta − ta−1 ) c ˙ R(ta − ta−1 , ta − tb ) = δak δbl ˙ u(ta ) − ϑ(ta − ta−1 ) ˙ ˙ ˙ wab R(ta − ta−1 , ta − tb )δak R(ta − ta−1 , ta − tl ) − 2 ˙ u(ta ) − ϑ(ta − ta−1 ) ˙ = δak Tab Therefore: ∂ log |T| ∂wkl Tal δbl − wab wal . 
(18) δbl Tal − wab wal [T−1 ]ba δak Tab = ab = Tkl wkl [T−1 ]lk − [T−1 ]bk Tkl b (19) = Tkl [T−1 ]lk − 1 . wkl (20) Acknowledgments We are grateful for inspirational discussions with Nihat Ay, Michael Eisele, Hong Hui Yu, Jim Crutchfield, Jeff Beck, Surya Ganguli, Sophi` Deneve, David Barber, Fabian Theis, e Tony Zador and Arunava Banerjee. AJB thanks all RNI colleagues for many such discussions. References [1] Amari S-I. 1997. Natural gradient works efficiently in learning, Neural Computation, 10, 251-276 [2] Banerjee A. 2001. On the Phase-Space Dynamics of Systems of Spiking Neurons. Neural Computation, 13, 161-225 [3] Barber D. & Agakov F. 2003. The IM Algorithm: A Variational Approach to Information Maximization. Advances in Neural Information Processing Systems 16, MIT Press. [4] Bell A.J. & Sejnowski T.J. 1995. An information maximization approach to blind separation and blind deconvolution, Neural Computation, 7, 1129-1159 [5] Dan Y. & Poo M-m. 2004. Spike timing-dependent plasticity of neural circuits, Neuron, 44, 23-30 [6] Froemke R.C. & Dan Y. 2002. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 28, 416: 433-8 [7] Gerstner W. & Kistner W.M. 2002. Spiking neuron models, Camb. Univ. Press [8] Zador A.M., Agmon-Snir H. & Segev I. 1995. The morphoelectrotonic transform: a graphical approach to dendritic function, J. Neurosci., 15(3): 1669-1682
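A hedged sketch of the weight update quoted in this entry (Eq. 11/20) is given below: for each output/input spike pair (k, l), the gradient of log |T| with respect to w_kl is (T_kl [T^+]_lk - 1) / w_kl, with the pseudoinverse standing in for the inverse when T is not square, as in the reported simulations. Aggregating spike-pair gradients onto a synapse (i, j) by summation is an assumption of this sketch, and the construction of the timing matrix T itself is not reproduced.

```python
import numpy as np

def timing_weight_update(T, W, out_neuron, in_neuron, lr=0.01):
    """Apply the per-synapse update implied by the gradient of log|T|.

    T:          [n_out_spikes x n_in_spikes] timing-sensitivity matrix (assumed given)
    W:          [n_out_neurons x n_in_neurons] synaptic weights
    out_neuron: neuron index of each output spike k
    in_neuron:  neuron index of each input spike l
    """
    Tpinv = np.linalg.pinv(T)            # pseudoinverse, since T need not be square
    dW = np.zeros_like(W)
    for k, i in enumerate(out_neuron):
        for l, j in enumerate(in_neuron):
            if W[i, j] != 0.0:
                # spike-pair gradient (T_kl [T^+]_lk - 1) / w_kl, summed per synapse
                dW[i, j] += (T[k, l] * Tpinv[l, k] - 1.0) / W[i, j]
    return W + lr * dW
```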
4 0.57524139 158 nips-2004-Sampling Methods for Unsupervised Learning
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines EM with a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
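The sketch below illustrates the minimal-subset idea from this abstract in a simplified form: candidate Gaussian components are fitted to small random subsets of points and accepted when replacing the currently weakest component raises the data log-likelihood. The uniform subset sampling and the acceptance rule used here are assumptions; the actual algorithm draws subsets from learned proposal densities and interleaves this with EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ransac_style_refit(X, means, covs, weights, n_proposals=50, rng=None):
    """Propose Gaussian components from small random subsets and keep a
    proposal if it improves the mixture log-likelihood when it replaces
    the weakest current component (simplified illustration)."""
    rng = rng or np.random.default_rng()
    n, d = X.shape

    def loglik(ms, Cs, ws):
        dens = np.column_stack([w * multivariate_normal.pdf(X, m, C)
                                for m, C, w in zip(ms, Cs, ws)])
        return np.sum(np.log(dens.sum(axis=1) + 1e-300))

    best = loglik(means, covs, weights)
    for _ in range(n_proposals):
        subset = X[rng.choice(n, size=d + 2, replace=False)]     # minimal-ish subset
        m_new = subset.mean(axis=0)
        C_new = np.cov(subset.T) + 1e-6 * np.eye(d)              # regularize small-sample covariance
        worst = int(np.argmin(weights))                          # weakest current component
        ms, Cs = list(means), list(covs)
        ms[worst], Cs[worst] = m_new, C_new
        cand = loglik(ms, Cs, weights)
        if cand > best:
            means, covs, best = ms, Cs, cand
    return means, covs, weights
```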
5 0.57070488 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
Author: Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, Yoram Singer
Abstract: We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system’s inputs are the instantaneous spike rates. The system’s state dynamics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter. 1
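A minimal sketch of the decoding step described in this abstract follows, assuming the linear state map A, the kernel expansion coefficients, the support rate vectors and a Gaussian kernel have already been obtained; the discriminative training with a noise-robust loss is not shown, and the kernel choice is an assumption.

```python
import numpy as np

def rbf_kernel(r, R, gamma=0.1):
    """Gaussian kernel between one rate vector r and the support rate vectors R."""
    return np.exp(-gamma * np.sum((R - r) ** 2, axis=1))

def track_hand(rates, A, alphas, support_rates, x0):
    """Dynamical decoding sketch: x_t = A x_{t-1} + sum_i alpha_i k(r_t, r_i).

    rates:   [T x n_cells] instantaneous spike rates
    A:       [s x s] linear map on the state (location, velocity, acceleration)
    alphas:  [n_support x s] kernel expansion coefficients (assumed pre-learned)
    """
    x = np.asarray(x0, dtype=float)
    traj = []
    for r in rates:
        x = A @ x + rbf_kernel(r, support_rates) @ alphas
        traj.append(x.copy())
    return np.array(traj)
```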
6 0.54715979 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
7 0.53226525 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern
8 0.52853692 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
9 0.51873046 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
10 0.49592218 148 nips-2004-Probabilistic Computation in Spiking Populations
11 0.49419948 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
12 0.48432332 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
13 0.46395248 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits
14 0.45404822 28 nips-2004-Bayesian inference in spiking neurons
15 0.45401657 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
16 0.45226529 161 nips-2004-Self-Tuning Spectral Clustering
17 0.44348699 115 nips-2004-Maximum Margin Clustering
18 0.43554133 109 nips-2004-Mass Meta-analysis in Talairach Space
19 0.43280149 163 nips-2004-Semi-parametric Exponential Family PCA
20 0.41322348 29 nips-2004-Beat Tracking the Graphical Model Way
topicId topicWeight
[(13, 0.103), (15, 0.142), (26, 0.087), (31, 0.045), (33, 0.178), (35, 0.031), (37, 0.202), (39, 0.034), (50, 0.055), (52, 0.011), (71, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.87219536 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
2 0.79291344 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with those of a recent paper by G. Blanchard et al. 1
3 0.79125535 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions about the shape or maximal depth of the target PST nor requires any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
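The sketch below gives a simplified flavour of margin-based online prediction with suffix contexts: every suffix of the recent history up to a fixed depth carries a weight, the prediction is the sign of the summed weights of the matching suffixes, and those weights receive depth-discounted additive updates on a margin violation. The fixed maximum depth, the perceptron-style step and the 2^-depth discount are assumptions of this illustration, not the paper's actual update rule or its self-bounding mechanism.

```python
from collections import defaultdict

def suffix_context_online(sequence, max_depth=5, margin=1.0, lr=0.5):
    """Online binary prediction with weighted suffix contexts (simplified sketch).

    sequence: iterable of symbols in {-1, +1}.
    Returns the number of prediction mistakes and the learned context weights.
    """
    weights = defaultdict(float)      # maps a context tuple to its weight
    history, mistakes = [], 0
    for y in sequence:
        # matching contexts: the empty context plus suffixes up to max_depth
        contexts = [()]
        contexts += [tuple(history[-d:]) for d in range(1, min(max_depth, len(history)) + 1)]
        score = sum(weights[c] for c in contexts)
        if (1 if score >= 0 else -1) != y:
            mistakes += 1
        if y * score < margin:        # margin violation: update matching suffixes
            for depth, c in enumerate(contexts):
                # shallower contexts get larger steps (illustrative discount)
                weights[c] += lr * y * (2.0 ** (-depth))
        history.append(y)
    return mistakes, dict(weights)
```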
4 0.78641099 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.78187716 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using far fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
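A minimal sketch of a Perceptron with a margin-based randomized query rule is shown below. The particular sampling probability b / (b + |margin|) is a standard choice for this family of algorithms and is used here as an assumption; the constant b is illustrative.

```python
import numpy as np

def selective_sampling_perceptron(X, y, b=1.0, rng=None):
    """Perceptron that requests the true label only with probability
    b / (b + |margin|), so confident predictions rarely trigger a query.

    X: [n x d] instances, y: labels in {-1, +1}.
    Returns the weight vector, the number of labels queried, and the
    number of mistakes over the whole stream (counted with hindsight
    labels, for evaluation only).
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    w = np.zeros(d)
    queried = mistakes = 0
    for x, target in zip(X, y):
        margin = w @ x
        pred = 1 if margin >= 0 else -1
        if pred != target:
            mistakes += 1
        if rng.random() < b / (b + abs(margin)):
            queried += 1                      # label is revealed
            if pred != target:                # standard Perceptron update on a mistake
                w = w + target * x
    return w, queried, mistakes
```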
6 0.77813131 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
7 0.7758013 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.77567381 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
9 0.77548194 28 nips-2004-Bayesian inference in spiking neurons
10 0.77460092 103 nips-2004-Limits of Spectral Clustering
11 0.77390176 70 nips-2004-Following Curved Regularized Optimization Solution Paths
12 0.77385485 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
13 0.77357978 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
14 0.77312523 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
15 0.77279389 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
16 0.77111757 163 nips-2004-Semi-parametric Exponential Family PCA
17 0.77045721 68 nips-2004-Face Detection --- Efficient and Rank Deficient
18 0.77015114 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
19 0.77001721 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
20 0.76999521 116 nips-2004-Message Errors in Belief Propagation