nips nips2011 nips2011-13 knowledge-graph by maker-knowledge-mining

13 nips-2011-A blind sparse deconvolution method for neural spike identification

Source: pdf

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefﬁcients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 A blind deconvolution method for neural spike identiﬁcation Chaitanya Ekanadham Courant Institute New York University New York, NY 10012 chaitu@math. [sent-1, score-0.418]

2 Simoncelli Courant Institute Center for Neural Science Howard Hughes Medical Institute New York University New York, NY 10012 Abstract We consider the problem of estimating neural spikes from extracellular voltage recordings. [sent-4, score-0.524]

3 We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. [sent-6, score-0.856]

4 Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefﬁcients are sparse. [sent-7, score-0.878]

5 We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. [sent-9, score-0.178]

6 In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. [sent-10, score-0.692]

7 The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. [sent-11, score-0.167]

8 1 Introduction The identiﬁcation of individual spikes in extracellularly recorded voltage traces is a critical step in the analysis of neural data for much of systems neuroscience. [sent-12, score-0.466]

9 Each spike appears with a stereotyped waveform, whose shape depends on the cell morphology, the ﬁltering properties of the medium and the electrode, and the cell’s position relative to the electrode. [sent-14, score-0.389]

10 The “spike sorting” problem is that of identifying distinct cells and their respective spike times. [sent-15, score-0.377]

11 This is a difﬁcult statistical inverse problem, since one typically does not know the number of cells, the shapes of their waveforms, or the frequency or temporal dynamics of their spike trains (see [1] for a review). [sent-16, score-0.429]

12 This sparse blind deconvolution problem arises in a variety of contexts other than spike sorting, including radar [5], seismology [6], and acoustic processing [7, 8]. [sent-18, score-0.418]

13 Most current approaches to spike sorting (with notable exceptions [9, 10]) can be summarized in three steps ([1, 2]): (1) identify segments of neural activity (e. [sent-19, score-0.58]

14 Segments within the same cluster are interpreted as spikes of a single neuron, whose waveform is estimated by the cluster centroid. [sent-28, score-0.653]

15 This method works well in identifying temporally isolated spikes whose waveforms are easily distinguishable from background noise and each other. [sent-29, score-0.813]

16 However it generally fails for segments containing more than one spike (either from the same or different neurons), because these segments do not lie close to the clusters of any individual cell [1]. [sent-30, score-0.655]

17 For example, an unresolved question in neuroscience is whether the occurrence of correlated or synchronous spikes carries specialized information [13, 14]. [sent-37, score-0.413]

18 A method that systematically fails for synchronous spikes (e. [sent-39, score-0.382]

19 Practitioners have developed a wide range of manual adjustments to overcome these limitations, from adjusting the electrode position to isolate a single neuron, to manually performing the clustering for spike identiﬁcation. [sent-43, score-0.499]

20 However, previous studies have shown that there is great variability in manual sorting results [17], and that human choices for cluster parameters are often suboptimal [18]. [sent-44, score-0.202]

21 This need is becoming ever more urgent as the use of multi-electrode arrays increases ([19]): manual parameter selection for a multi-dimensional clustering problem becomes more difﬁcult and time-consuming as the number of electrodes grows. [sent-46, score-0.164]

22 We formulate the spike sorting problem as a Bayesian estimation problem by incorporating a prior model for the spikes and assuming a linear-Gaussian model for the recording given the spikes [2, 4]. [sent-47, score-1.196]

23 Although the generative model is simple, inferring the spike times and waveforms is challenging. [sent-48, score-0.751]

24 We approximate the most likely spikes and waveform shapes given the recording (i. [sent-49, score-0.761]

25 the maximuma-posteriori, or MAP solution), by alternating between solving for the spike times while ﬁxing the waveforms and vice versa. [sent-51, score-0.751]

26 Solving for optimal spike times and amplitudes with ﬁxed waveform shapes is itself an NP-hard problem, and we employ a novel method called continuous basis pursuit [20, 21], combined with iterative reweighting techniques, to approximate its solution. [sent-52, score-0.864]

27 We compare our method with clustering on simulated and real data, demonstrating substantial reduction in spike identiﬁcation errors (both misses and false positives), particularly when spikes overlap in the signal. [sent-53, score-0.988]

28 2 Model of voltage trace The major deﬁciency of clustering is that each time segment is modeled as a noisy version of a single centered waveform rather than a noisy superposition of multiple, time-shifted waveforms. [sent-54, score-0.608]

29 A simple generative model for the observed voltage trace V (t) is summarized as follows: N V (t) Kn ani Wn (t − τni ) + η(t) = (1) n=1 i=1 {τni }Kn i=1 {ani }Kn i=1 ∼ Poisson Process(λn ) ∼ N (1, ǫ2 ) n n = 1, . [sent-55, score-0.445]

30 , N (2) In words, the spikes are a Poisson processes with known rates {λn } and amplitudes independently normally distributed about unity. [sent-61, score-0.489]

31 The trace is the sum of convolutions of the spikes with their respective waveforms W ≡ {Wn (t)}N along with Gaussian noise η(t) (note: other log-concave n=1 noise distributions can be used). [sent-62, score-0.91]

32 Here, Kn is the (Poisson-distributed) number of spikes of the n’th waveform in the signal. [sent-63, score-0.653]

33 Thus, the model accounts for superimposed spikes, variability in spike amplitude, as well as background noise. [sent-64, score-0.411]

34 Note also that since the model describes the full voltage trace, it does not require a 2 (a) 4 15 20 4 PC 2 0 1 7 −5 3 5 −10 −20 0 PC 1 6 3 1 5 2 6 5 7 PC 2 2 10 10 0 −10 −20 20 (b) (c) −10 0 PC 1 10 (d) Figure 1: Illustration of clustering on simulated data. [sent-66, score-0.266]

35 (c) The top-left plot shows the true waveforms used in this example. [sent-71, score-0.429]

36 The other plots indicate the waveforms whose projections are the black points in (b). [sent-72, score-0.429]

37 (d) Another example of simulated data with a single biphasic waveform (not shown). [sent-73, score-0.355]

38 The projections of the spikes can have a non-Gaussian distribution in PC space. [sent-74, score-0.353]

39 Two clusters arise because the waveform has two peaks around which the segments can be centered. [sent-75, score-0.459]

40 The priors on the spike trains account for the observed variability in spike amplitudes and average spike rates with minimal assumptions. [sent-79, score-1.181]

41 (1) boils down to solving: 1 V (t)− ani Wn (t−τni ) {ani },{τni },W 2 n,i min 2 2,Σ + n,i 1 (ani − 1)2 + log(2πǫ2 ) − log(λn ) n 2ǫ2 2 n (4) where x 2,Σ = Σ−1/2 x 2 and Σ is the noise covariance. [sent-84, score-0.297]

42 The simplest such representation uses a dictionary containing discretely time-shifted copies of the waveforms themselves {Wn (t − i∆)}n,i . [sent-87, score-0.466]

43 The functions Cn (t), Un (t), and V (t), as well as the constants rn and θn depend on the waveform Wn (t) and are explained in Fig. [sent-89, score-0.3]

44 5, with xni1 being the ampli∆ tude and 2θn atan (xni3 /xni2 ) being the time-shift associated with the waveform Wn (t) (see [21] for a detailed development of this approach). [sent-95, score-0.3]

45 Perform a rescaling xnij ← zn and Wn (t) ← zn Wn (t) where the zn ’s are chosen to optimize F ( xnij zn , {zn Wn (t)}). [sent-108, score-0.678]

46 The ﬁnal step minimizes the ﬁrst term with respect to the waveforms while keeping the second term constant, and amounts to an L2 -constrained least squares problem (ridge regression) that can be solved very efﬁciently. [sent-113, score-0.429]

47 2 Solve spikes given waveforms In this step we wish to minimize the function F (·, W) while ensuring that the solution lies in the convex set C. [sent-116, score-0.782]

48 3 Solve rescaling factors The ﬁrst term of F (x, {Wn (t)}) does not change by much if one divides the coefﬁcients xnij by some zn and multiplies the corresponding waveform by zn 1 . [sent-139, score-0.654]

49 In order to avoid the solution where the waveforms/coefﬁcients become arbitrarily large/small, respectively, we perform a rescaling in a separate step and then optimize the waveform shapes subject to a ﬁxed norm constraint (described in the next section). [sent-141, score-0.395]

50 Since the second term decomposes into terms that are each only dependent on one zn , we can independently solve the following scalar optimizations numerically: zn ← arg max z>0 i 1 xni1 xni1 log (1 − ∆λn ) e− zγ + ∆λn φ1,ǫ2 n γ z n = 1, . [sent-142, score-0.275]

51 , N (10) These are essentially maximum likelihood estimates of the scale factors given ﬁxed coefﬁcients and waveform shapes. [sent-145, score-0.3]

52 5 xnij ← Wn (t) ← xnij zn zn Wn (t) ∀n, i, j (11) ∀n (12) This step is guaranteed not to increase the objective in Eq. [sent-148, score-0.438]

53 4 Solve waveforms given spikes Given a set of coefﬁcients x, we can optimize waveform shapes by solving: min W: Wi (t) 2 ≤ki 1 V (t) − (ΦW x)(t) 2 2 2 (13) where ki is the current norm of Wi (t). [sent-151, score-1.181]

54 The constraints ensure that only the waveform shapes change (ideally, we would like the norm to be held ﬁxed, but we relax to to an inequality to retain convexity), leaving any changes in scale to the previous step. [sent-152, score-0.393]

55 The intracellular recording provides ground truth spikes for one of the cells in the extracellular recording. [sent-163, score-0.632]

56 1 Simulated data We obtained three waveforms from retinal recordings made in the Chichilnisky lab at the Salk Institute (shown in Fig. [sent-165, score-0.496]

57 Three Poisson spike trains were sampled independently with rate (1−ρ)λ0 with λ0 = 10Hz. [sent-167, score-0.364]

58 To introduce a correlation of ρ = 1 , we sampled another Poisson spike train with 3 rate ρλ0 and added these spikes (with random jitter) to each of the previous three trains. [sent-168, score-0.675]

59 The spikes were convolved with the waveforms and Gaussian white noise was added (with σ six times the smallest waveform amplitude). [sent-171, score-1.142]

60 To reduce computational cost, we applied our method to disjoint segments of the trace, which were split off whenever activity was less than 3σ for more than half the waveform duration (about 4ms). [sent-175, score-0.433]

61 The waveforms were initialized randomly and P (γ) was Gamma-distributed with mean 0. [sent-176, score-0.429]

62 The waveforms were allowed to change in length by adding (removing) padding on the ends on each iteration if the values exceeded (did not exceed) 5% of the peak amplitude (similar to [7]). [sent-180, score-0.657]

63 Padding was added in increments of 10% of the current waveform length. [sent-181, score-0.3]

64 The learned waveforms and spike amplitude distributions are shown in Fig. [sent-183, score-0.941]

65 The amplitude distributions are well-matched to the generative distributions (shown in red). [sent-185, score-0.19]

66 To evaluate performance, we counted missed spikes (relative to the number of true spikes) and false positives (relative to the number of predicted spikes) for clustering and our method. [sent-186, score-0.653]

67 We varied the segment-ﬁnding threshold for clustering, and the amplitude threshold for our algorithm. [sent-187, score-0.268]

68 To visualize the errors, we chose optimal thresholds for each method (yielding the smallest number of misses and false positives), and then projected all segments used in clustering onto the ﬁrst two principal components. [sent-190, score-0.36]

69 We indicate by dots, open circles, and crosses the hits, misses, and false 6 −10 −10 0 samples 4 2 0 0 10 4 2 0 0 10 20 amplitude (σ units) (a) 6 relative freq 0 6 relative freq 6 10 relative freq amplitude (σ units) positives, respectively (with colors indicating the waveform). [sent-191, score-0.729]

70 Note that unlike clustering, our method is allowed to assign more than one spike to each segment. [sent-193, score-0.322]

71 2 0 0 10 20 amplitude (σ units) (b) 4 10 20 amplitude (σ units) (c) (d) Figure 3: (a) Three waveforms used in simulations. [sent-196, score-0.809]

72 (b),(c),(d) Histograms of the spike amplitudes learned by our algorithm of the blue,green, and red waveforms, respectively. [sent-197, score-0.487]

73 The amplitudes were converted into units σ by multiplying them by the corresponding waveform amplitudes, then dividing by the noise standard deviation. [sent-198, score-0.521]

74 (b),(c) Visualization of spike sorting errors for clustering (b) and our method (c). [sent-204, score-0.576]

75 Dots represent segments whose composite spikes were all correctly identiﬁed, with the color specifying the waveform (see Fig. [sent-206, score-0.786]

76 The resulting waveforms and coefﬁcients histograms are shown in Figure 5. [sent-215, score-0.429]

77 Unlike the simulated example, the spike amplitude distributions are bimodal in nature, despite the prior amplitude distribution containing only one Gaussian. [sent-216, score-0.757]

78 We ﬁrst focus on the high-amplitude groups (2 and 4), both of which are well-separated from their low-amplitude counterparts (1 and 3), suggesting that an appropriately chosen threshold would provide accurate spike identiﬁcation for the ground-truth cell (4). [sent-217, score-0.473]

79 Figures 6(b) and 6(c) show that, as before, the majority of this reduction is accounted for by recovering spikes overlapping with those of another cell (group 2). [sent-219, score-0.459]

80 The low-amplitude groups (1 and 3) could arise from background cells whose waveforms look like scaled-down versions of those of the foreground cells 2 and 4, thus creating secondary “lumps” in the amplitude distributions. [sent-220, score-0.774]

81 4 2 3 0 0 10 20 amplitude (σ units) (a) ground truth cell 4 10 20 amplitude (σ units) (b) (c) Figure 5: (a) Two waveforms learned from CBP. [sent-223, score-0.947]

82 (b),(c) Distributions of the amplitude values for the blue and green waveform, respectively. [sent-224, score-0.19]

83 The numbers label distinct groups of amplitudes that could be treated as spikes of a single cell. [sent-225, score-0.534]

84 15 k=2 k=3 k=4 CBP 10 10 2 5 −5 0 0 5 10 15 Percent true spikes missed (a) 20 1 3 0 −5 −10 5 2 5 1 3 0 PC 2 15 15 Missed overlapping spikes 10 PC 2 Percent est. [sent-230, score-0.81]

85 spikes false 20 −10 −15 −10 4 0 4 10 PC 1 20 (b) −15 −10 0 10 PC 1 20 (c) Figure 6: (a) Error tradeoff as in Fig. [sent-231, score-0.449]

86 (b) Illustration of clustering errors in PC-space, with k = 4 and a threshold corresponding to the large red dot in (a). [sent-234, score-0.197]

87 The numbers show the approximate location in PC-space of the amplitude groups demarcated in Figures 5(b) and 5(c). [sent-236, score-0.235]

88 5 Discussion We have formulated the spike sorting problem as a maximum-a-posteriori (MAP) estimation problem, assuming a linear-Gaussian likelihood of the observed trace given the spikes and a Poisson process prior on the spikes. [sent-237, score-0.866]

89 Unlike clustering methods, the model explicitly accounts for overlapping spikes, translation-invariance, and variability in spike amplitudes. [sent-238, score-0.496]

90 , [10]), our method jointly learns waveforms and spikes within a uniﬁed framework. [sent-241, score-0.782]

91 We showed empirically on simulated data that our method outperforms the standard clustering approach, particularly in the case of superimposed spikes. [sent-243, score-0.205]

92 We also showed that our method yields an improvement on a real data set with ground truth, despite the fact that there are similar waveform shapes with different amplitudes. [sent-244, score-0.4]

93 Our method has only a few parameters that are stable across a variety of conditions, thus addressing the need for an automated method for spike sorting that is not susceptible to systematic errors. [sent-246, score-0.516]

94 A review of methods for spike sorting: the detection and classiﬁcation of neural action potentials. [sent-250, score-0.322]

95 Simultaneous paired intracellular and tetrode recordings for evaluating the performance of spike sorting algorithms. [sent-257, score-0.597]

96 Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. [sent-297, score-0.447]

97 Wavelet footprints for detection and sorting of extracellular neural action potentials. [sent-302, score-0.183]

98 Recording spikes from a large fraction of the ganglion cells in a retinal patch. [sent-324, score-0.47]

99 Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. [sent-341, score-0.499]

100 Multiple neural spike train data analysis: state-ofthe-art and future challenges. [sent-347, score-0.322]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('waveforms', 0.429), ('spikes', 0.353), ('spike', 0.322), ('waveform', 0.3), ('ani', 0.266), ('amplitude', 0.19), ('wn', 0.153), ('amplitudes', 0.136), ('pc', 0.134), ('segments', 0.133), ('sorting', 0.125), ('xnij', 0.114), ('voltage', 0.113), ('ni', 0.107), ('zn', 0.105), ('clustering', 0.098), ('freq', 0.095), ('positives', 0.073), ('tetrode', 0.067), ('cell', 0.067), ('trace', 0.066), ('missed', 0.065), ('misses', 0.065), ('shapes', 0.065), ('false', 0.064), ('deconvolution', 0.061), ('extracellular', 0.058), ('tranchina', 0.057), ('simulated', 0.055), ('cells', 0.055), ('units', 0.054), ('cients', 0.054), ('intracellular', 0.052), ('superimposed', 0.052), ('courant', 0.05), ('wni', 0.05), ('kn', 0.049), ('poisson', 0.048), ('coef', 0.046), ('groups', 0.045), ('recording', 0.043), ('trains', 0.042), ('ekanadham', 0.041), ('reweighting', 0.041), ('susceptible', 0.041), ('un', 0.04), ('manual', 0.04), ('dots', 0.039), ('electrode', 0.039), ('overlapping', 0.039), ('figures', 0.039), ('threshold', 0.039), ('percent', 0.038), ('cbp', 0.038), ('jozsef', 0.038), ('padding', 0.038), ('pezaris', 0.038), ('rajat', 0.038), ('optimizations', 0.038), ('spacing', 0.038), ('dictionary', 0.037), ('variability', 0.037), ('cn', 0.037), ('retinal', 0.036), ('truth', 0.036), ('blind', 0.035), ('reweighted', 0.035), ('ground', 0.035), ('ki', 0.034), ('nonconvex', 0.034), ('henze', 0.033), ('circle', 0.033), ('acoustics', 0.033), ('icassp', 0.033), ('identi', 0.032), ('vn', 0.032), ('tradeoff', 0.032), ('errors', 0.031), ('noise', 0.031), ('neuroscience', 0.031), ('recordings', 0.031), ('superposition', 0.031), ('kenneth', 0.031), ('rescaling', 0.03), ('map', 0.03), ('red', 0.029), ('convolved', 0.029), ('cvx', 0.029), ('darrell', 0.029), ('harris', 0.029), ('synchronous', 0.029), ('systematic', 0.028), ('neuron', 0.028), ('held', 0.028), ('signals', 0.027), ('solve', 0.027), ('signal', 0.026), ('electrodes', 0.026), ('ganglion', 0.026), ('peaks', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000013 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

2 0.38343063 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

Author: Bo Chen, David E. Carlson, Lawrence Carin

Abstract: Nonparametric Bayesian methods are developed for analysis of multi-channel spike-train data, with the feature learning and spike sorting performed jointly. The feature learning and sorting are performed simultaneously across all channels. Dictionary learning is implemented via the beta-Bernoulli process, with spike sorting performed via the dynamic hierarchical Dirichlet process (dHDP), with these two models coupled. The dHDP is augmented to eliminate refractoryperiod violations, it allows the “appearance” and “disappearance” of neurons over time, and it models smooth variation in the spike statistics. 1

3 0.16884144 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation conﬁrms the method’s applicability to learning both stationary and temporal spike patterns. 1

4 0.14249083 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

Author: Konrad Koerding, Ian Stevenson

Abstract: Synaptic plasticity underlies learning and is thus central for development, memory, and recovery from injury. However, it is often difﬁcult to detect changes in synaptic strength in vivo, since intracellular recordings are experimentally challenging. Here we present two methods aimed at inferring changes in the coupling between pairs of neurons from extracellularly recorded spike trains. First, using a generalized bilinear model with Poisson output we estimate time-varying coupling assuming that all changes are spike-timing-dependent. This approach allows model-based estimation of STDP modiﬁcation functions from pairs of spike trains. Then, using recursive point-process adaptive ﬁltering methods we estimate more general variation in coupling strength over time. Using simulations of neurons undergoing spike-timing dependent modiﬁcation, we show that the true modiﬁcation function can be recovered. Using multi-electrode data from motor cortex we then illustrate the use of this technique on in vivo data. 1

5 0.13605377 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models

Author: Skander Mensi, Richard Naud, Wulfram Gerstner

Abstract: Variability in single neuron models is typically implemented either by a stochastic Leaky-Integrate-and-Fire model or by a model of the Generalized Linear Model (GLM) family. We use analytical and numerical methods to relate state-of-theart models from both schools of thought. First we ﬁnd the analytical expressions relating the subthreshold voltage from the Adaptive Exponential Integrate-andFire model (AdEx) to the Spike-Response Model with escape noise (SRM as an example of a GLM). Then we calculate numerically the link-function that provides the ﬁring probability given a deterministic membrane potential. We ﬁnd a mathematical expression for this link-function and test the ability of the GLM to predict the ﬁring probability of a neuron receiving complex stimulation. Comparing the prediction performance of various link-functions, we ﬁnd that a GLM with an exponential link-function provides an excellent approximation to the Adaptive Exponential Integrate-and-Fire with colored-noise input. These results help to understand the relationship between the different approaches to stochastic neuron models. 1 Motivation When it comes to modeling the intrinsic variability in simple neuron models, we can distinguish two traditional approaches. One approach is inspired by the stochastic Leaky Integrate-and-Fire (LIF) hypothesis of Stein (1967) [1], where a noise term is added to the system of differential equations implementing the leaky integration to a threshold. There are multiple versions of such a stochastic LIF [2]. How the noise affects the ﬁring probability is also a function of the parameters of the neuron model. Therefore, it is important to take into account the reﬁnements of simple neuron models in terms of subthreshold resonance [3, 4], spike-triggered adaptation [5, 6] and non-linear spike 1 initiation [7, 5]. All these improvements are encompassed by the Adaptive Exponential Integrateand-Fire model (AdEx [8, 9]). The other approach is to start with some deterministic dynamics for the the state of the neuron (for instance the instantaneous distance from the membrane potential to the threshold) and link the probability intensity of emitting a spike with a non-linear function of the state variable. Under some conditions, this type of model is part of a greater class of statistical models called Generalized Linear Models (GLM [10]). As a single neuron model, the Spike Response Model (SRM) with escape noise is a GLM in which the state variable is explicitly the distance between a deterministic voltage and the threshold. The original SRM could account for subthreshold resonance, refractory effects and spike-frequency adaptation [11]. Mathematically similar models were developed independently in the study of the visual system [12] where spike-frequency adaptation has also been modeled [13]. Recently, this approach has retained increased attention since the probabilistic framework can be linked with the Bayesian theory of neural systems [14] and because Bayesian inference can be applied to the population of neurons [15]. In this paper, we investigate the similarity and differences between the state-of-the-art GLM and the stochastic AdEx. The motivation behind this work is to relate the traditional threshold neuron models to Bayesian theory. Our results extend the work of Plesser and Gerstner (2000) [16] since we include the non-linearity for spike initiation and spike-frequency adaptation. We also provide relationships between the parameters of the AdEx and the equivalent GLM. These precise relationships can be used to relate analog implementations of threshold models [17] to the probabilistic models used in the Bayesian approach. The paper is organized as follows: We ﬁrst describe the expressions relating the SRM state-variable to the parameters of the AdEx (Sect. 3.1) in the subthreshold regime. Then, we use numerical methods to ﬁnd the non-linear link-function that models the ﬁring probability (Sect. 3.2). We ﬁnd a functional form for the SRM link-function that best describes the ﬁring probability of a stochastic AdEx. We then compare the performance of this link-function with the often used exponential or linear-rectiﬁer link-functions (also called half-wave linear rectiﬁer) in terms of predicting the ﬁring probability of an AdEx under complex stimulus (Sect. 3.3). We ﬁnd that the exponential linkfunction yields almost perfect prediction. Finally, we explore the relations between the statistic of the noise and the sharpness of the non-linearity for spike initiation with the parameters of the SRM. 2 Presentation of the Models In this section we present the general formula for the stochastic AdEx model (Sect. 2.1) and the SRM (Sect 2.2). 2.1 The Stochastic Adaptive Exponential Integrate-and-Fire Model The voltage dynamics of the stochastic AdEx is given by: V −Θ ˙ τm V = El − V + ∆T exp − Rw + RI + R (1) ∆T τw w = a(V − El ) − w ˙ (2) where τm is the membrane time constant, El the reverse potential, R the membrane resistance, Θ is the threshold, ∆T is the shape factor and I(t) the input current which is chosen to be an Ornstein−Θ Uhlenbeck process with correlation time-constant of 5 ms. The exponential term ∆T exp( V∆T ) is a non-linear function responsible for the emission of spikes and is a diffusive white noise with standard deviation σ (i.e. ∼ N (0, σ)). Note that the diffusive white-noise does not imply white noise ﬂuctuations of the voltage V (t), the probability distribution of V (t) will depend on ∆T and Θ. The second variable, w, describes the subthreshold as well as the spike-triggered adaptation both ˆ parametrized by the coupling strength a and the time constant τw . Each time tj the voltage goes to inﬁnity, we assumed that a spike is emitted. Then the voltage is reset to a ﬁxed value Vr and w is increased by a constant value b. 2.2 The Generalized Linear Model In the SRM, The voltage V (t) is given by the convolution of the injected current I(t) with the membrane ﬁlter κ(t) plus the additional kernel η(t) that acts after each spikes (here we split the 2 spike-triggered kernel in two η(t) = ηv (t) + ηw (t) for reasons that will become clear later): V (t) = ˆ ˆ ηv (t − tj ) + ηw (t − tj ) El + [κ ∗ I](t) + (3) ˆ {tj } ˆ Then at each time tj a spike is emitted which results in a change of voltage described by η(t) = ηv (t) + ηw (t). Given the deterministic voltage, (Eq. 3) a spike is emitted according to the ﬁring intensity λ(V ): λ(t) = f (V (t)) (4) where f (·) is an arbitrary function called the link-function. Then the ﬁring behavior of the SRM depends on the choice of the link-function and its parameters. The most common link-function used to model single neuron activities are the linear-rectiﬁer and the exponential function. 3 Mapping In order to map the stochastic AdEx to the SRM we follow a two-step procedure. First we derive the ﬁlter κ(t) and the kernels ηv (t) and ηw (t) analytically as a function of AdEx parameters. Second, we derive the link-function of the SRM from the stochastic spike emission of the AdEx. Figure 1: Mapping of the subthreshold dynamics of an AdEx to an equivalent SRM. A. Membrane ﬁlter κ(t) for three different sets of parameters of the AdEx leading to over-damped, critically damped and under-damped cases (upper, middle and lower panel, respectively). B. Spike-Triggered η(t) (black), ηv (t) (light gray) and ηw (gray) for the three cases. C. Example of voltage trace produced when an AdEx is stimulated with a step of colored noise (black). The corresponding voltage from a SRM stimulated with the same current and where we forced the spikes to match those of the AdEx (red). D. Error in the subthreshold voltage (VAdEx − VGLM ) as a function of the mean voltage of the AdEx, for the three different cases: over-, critically and under-damped (light gray, gray and black, respectively) with ∆T = 1 mV. Red line represents the voltage threshold Θ. E. Root Mean Square Error (RMSE) ratio for the three cases with ∆T = 1 mV. The RMSE ratio is the RMSE between the deterministic VSRM and the stochastic VAdEx divided by the RMSE between repetitions of the stochastic AdEx voltage. The error bar shows a single standard deviation as the RMSE ratio is averaged accross multiple value of σ. 3.1 Subthreshold voltage dynamics We start by assuming that the non-linearity for spike initiation does not affect the mean subthreshold voltage of the stochastic AdEx (see Figure 1 D). This assumption is motivated by the small ∆T 3 observed in in-vitro recordings (from 0.5 to 2 mV [8, 9]) which suggest that the subthreshold dynamics are mainly linear except very close to Θ. Also, we expect that the non-linear link-function will capture some of the dynamics due to the non-linearity for spike initiation. Thus it is possible to rewrite the deterministic subthreshold part of the AdEx (Eq. 1-2 without and without ∆T exp((V − Θ)/∆T )) using matrices: ˙ x = Ax (5) with x = V w and A = − τ1 m a τw − gl1m τ − τ1 w (6) In this form, the dynamics of the deterministic AdEx voltage is a damped oscillator with a driving force. Depending on the eigenvalues of A the system could be over-damped, critically damped or under-damped. The ﬁlter κ(t) of the GLM is given by the impulse response of the system of coupled differential equations of the AdEx, described by Eq. 5 and 6. In other words, one has to derive the response of the system when stimulating with a Dirac-delta function. The type of damping gives three different qualitative shapes of the kernel κ(t), which are summarized in Table 3.1 and Figure 1 A. Since the three different ﬁlters also affect the nature of the stochastic voltage ﬂuctuations, we will keep the distinction between over-damped, critically damped and under-damped scenarios throughout the paper. This means that our approach is valid for at least 3 types of diffusive voltage-noise (i.e. the white noise in Eq. 1 ﬁltered by 3 different membrane ﬁlters κ(t)). To complete the description of the deterministic voltage, we need an expression for the spiketriggered kernels. The voltage reset at each spike brings a spike-triggered jump in voltage of magˆ nitude ∆ = Vr − V (t). This perturbation is superposed to the current ﬂuctuations due to I(t) and can be mediated by a Delta-diract pulse of current. Thus we can write the voltage reset kernel by: ηv (t) = ∆ ∆ [δ ∗ κ] (t) = κ(t) κ(0) κ(0) (7) where δ(t) is the Dirac-delta function. The shape of this kernel depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). Finally, the AdEx mediates spike-frequency adaptation by the jump of the second variables w. From Eq. 2 we can see that this produces a current wspike (t) = b exp (−t/τw ) that can cumulate over subsequent spikes. The effect of this current on voltage is then given by the convolution of wspike (t) with the membrane ﬁlter κ(t). Thus in the SRM framework the spike-frequency adaptation is taken into account by: ηw (t) = [wspike ∗ κ](t) (8) Again the precise form of ηw (t) depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). At this point, we would like to verify our assumption that the non-linearity for spike emission can be neglected. Fig. 1 C and D shows that the error between the voltage from Eq. 3 and the voltage from the stochastic AdEx is generally small. Moreover, we see that the main contribution to the voltage prediction error is due to the mismatch close to the spikes. However the non-linearity for spike initiation may change the probability distribution of the voltage ﬂuctuations, which in turn inﬂuences the probability of spiking. This will inﬂuence the choice of the link-function, as we will see in the next section. 3.2 Spike Generation Using κ(t), ηv (t) and ηw (t), we must relate the spiking probability of the stochastic AdEx as a function of its deterministic voltage. According to [2] the probability of spiking in time bin dt given the deterministic voltage V (t) is given by: p(V ) = prob{spike in [t, t + dt]} = 1 − exp (−f (V (t))dt) (9) where f (·) gives the ﬁring intensity as a function of the deterministic V (t) (Eq. 3). Thus to extract the link-function f we have to compute the probability of spiking given V (t) for our SRM. To do so we apply the method proposed by Jolivet et al. (2004) [18], where the probability of spiking is simply given by the distribution of the deterministic voltage estimated at the spike times divided by the distribution of the SRM voltage when there is no spike (see ﬁgure 2 A). One can numerically compute these two quantities for our models using N repetitions of the same stimulus. 4 Table 1: Analytical expressions for the membrane ﬁlter κ(t) in terms of the parameters of the AdEx for over-, critically-, and under-damped cases. Membrane Filter: κ(t) over-damped if: (τm + τw )2 > 4τm τw (gl +a) gl κ(t) = k1 eλ1 t + k2 eλ2 t λ1 = 1 2τm τw (−(τm + τw ) + critically-damped if: (τm + τw )2 = 4τm τw (gl +a) gl κ(t) = (αt + β)eλt λ= under-damped if: (τm + τw )2 < 4τm τw (gl +a) gl κ(t) = (k1 cos (ωt) + k2 sin (ωt)) eλt −(τm +τw ) 2τm τw λ= −(τm +τw ) 2τm τw (τm + τw )2 − 4 τm τw (gl + a) gl λ2 = 1 2τm τw (−(τm + τw ) − α= τm −τw 2Cτm τw ω= τw −τm 2τm τw 2 − a g l τm τw (τm + τw )2 − 4 τm τw (gl + a) gl k1 = −(1+(τm λ2 )) Cτm (λ1 −λ2 ) k2 = 1+(τm λ1 ) Cτm (λ1 −λ2 ) β= 1 C k1 = k2 = 1 C −(1+τm λ) Cωτm The standard deviation σ of the noise and the parameter ∆T of the AdEx non-linearity may affect the shape of the link-function. We thus extract p(V ) for different σ and ∆T (Fig. 2 B). Then using visual heuristics and previous knowledge about the potential analytical expression of the link-funtion, we try to ﬁnd a simple analytical function that captures p(V ) for a large range of combinations of σ and ∆T . We observed that the log(− log(p)) is close to linear in most studied conditions Fig. 2 B suggesting the following two distributions of p(V ): V − VT (10) p(V ) = 1 − exp − exp ∆V V − VT p(V ) = exp − exp − (11) ∆V Once we have p(V ), we can use Eq. 4 to obtain the equivalent SRM link-function, which leads to: −1 f (V ) = log (1 − p(V )) (12) dt Then the two potential link-functions of the SRM can be derived from Eq. 10 and Eq. 11 (respectively): V − VT f (V ) = λ0 exp (13) ∆V V − VT (14) f (V ) = −λ0 log 1 − exp − exp − ∆V 1 with λ0 = dt , VT the threshold of the SRM and ∆V the sharpness of the link-function (i.e. the parameters that governs the degree of the stochasticity). Note that the exact value of λ0 has no importance since it is redundant with VT . Eq. 13 is the standard exponential link-function, but we call Eq. 14 the log-exp-exp link-function. 3.3 Prediction The next point is to evaluate the ﬁt quality of each link-function. To do this, we ﬁrst estimate the parameters VT and ∆V of the GLM link-function that maximize the likelihood of observing a spike 5 Figure 2: SRM link-function. A. Histogram of the SRM voltage at the AdEx ﬁring times (red) and at non-ﬁring times (gray). The ratio of the two distributions gives p(V ) (Eq. 9, dashed lines). Inset, zoom to see the voltage histogram evaluated at the ﬁring time (red). B. log(− log(p)) as a function of the SRM voltage for three different noise levels σ = 0.07, 0.14, 0.18 nA (pale gray, gray, black dots, respectively) and ∆T = 1 mV. The line is a linear ﬁt corresponding to the log-exp-exp linkfunction and the dashed line corresponds to a ﬁt with the exponential link-function. C. Same data and labeling scheme as B, but plotting f (V ) according to Eq. 12. The lines are produced with Eq. 14 with parameters ﬁtted as described in B. and the dashed lines are produced with Eq. 13. Inset, same plot but on a semi-log(y) axis. train generated with an AdEx. Second we look at the predictive power of the resulting SRM in terms of Peri-Stimulus Time Histogram (PSTH). In other words we ask how close the spike trains generated with a GLM are from the spike train generated with a stochastic AdEx when both models are stimulated with the same input current. For any GLM with link-function f (V ) ≡ f (t|I, θ) and parameters θ regulating the shape of κ(t), ˆ ηv (t) and ηw (t), the Negative Log-Likelihood (NLL) of observing a spike-train {t} is given by:   NLL = − log(f (t|I, θ)) − f (t|I, θ) (15) t ˆ t It has been shown that the negative log-likelihood is convex in the parameters if f is convex and logconcave [19]. It is easy to show that a linear-rectiﬁer link-function, the exponential link-function and the log-exp-exp link-function all satisfy these conditions. This allows efﬁcient estimation of ˆ ˆ the optimal parameters VT and ∆V using a simple gradient descent. One can thus estimate from a single AdEx spike train the optimal parameters of a given link-function, which is more efﬁcient than the method used in Sect. 3.2. The minimal NLL resulting from the gradient descent gives an estimation of the ﬁt quality. A better estimate of the ﬁt quality is given by the distance between the PSTHs in response to stimuli not used for parameter ﬁtting . Let ν1 (t) be the PSTH of the AdEx, and ν2 (t) be the PSTH of the ﬁtted SRM, 6 Figure 3: PSTH prediction. A. Injected current. B. Voltage traces produced by an AdEx (black) and the equivalent SRM (red), when stimulated with the current in A. C. Raster plot for 20 realizations of AdEx (black tick marks) and equivalent SRM (red tick marks). D. PSTH of the AdEx (black) and the SRM (red) obtained by averaging 10,000 repetitions. E. Optimal log-likelihood for the three cases of the AdEx, using three different link-functions, a linear-rectiﬁer (light gray), an exponential link-function (gray) and the link-function deﬁned by Eq. 14 (dark gray), these values are obtained by averaging over 40 different combinations σ and ∆T (see Fig. 4). Error bars are one standard deviation, the stars denote a signiﬁcant difference, two-sample t-test with α = 0.01. F. same as E. but for Md (Eq. 16). then we use Md ∈ [0, 1] as a measure of match: Md = 2 2 (ν1 (t) − ν2 (t)) dt ν1 (t)2 dt + ν2 (t)2 dt (16) Md = 1 means that it is impossible to differentiate the SRM from the AdEx in terms of their PSTHs, whereas a Md of 0 means that the two PSTHs are completely different. Thus Md is a normalized similarity measure between two PSTHs. In practice, Md is estimated from the smoothed (boxcar average of 1 ms half-width) averaged spike train of 1 000 repetitions for each models. We use both the NLL and Md to quantify the ﬁt quality for each of the three damping cases and each of the three link-functions. Figure 3 shows the match between the stochastic AdEx used as a reference and the derived GLM when both are stimulated with the same input current (Fig. 3 A). The resulting voltage traces are almost identical (Fig. 3 B) and both models predict almost the same spike trains and so the same PSTHs (Fig. 3 C and D). More quantitalively, we see on Fig. 3 E and F, that the linear-rectiﬁer ﬁts signiﬁcantly worse than both the exponential and log-exp-exp link-functions, both in terms of NLL and of Md . The exponential link-function performs as well as the log-exp-exp link-function, with a spike train similarity measure Md being almost 1 for both. Finally the likelihood-based method described above gives us the opportunity to look at the relationship between the AdEx parameters σ and ∆T that governs its spike emission and the parameters VT and ∆V of the link-function (Fig. 4). We observe that an increase of the noise level produces a ﬂatter link-function (greater ∆V ) while an increase in ∆T also produces an increase in ∆V and VT (note that Fig. 4 shows ∆V and VT for the exponential link-function only, but equivalent results are obtained with the log-exp-exp link-function). 4 Discussion In Sect. 3.3 we have shown that it is possible to predict with almost perfect accuracy the PSTH of a stochastic AdEx model using an appropriate set of parameters in the SRM. Moreover, since 7 Figure 4: Inﬂuence of the AdEx parameters on the parameters of the exponential link-function. A. VT as a function of ∆T and σ. B. ∆V as a function of ∆T and σ. the subthreshold voltage of the AdEx also gives a good match with the deterministic voltage of the SRM, we expect that the AdEx and the SRM will not differ in higher moments of the spike train probability distributions beyond the PSTH. We therefore conclude that diffusive noise models of the type of Eq. 1-2 are equivalent to GLM of the type of Eq. 3-4. Once combined with similar results on other types of stochastic LIF (e.g. correlated noise), we could bridge the gap between the literature on GLM and the literature on diffusive noise models. Another noteworthy observation pertains to the nature of the link-function. The link-function has been hypothesized to be a linear-rectiﬁer, an exponential, a sigmoidal or a Gaussian [16]. We have observed that for the AdEx the link-function follows Eq. 14 that we called the log-exp-exp linkfunction. Although the link-function is log-exp-exp for most of the AdEx parameters, the exponential link-function gives an equivalently good prediction of the PSTH. This can be explained by the fact that the difference between log-exp-exp and exponential link-functions happens mainly at low voltage (i.e. far from the threshold), where the probability of emitting a spike is so low (Figure 2 C, until -50 mv). Therefore, even if the exponential link-function overestimates the ﬁring probability at these low voltages it rarely produces extra spikes. At voltages closer to the threshold, where most of the spikes are emitted, the two link-functions behave almost identically and hence produce the same PSTH. The Gaussian link-function can be seen as lying in-between the exponential link-function and the log-exp-exp link-function in Fig. 2. This means that the work of Plesser and Gerstner (2000) [16] is in agreement with the results presented here. The importance of the time-derivative of the ˙ voltage stressed by Plesser and Gerstner (leading to a two-dimensional link-function f (V, V )) was not studied here to remain consistent with the typical usage of GLM in neural systems [14]. Finally we restricted our study to exponential non-linearity for spike initiation and do not consider other cases such as the Quadratic Integrate-and-ﬁre (QIF, [5]) or other polynomial functional shapes. We overlooked these cases for two reasons. First, there are many evidences that the non-linearity in neurons (estimated from in-vitro recordings of Pyramidal neurons) is well approximated by a single exponential [9]. Second, the exponential non-linearity of the AdEx only affects the subthreshold voltage at high voltage (close to threshold) and thus can be neglected to derive the ﬁlters κ(t) and η(t). Polynomial non-linearities on the other hand affect a larger range of the subthreshold voltage so that it would be difﬁcult to justify the linearization of subthreshold dynamics essential to the method presented here. References [1] R. B. Stein, “Some models of neuronal variability,” Biophys J, vol. 7, no. 1, pp. 37–68, 1967. [2] W. Gerstner and W. Kistler, Spiking neuron models. Cambridge University Press New York, 2002. [3] E. Izhikevich, “Resonate-and-ﬁre neurons,” Neural Networks, vol. 14, no. 883-894, 2001. [4] M. J. E. Richardson, N. Brunel, and V. Hakim, “From subthreshold to ﬁring-rate resonance,” Journal of Neurophysiology, vol. 89, pp. 2538–2554, 2003. 8 [5] E. Izhikevich, “Simple model of spiking neurons,” IEEE Transactions on Neural Networks, vol. 14, pp. 1569–1572, 2003. [6] S. Mensi, R. Naud, M. Avermann, C. C. H. Petersen, and W. Gerstner, “Parameter extraction and classiﬁcation of three neuron types reveals two different adaptation mechanisms,” Under review. [7] N. Fourcaud-Trocme, D. Hansel, C. V. Vreeswijk, and N. Brunel, “How spike generation mechanisms determine the neuronal response to ﬂuctuating inputs,” Journal of Neuroscience, vol. 23, no. 37, pp. 11 628–11 640, 2003. [8] R. Brette and W. Gerstner, “Adaptive exponential integrate-and-ﬁre model as an effective description of neuronal activity,” Journal of Neurophysiology, vol. 94, pp. 3637–3642, 2005. [9] L. Badel, W. Gerstner, and M. Richardson, “Dependence of the spike-triggered average voltage on membrane response properties,” Neurocomputing, vol. 69, pp. 1062–1065, 2007. [10] P. McCullagh and J. A. Nelder, Generalized linear models, 2nd ed. Chapman & Hall/CRC, 1998, vol. 37. [11] W. Gerstner, J. van Hemmen, and J. Cowan, “What matters in neuronal locking?” Neural computation, vol. 8, pp. 1653–1676, 1996. [12] D. Hubel and T. Wiesel, “Receptive ﬁelds and functional architecture of monkey striate cortex,” Journal of Physiology, vol. 195, pp. 215–243, 1968. [13] J. Pillow, L. Paninski, V. Uzzell, E. Simoncelli, and E. Chichilnisky, “Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model,” Journal of Neuroscience, vol. 25, no. 47, pp. 11 003–11 013, 2005. [14] K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, Bayesian brain: Probabilistic approaches to neural coding. The MIT Press, 2007. [15] S. Gerwinn, J. H. Macke, M. Seeger, and M. Bethge, “Bayesian inference for spiking neuron models with a sparsity prior,” in Advances in Neural Information Processing Systems, 2007. [16] H. Plesser and W. Gerstner, “Noise in integrate-and-ﬁre neurons: From stochastic input to escape rates,” Neural Computation, vol. 12, pp. 367–384, 2000. [17] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, june 2008, pp. 431 –438. [18] R. Jolivet, T. Lewis, and W. Gerstner, “Generalized integrate-and-ﬁre models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy,” Journal of Neurophysiology, vol. 92, pp. 959–976, 2004. [19] L. Paninski, “Maximum likelihood estimation of cascade point-process neural encoding models,” Network: Computation in Neural Systems, vol. 15, pp. 243–262, 2004. 9

6 0.13232745 24 nips-2011-Active learning of neural response functions with Gaussian processes

7 0.12357576 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

8 0.11398854 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

9 0.10515267 198 nips-2011-On U-processes and clustering performance

10 0.10425301 87 nips-2011-Energetically Optimal Action Potentials

11 0.10021984 225 nips-2011-Probabilistic amplitude and frequency demodulation

12 0.097739883 23 nips-2011-Active dendrites: adaptation to spike-based communication

13 0.093949862 86 nips-2011-Empirical models of spiking in neural populations

14 0.089960985 165 nips-2011-Matrix Completion for Multi-label Image Classification

15 0.088092752 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

16 0.082851835 249 nips-2011-Sequence learning with hidden units in spiking neural networks

17 0.076420642 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron

18 0.073272862 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

19 0.071915403 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

20 0.071256422 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.13), (2, 0.229), (3, -0.048), (4, 0.068), (5, 0.051), (6, -0.069), (7, 0.022), (8, 0.007), (9, 0.088), (10, 0.097), (11, 0.228), (12, -0.057), (13, -0.074), (14, -0.125), (15, -0.044), (16, -0.015), (17, -0.034), (18, -0.022), (19, 0.002), (20, 0.137), (21, 0.028), (22, 0.023), (23, -0.097), (24, 0.019), (25, 0.053), (26, 0.019), (27, 0.121), (28, -0.126), (29, 0.097), (30, 0.089), (31, -0.052), (32, 0.041), (33, 0.151), (34, 0.065), (35, 0.139), (36, -0.047), (37, -0.062), (38, 0.058), (39, 0.082), (40, -0.034), (41, -0.076), (42, 0.029), (43, 0.036), (44, 0.088), (45, -0.016), (46, 0.104), (47, -0.09), (48, -0.036), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96223295 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

2 0.89226747 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

Author: Bo Chen, David E. Carlson, Lawrence Carin

3 0.67951941 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models

Author: Skander Mensi, Richard Naud, Wulfram Gerstner

4 0.57872057 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

Author: Konrad Koerding, Ian Stevenson

5 0.49859977 24 nips-2011-Active learning of neural response functions with Gaussian processes

Author: Mijung Park, Greg Horwitz, Jonathan W. Pillow

Abstract: A sizeable literature has focused on the problem of estimating a low-dimensional feature space for a neuron’s stimulus sensitivity. However, comparatively little work has addressed the problem of estimating the nonlinear function from feature space to spike rate. Here, we use a Gaussian process (GP) prior over the inﬁnitedimensional space of nonlinear functions to obtain Bayesian estimates of the “nonlinearity” in the linear-nonlinear-Poisson (LNP) encoding model. This approach offers increased ﬂexibility, robustness, and computational tractability compared to traditional methods (e.g., parametric forms, histograms, cubic splines). We then develop a framework for optimal experimental design under the GP-Poisson model using uncertainty sampling. This involves adaptively selecting stimuli according to an information-theoretic criterion, with the goal of characterizing the nonlinearity with as little experimental data as possible. Our framework relies on a method for rapidly updating hyperparameters under a Gaussian approximation to the posterior. We apply these methods to neural data from a color-tuned simple cell in macaque V1, characterizing its nonlinear response function in the 3D space of cone contrasts. We ﬁnd that it combines cone inputs in a highly nonlinear manner. With simulated experiments, we show that optimal design substantially reduces the amount of data required to estimate these nonlinear combination rules. 1

6 0.44008547 302 nips-2011-Variational Learning for Recurrent Spiking Networks

7 0.43755212 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

8 0.42419335 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

9 0.41371158 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

10 0.40415993 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron

11 0.40311569 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

12 0.40290675 23 nips-2011-Active dendrites: adaptation to spike-based communication

13 0.38685799 285 nips-2011-The Kernel Beta Process

14 0.38393679 87 nips-2011-Energetically Optimal Action Potentials

15 0.37693033 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

16 0.35901412 198 nips-2011-On U-processes and clustering performance

17 0.34705213 86 nips-2011-Empirical models of spiking in neural populations

18 0.33346346 225 nips-2011-Probabilistic amplitude and frequency demodulation

19 0.29924774 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

20 0.29429558 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.021), (4, 0.029), (20, 0.025), (26, 0.011), (31, 0.092), (33, 0.04), (43, 0.044), (45, 0.064), (57, 0.03), (74, 0.062), (83, 0.417), (84, 0.037), (99, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95291454 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

2 0.90758324 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta

Abstract: In this paper, we derive a method to reﬁne a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to ﬁnd a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The speciﬁc example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We deﬁne a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also reﬁnement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer afﬁliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conﬂicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to ﬁnd the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efﬁciently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be speciﬁed by experts. For instance, the inﬂuence of one variable on another, or the combined inﬂuence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the speciﬁcation of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly speciﬁed by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Inﬂuence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] deﬁnes the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisﬁed: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to reﬁne the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. ∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the deﬁnition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisﬁed, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this deﬁnition of the CPT for V is that a constraint is more likely to be satisﬁed if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully deﬁned given the noisy-or parami eters θj . Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisﬁed (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisﬁed (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modiﬁed versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efﬁcient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisﬁed (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difﬁcult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisﬁed by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to ﬁnd a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to ﬁnd the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 5.2 CORRECTNESS OF MODEL REFINEMENT Given V∗ , our technique for model adjustment is guaranteed to choose a model M MAP ∈ M∗ by construction. If any constraint V ∗ ∈ V∗ is violated, the rejection sampling step of our technique 6 100 Comparing convergence of Different Techniques 80 70 60 50 40 30 Data Only Constraints Only Data+Constraints 20 10 0 1 2 3 4 5 Number of constraints used 6 −10 −12 −14 −16 −18 7 −20 Figure 7: Mean KLdivergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints. Gibbs Sampling Stochastic Hill Climbing Greedy Sampling −8 Negative Log Likelihood of MAP Estimate Percentage of tests correctly predicted 90 0 1 2 3 10 10 10 10 Elapsed Time (plotted on log scale from 0 to 1500 seconds) Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 9: Convergence rate comparison. would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us ﬁx the leak parameters and the link reliability from the ﬁrst cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms. 5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some M ∗ ∈ M∗ that we want to learn given V+ . We use our technique to ﬁnd M MAP . To evaluate M MAP , we ﬁrst compute the constraints, V∗ for M ∗ to get the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M MAP (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of M MAP with respect to each sampled model. We expect KLdivergence to decrease as the number of constraints in V+ increases since the feasible region beMAP comes smaller. Figure 7 conﬁrms this trend and shows that MDV+ has lower mean KL-divergence MAP MAP than MV+ , which has lower mean KL-divergence than MD . The data points in D are limited to the results of the diagnostic sessions needed to obtain V+ . As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints. We also compare the test consistency when learning from data only, constraints only or both. Given a ﬁxed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentatges for the greedy sampling technique (the results are similar for the other techniques). It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the times better than learning with only data. Figure 9 compares the convergence rate of each technique to ﬁnd the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to ﬁnd parameter settings with high posterior probability. 5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the 7 KL−divergence of when computing joint over all tests 8 Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint Data Only Constraints Only Data+Constraints 7 6 5 4 3 2 1 6 8 10 12 14 16 Number of constraints used 18 20 22 Figure 11: KL divergence comparison as the number of constraints increases for the real world problem. entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure is established. The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identiﬁed with a symptom represented by the ﬁrst test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model reﬁnement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from M∗ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only. 6 CONCLUSION AND FUTURE WORK In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: First, this is the ﬁrst approach that exploits implicit constraints based on value of information assessments. Secondly it is the ﬁrst approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions. Acknowledgments This work was supported by a grant from Intel Corporation. 8 References [1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010. [2] Eric E. Altendorf, Angelo C. Restiﬁcar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artiﬁcial Intelligence (UAI), Edinburgh, Scotland, July 2005. [3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005. [4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008. [5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995. [6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative inﬂuences. In Proceedings of Twenty Third International Conference on Uncertainty in Artiﬁcial Intelligence (UAI), Vancouver, BC, July 2007. [7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratiﬁcation in estimating diversity. Applied Mathematics and Computation, 43(3):221 – 229, 1991. [8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996. [9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995. [10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966. [11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009. [12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009. [13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), Montreal, QC, Canada, 2009. [14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artiﬁcial Intelligence, 20:111–116, 1984. [15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006. [16] Mark A. Peot and Ross D. Shachter. Learning from what you dont observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pages 439–446, Madison, WI, July 1998. [17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artiﬁcial Intelligence, 44(3):257–303, August 1990. [18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), San Francisco, CA, July 2000. 9

same-paper 3 0.89881927 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

4 0.88945085 145 nips-2011-Learning Eigenvectors for Free

Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

Abstract: We extend the classical problem of predicting a sequence of outcomes from a ﬁnite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1

5 0.81728524 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik

Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use ﬁrst order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modiﬁcations that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.

6 0.70476991 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

7 0.66292566 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

8 0.64689082 249 nips-2011-Sequence learning with hidden units in spiking neural networks

9 0.61532915 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

10 0.5930931 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models

11 0.57107234 86 nips-2011-Empirical models of spiking in neural populations

12 0.56084496 219 nips-2011-Predicting response time and error rates in visual search

13 0.55990452 102 nips-2011-Generalised Coupled Tensor Factorisation

14 0.5594551 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories

15 0.55393171 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

16 0.55041897 75 nips-2011-Dynamical segmentation of single trials from population neural data

17 0.53825355 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

18 0.5322994 158 nips-2011-Learning unbelievable probabilities

19 0.52673286 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

20 0.52632374 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes