nips nips2011 nips2011-135 knowledge-graph by maker-knowledge-mining

135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Source: pdf

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We ﬁnd that for large neural populations carrying a ﬁnite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufﬁcient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we ﬁnd that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched ﬁring rates. We argue that our ﬁndings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. [sent-7, score-0.515]

2 In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. [sent-8, score-0.109]

3 The Gaussian form of this asymptotic sufﬁcient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. [sent-10, score-0.439]

4 We argue that our ﬁndings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design. [sent-12, score-0.242]

5 Introduction It has long been argued that many key questions in neuroscience can best be posed in information-theoretic terms; the efficient coding hypothesis discussed in [2, 3, 1] represents perhaps the best-known example. [sent-13, score-0.144]

6 Answering these questions quantitatively requires us to compute the Shannon information rate of neural channels, whether numerically using experimental data or analytically in mathematical models. [sent-14, score-0.129]

7 In many cases it is useful to exploit connections with “ideal observer” analysis, in which the performance of an optimal Bayesian decoder places fundamental bounds on the performance of any biological system given access to the same neural information. [sent-15, score-0.18]

8 However, the non-linear, non-Gaussian, and correlated nature of neural responses has hampered the development of this theory, particularly in the case of high-dimensional and/or time-varying stimuli. [sent-16, score-0.091]

9 The neural decoding literature is far too large to review systematically here; instead, we will focus our attention on work which has attempted to develop an analytical theory to simplify these complex decoding and information-rate problems. [sent-17, score-0.463]

10 Two limiting regimes have received signiﬁcant analytical attention in the neuroscience literature. [sent-18, score-0.17]

11 On the other hand we can consider the “low-SNR” limit, where only a few neurons are observed and each neuron is asymptotically weakly tuned to the stimulus. [sent-21, score-0.299]

12 In this limit, the Shannon information tends to zero, and under certain conditions the optimal Bayesian estimator (which can be strongly nonlinear in general) can be approximated by a simpler linear estimator; see [5] and more recently [16] for details. [sent-22, score-0.133]

13 Likelihood in the intermediate regime: the inhomogeneous Poisson case For clarity, we begin by analyzing the information in a simple population of neurons, represented as inhomogeneous Poisson processes that are conditionally independent given the stimulus. [sent-24, score-0.342]

14 We will extend our analysis to more general neural populations in the next section. [sent-25, score-0.172]

15 In response to the stimulus, at each time step t neuron i fires with probability $\lambda_i(t)\,dt$, where the rate is given by $\lambda_i(t) = f[b_i(t) + \epsilon\,\ell_{i,t}(\theta)]$, (1) where f (. [sent-26, score-0.211]

16 The baseline firing rate is determined by $b_i(t)$ and is independent of the input signal. [sent-28, score-0.326]

17 The true stimulus at time t is defined by $\theta_t$, and θ abbreviates the time-varying stimulus $\theta_{0:T}$ in the time interval $[0, T\,dt]$. [sent-29, score-0.412]

18 The term $\ell_{i,t}(\theta)$ summarizes the dependence of the neuron’s firing rate on θ; depending on the setting, this term may represent e.g. [sent-30, score-0.096]

19 a tuning curve or a spatiotemporal ﬁlter applied to the stimulus (see examples below). [sent-32, score-0.238]

20 The likelihood includes all the information about the stimulus encoded in the population’s spiking response. [sent-33, score-0.312]

21 Neuron i’s response at time step t is designated by the binary variable $r_i(t)$. [sent-34, score-0.245]

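22 The loglikelihood at the parameter value ϑ (which may be different from the true parameter θ) is given by the standard point-process formula [21]: $L_\vartheta(r) := \log p(r|\vartheta) = \sum_{i=1}^{n} \sum_{t=0}^{T} r_i(t) \log \lambda_i(t) - \lambda_i(t)\,dt$. (2) [sent-35, score-0.312]

23 This expression can be expanded around $\epsilon = 0$, where ε is the sensitivity scale multiplying the stimulus term $\ell_{i,t}$ in eq. (1): $L_\vartheta(r) = L_\vartheta(r)|_{\epsilon=0} + \epsilon\,\frac{\partial L_\vartheta(r)}{\partial \epsilon}\Big|_{\epsilon=0} + \frac{1}{2}\epsilon^2\,\frac{\partial^2 L_\vartheta(r)}{\partial \epsilon^2}\Big|_{\epsilon=0} + O(n\epsilon^3)$, where $\frac{\partial L_\vartheta(r)}{\partial \epsilon}\Big|_{\epsilon=0} = \sum_{i,t} \ell_{i,t}(\vartheta)\left[ r_i(t)\,\frac{f'}{f}\big(b_i(t)\big) - f'\big(b_i(t)\big)\,dt \right]$ and $\frac{\partial^2 L_\vartheta(r)}{\partial \epsilon^2}\Big|_{\epsilon=0} = \sum_{i,t} \ell_{i,t}(\vartheta)^2\left[ r_i(t)\left(\frac{f'}{f}\right)'\big(b_i(t)\big) - f''\big(b_i(t)\big)\,dt \right]$. [sent-36, score-0.996]

24 This second-order loglikelihood expansion is standard in likelihood theory [24]; as usual, the first term is constant in ϑ and can therefore be ignored, while the third (quadratic) term controls the curvature of the loglikelihood at $\epsilon = 0$, and scales as $n\epsilon^2$. [sent-39, score-0.343]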
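The first-order term of this expansion can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' code. It assumes an exponential nonlinearity f = exp (so that f'/f = 1 and f' = exp(b)), uses the hypothetical names `eps` for the sensitivity scale and `ell` for the stimulus-dependent term, and picks arbitrary population sizes and a 30 Hz baseline purely for illustration. It compares the analytic derivative at eps = 0 with a central finite difference.

```python
import numpy as np

def loglik(r, b, ell, eps, dt, f=np.exp):
    """Discrete-time point-process loglikelihood: sum_{i,t} r log(lam) - lam * dt."""
    lam = f(b + eps * ell)
    return np.sum(r * np.log(lam) - lam * dt)

def score_at_zero(r, b, ell, dt):
    """Analytic d/d(eps) of loglik at eps = 0 for f = exp (f'/f = 1, f' = exp(b))."""
    return np.sum(ell * (r - np.exp(b) * dt))

# Illustrative sizes and rates (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, T, dt = 50, 200, 1e-3
b = np.full((n, T), np.log(30.0))            # baseline rate of 30 Hz
ell = rng.standard_normal((n, T))            # stand-in for the encoding term
r = (rng.random((n, T)) < np.exp(b) * dt).astype(float)  # binary spikes at eps = 0

h = 1e-5
fd = (loglik(r, b, ell, h, dt) - loglik(r, b, ell, -h, dt)) / (2 * h)
an = score_at_zero(r, b, ell, dt)
print(fd, an)
```

The two printed numbers should agree to several digits, which is a quick sanity check that the first-derivative expression is the one implied by the loglikelihood.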

25 Now, finally, we can more precisely define the “intermediate” SNR regime: we will focus on the case of large populations (n → ∞), but in order to keep the total information in a finite range we need to scale the sensitivity as $\epsilon \sim n^{-1/2}$. [sent-41, score-0.154]

26 So the first derivative term is the only part of the likelihood that depends both on the neural activity and ϑ, and may therefore be considered a sufficient statistic in this asymptotic regime: all the information about the stimulus is summarized in $\epsilon\,\frac{\partial L_\vartheta(r)}{\partial \epsilon}\Big|_{\epsilon=0} = \frac{1}{\sqrt{n}} \sum_i \ell_i(\vartheta)^T g_i(r_i)$. (3) [sent-43, score-0.646]

27 We may further apply the central limit theorem (CLT) to this sum of independent random vectors to conclude that this term converges to a Gaussian process indexed by ϑ (under mild technical conditions that we will ignore here, for clarity). [sent-44, score-0.092]

28 Thus this model enjoys the local asymptotic normality property observed in many parametric statistical models [24]: all of the information in the data can be summarized asymptotically by a sufﬁcient statistic with a sampling distribution that turns out to be Gaussian. [sent-45, score-0.301]

29 Example: Linearly ﬁltered stimuli and state-space models In many cases neurons are modeled in terms of simple rectiﬁed linear ﬁlters responding to the stimulus. [sent-46, score-0.164]

30 We can handle this case easily using the language introduced above, if we let $K_i$ denote the matrix implementing the transformation $(K_i\theta)_t = \ell_{i,t}(\theta)$, the projection of the stimulus onto the i-th neuron’s stimulus filter. [sent-47, score-0.439]

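31 Then, $\epsilon\,\frac{\partial L_\vartheta(r)}{\partial \epsilon}\Big|_{\epsilon=0} = \vartheta^T \frac{1}{\sqrt{n}} \sum_{i=1}^{n} K_i^T \left[ \mathrm{diag}\!\left(\frac{f_i'}{f_i}\right) r_i - f_i'\,dt \right] := \vartheta^T \Delta(r)$, where $f_i$ stands for the vector version of $f[b_i(t)]$. [sent-48, score-1.313]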

32 Thus all the information in the population spike train can be summarized in the random vector ∆(r), which is a simple linear function of the observed spike train data. [sent-49, score-0.745]

33 Thus, the neural population’s non-linear and temporally dynamic response to the stimulus is as informative in this intermediate regime as a single observation from a standard Gaussian experiment, in which the parameter θ is filtered linearly by J and corrupted by Gaussian noise. [sent-51, score-0.782]

34 All of the ﬁltering properties of the population are summarized by the matrix J. [sent-52, score-0.289]

35 (Note that if we consider each $K_i$ as a random sample from some distribution of filters, then J will converge by the law of large numbers to a matrix we can compute explicitly.) [sent-53, score-0.027]

36 Thus in many cases we can perform optimal Bayesian decoding of θ given the spike trains quite easily. [sent-54, score-0.462]

37 For example, if θ has a zero-mean Gaussian prior distribution with covariance $C_\theta$, then the posterior mean and the maximum-a-posteriori (MAP) estimate are well known and coincide with the optimal linear estimate (OLE): $\hat\theta_{OLE}(r) = E(\theta|r) = (J + C_\theta^{-1})^{-1}\Delta(r)$. [sent-55, score-0.131]
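The linear decoder above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation. The definition of J does not survive in this extract, so the code assumes the form implied by the Poisson model at θ = 0, namely J = (1/n) Σ_i K_i^T diag((f_i')² / f_i · dt) K_i, which is both the covariance of Δ(r) and the matrix multiplying θ in its mean. The function names, array shapes, the exponential nonlinearity and all numerical values are hypothetical choices.

```python
import numpy as np

def fisher_matrix(K, fprime, f0, dt):
    """Assumed form: J = (1/n) sum_i K_i^T diag(f_i'^2 / f_i * dt) K_i."""
    n = K.shape[0]
    w = fprime**2 / f0 * dt                        # shape (n, T)
    return np.einsum('itk,it,itl->kl', K, w, K) / n

def sufficient_stat(K, r, fprime, f0, dt):
    """Delta(r) = n^{-1/2} sum_i K_i^T [diag(f_i'/f_i) r_i - f_i' dt]."""
    n = K.shape[0]
    g = fprime / f0 * r - fprime * dt              # shape (n, T)
    return np.einsum('itk,it->k', K, g) / np.sqrt(n)

def ole_decode(J, C, delta):
    """theta_hat = (J + C^{-1})^{-1} Delta(r)."""
    return np.linalg.solve(J + np.linalg.inv(C), delta)

# Toy population (all sizes and rates are illustrative assumptions).
rng = np.random.default_rng(1)
n, T, d, dt = 400, 30, 5, 1e-3
K = rng.standard_normal((n, T, d))                 # K[i] maps theta to (K_i theta)_t
b = np.full((n, T), np.log(30.0))
f0 = np.exp(b)                                     # f = exp evaluated at the baseline
fprime = np.exp(b)                                 # for f = exp, f' = f
C = np.eye(d)                                      # prior covariance of theta
theta = rng.multivariate_normal(np.zeros(d), C)
eps = n ** -0.5                                    # intermediate-regime scaling
lam = np.exp(b + eps * np.einsum('itk,k->it', K, theta))
r = rng.poisson(lam * dt).astype(float)            # Poisson counts per bin

J = fisher_matrix(K, fprime, f0, dt)
delta = sufficient_stat(K, r, fprime, f0, dt)
theta_hat = ole_decode(J, C, delta)
print(theta, theta_hat)
```

Solving the linear system with `np.linalg.solve` avoids forming the outer inverse explicitly; the prior inverse is kept only because the toy prior is small and well conditioned.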

38 We know that, asymptotically, the sufficient statistic Δ(r) is as informative as the full population response r: $I(\theta : r) = I(\theta : \Delta(r))$. [sent-57, score-0.467]

39 To work through a concrete example, consider the case that the temporal sequence of parameter values $\theta_t$ is generated by an autoregressive process: $\theta_{t+1} = A\theta_t + \eta_t$, $\eta_t \sim N(0, R)$, for a stable dynamics matrix A and positive-semidefinite covariance matrix R. [sent-61, score-0.053]

40 , $K_i$ is block-diagonal with blocks $K_{i,t}$, and therefore the responses are modeled as $r_i(t) \sim \mathrm{Poiss}[f(b_i(t) + K_{i,t}\theta_t)\,dt]$. [sent-64, score-0.24]

41 Thus θ and the responses r together represent a state-space model. [sent-65, score-0.036]

42 This framework has been shown to lead to state-of-the-art performance in a wide variety of neural data analysis settings [14]. [sent-66, score-0.055]

43 Optimal decoding of θ given the observation sequence ∆1:T can therefore be accomplished via the standard forward-backward Kalman ﬁlter-smoother [10]; see Fig. [sent-71, score-0.187]
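For a scalar state, the forward-backward recursion mentioned above can be sketched as follows in Python with NumPy. This is a generic textbook Kalman filter with a Rauch-Tung-Striebel backward pass, written under the assumption that each observation is Δ_t = J_t θ_t + noise with variance J_t (the Gaussian experiment described earlier, with a diagonal J). The AR(1) parameters `a`, `q`, the prior variance `p0` and the values of `J_diag` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kalman_smoother(y, h, v, a, q, p0):
    """Posterior means for theta_{t+1} = a*theta_t + N(0, q), y_t = h_t*theta_t + N(0, v_t)."""
    T = len(y)
    m_pred, p_pred = np.zeros(T), np.zeros(T)
    m_filt, p_filt = np.zeros(T), np.zeros(T)
    m, p = 0.0, p0                                  # prior on theta_0
    for t in range(T):                              # forward (filtering) pass
        m_pred[t], p_pred[t] = m, p
        gain = p * h[t] / (h[t] ** 2 * p + v[t])
        m_filt[t] = m + gain * (y[t] - h[t] * m)
        p_filt[t] = (1.0 - gain * h[t]) * p
        m, p = a * m_filt[t], a ** 2 * p_filt[t] + q
    m_smooth = m_filt.copy()
    for t in range(T - 2, -1, -1):                  # backward (smoothing) pass
        g = p_filt[t] * a / p_pred[t + 1]
        m_smooth[t] = m_filt[t] + g * (m_smooth[t + 1] - m_pred[t + 1])
    return m_smooth

# Simulate the Gaussian experiment (all numbers are illustrative assumptions).
rng = np.random.default_rng(2)
T, a, q, p0 = 100, 0.95, 0.1, 1.0
theta = np.zeros(T)
theta[0] = rng.normal(0.0, np.sqrt(p0))
for t in range(T - 1):
    theta[t + 1] = a * theta[t] + rng.normal(0.0, np.sqrt(q))
J_diag = rng.uniform(0.5, 1.5, T)                   # diagonal of J
delta = J_diag * theta + np.sqrt(J_diag) * rng.standard_normal(T)

theta_smooth = kalman_smoother(delta, h=J_diag, v=J_diag, a=a, q=q, p0=p0)
```

Because the model is linear and Gaussian, the smoothed means coincide with the batch solution of the corresponding linear system, so the recursion is an O(T) route to the same estimate.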

44 The information rate $\lim_{T\to\infty} I(\theta_{0:T} : r_{0:T}) = \lim_{T\to\infty} I(\theta_{0:T} : \Delta(r)_{0:T})$ may be computed via similar recursions in the stationary case (i. [sent-73, score-0.032]

45 In the former case, the stimulus is a phase variable, and therefore does not ﬁt gracefully into the linear setting described above; in the latter case, place ﬁelds and grid ﬁelds are not well-approximated as linear functions of position. [sent-78, score-0.249]

46 If we apply our general theory in these settings, the interpretation of the encoding function $\ell_i(\theta)$ does not change significantly: $\ell_i(\theta)$ could represent the tuning curve of neuron i as a function of the orientation of the visual stimulus, or of the animal’s location in space. [sent-79, score-0.306]

47 However, without further assumptions the limiting sufficient statistic, which is a weighted sum of these encoding functions $\ell_i(\theta)$ (recall eq. [sent-80, score-0.15]

48 To simplify matters somewhat, we can introduce a mild assumption on the tuning functions $\ell_i(\theta)$. [sent-82, score-0.032]

49 Let’s assume that these functions may be expressed in some low-dimensional basis: $\ell_i(\theta) = K_i\Phi(\theta)$, for some vectors $K_i$, and Φ(θ) is defined to map θ into an $mT$-dimensional space which is usually smaller than $\dim(\theta) = \dim(\theta_t)\,T$. [sent-83, score-0.048]

50 The key point here is that we may now simply follow the derivation of the last section with Φ(θ) in place of θ; we find that the sufficient statistic may be represented asymptotically as an $mT$-dimensional Gaussian vector with mean JΦ(θ) and covariance J, with J defined as in the preceding section. [sent-85, score-0.29]

51 So in most interesting nonlinear cases we can no longer compute the optimal Bayesian decoder or the Shannon information rate analytically. [sent-87, score-0.211]

52 However, our approach does lead to a major simpliﬁcation in numerical investigations into theoretical coding issues. [sent-88, score-0.041]

53 This short-time limit is sensible in some physiological and psychophysical contexts [22] and was examined analytically in [15] to study the impact of inter-neuron dependencies on information transmission. [sent-92, score-0.03]

54 We begin by writing the loglikelihood of the observed spike count vector r in a single time-bin of length dt: $L_\vartheta(r) := \log p(r|\vartheta) = \sum_i r_i \log f[b_i + \ell_i(\vartheta)] - f[b_i + \ell_i(\vartheta)]\,dt$. [sent-94, score-0.491]

55 The second term does not depend on r; therefore, all information in r about θ resides in the sufficient statistic $\Delta_\vartheta(r) := \sum_i r_i \log f[b_i + \ell_i(\vartheta)]$. [sent-95, score-0.376]

56 Since the i-th neuron fires with probability $f[b_i + \ell_i(\theta)]\,dt$, the mean of $\Delta_\vartheta(r)$ scales with $n\,dt$, and it is clear that $dt = 1/n$ is a natural scaling of the time bin. [sent-96, score-0.306]

57 In general, this limiting Gaussian process will be inﬁnite-dimensional. [sent-98, score-0.075]

58 )) and the encoding functions $\ell_i(\theta)$ are of the finite-dimensional form considered above, $\ell_i(\theta) = K_i^T\Phi(\theta)$, then the $\log f[b_i + \ell_i(\vartheta)]$ term in the definition of $\Delta_\vartheta(r)$ simplifies: in this case, all information about θ is captured by the sufficient statistic $\sum_i r_i K_i$. [sent-101, score-0.451]
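A small numerical illustration of this sufficiency claim, as a sketch in Python with NumPy. It assumes an exponential nonlinearity f = exp and a single time bin, treats each K_i as a row of a hypothetical matrix `K`, and uses `phi` for the value of Φ(θ); these names and all numbers are illustrative assumptions. Two different spike-count vectors that share the same value of Σ_i r_i K_i should produce the same loglikelihood ratio between any two candidate stimuli.

```python
import numpy as np

def loglik_bin(r, b, K, phi, dt):
    """Single-bin Poisson loglikelihood for f = exp, up to terms that depend on r only."""
    u = b + K @ phi                        # b_i + K_i^T Phi(theta) for every neuron i
    return np.sum(r * u - np.exp(u) * dt)

def suff_stat(r, K):
    """Sum_i r_i K_i, the candidate sufficient statistic."""
    return K.T @ r

rng = np.random.default_rng(3)
n, m = 40, 3
dt = 1.0 / n                               # the dt = 1/n scaling of the time bin
K = rng.standard_normal((n, m))
K[1] = K[0]                                # neurons 0 and 1 share a tuning vector
b = rng.standard_normal(n)                 # baselines differ across neurons

r1 = np.zeros(n); r1[0] = 1.0              # only neuron 0 spikes
r2 = np.zeros(n); r2[1] = 1.0              # only neuron 1 spikes
phi_a, phi_b = rng.standard_normal(m), rng.standard_normal(m)

ratio1 = loglik_bin(r1, b, K, phi_a, dt) - loglik_bin(r1, b, K, phi_b, dt)
ratio2 = loglik_bin(r2, b, K, phi_a, dt) - loglik_bin(r2, b, K, phi_b, dt)
print(ratio1, ratio2)
```

The baseline terms cancel in each ratio, so the two responses are indistinguishable for inference about the stimulus even though they come from different neurons.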

59 Likelihood in the intermediate regime: non-Poisson effects We conclude by discussing the generalization to non-Poisson networks with interneuronal dependencies and nontrivial correlation structure. [sent-104, score-0.195]

60 Since the encoding ﬁlters Ki act instantaneously in this model (Ki can be represented as a delta function, weighted by n−1/2 ), the observed spike trains can be considered observations from a state-space model, as described above. [sent-106, score-0.354]

61 The weights $w_{ji}$ were generated randomly from a uniform distribution on the interval $[-5/n, 5/n]$, with self-weights $w_{ii} = 0$, and $\sum_j w_{ji} = 0$ to enforce detailed balance in the network. [sent-107, score-0.122]

62 Note that, while the interneuronal coupling is weak in this example, the autocorrelation in these spike trains is quite strong on short time scales, due to the absolute refractory effect. [sent-108, score-0.408]

63 We compared two estimators of θ: the full (nonlinear) MAP estimate $\hat\theta_{MAP} = \arg\max_\theta p(\theta|r)$, which we computed using the fast direct optimization methods described in [14], and the limiting optimal estimator $\hat\theta_\Delta := (J + C_\theta^{-1})^{-1}\Delta(r)$. [sent-109, score-0.154]

64 Note that J is diagonal; we computed the expectations in the definition of J using the numerical approach described above in this simulation, though in [Figure 1 panel titles: spike train(s) with 2 ms refractory period, 20 ms synaptic time constant and baseline rate 30 Hz; stimuli; sufficient statistics Δ(r)] [sent-110, score-0.363]

65 Figure 1 (time axis in sec): The left panels show the true stimulus (green), MAP estimate (red) and the limiting optimal estimator $\hat\theta_\Delta := (J + C_\theta^{-1})^{-1}\Delta(r)$ (blue) for various population sizes n. [sent-133, score-0.659]

66 The middle panels show the spike trains used to compute these estimates. [sent-134, score-0.28]

67 The right panels show the sufficient statistics Δ(r) used to compute $\hat\theta_\Delta$. [sent-135, score-0.042]

68 Note that the same true stimulus was used in all three simulations. [sent-136, score-0.206]

69 As n increases, the linear decoder converges to the MAP estimate, despite the nonlinear and correlated nature of the network model generating the spike trains (see main text for details). [sent-137, score-0.419]

70 In addition, $C_\theta^{-1}$ is tridiagonal in this state-space setting; thus the linear matrix equation in eq. [sent-139, score-0.047]

71 (4) can be solved efficiently in O(T) time using standard tridiagonal matrix solvers. [sent-140, score-0.047]
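The O(T) solve can be sketched with the classical Thomas algorithm, here in Python with NumPy. This is a generic tridiagonal solver, not the authors' code. It assumes a scalar AR(1) prior θ_{t+1} = a θ_t + N(0, q) with θ_0 ~ N(0, p0), whose inverse covariance is tridiagonal, and a diagonal J; the helper names and all numerical values are hypothetical.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(T); lower[t] = A[t+1, t], upper[t] = A[t, t+1]."""
    T = len(diag)
    c = np.zeros(T - 1)
    d = np.zeros(T)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for t in range(1, T):                          # forward elimination
        denom = diag[t] - lower[t - 1] * c[t - 1]
        if t < T - 1:
            c[t] = upper[t] / denom
        d[t] = (rhs[t] - lower[t - 1] * d[t - 1]) / denom
    x = np.zeros(T)
    x[-1] = d[-1]
    for t in range(T - 2, -1, -1):                 # back substitution
        x[t] = d[t] - c[t] * x[t + 1]
    return x

def ar1_precision(T, a, q, p0):
    """Diagonal and off-diagonal of the inverse prior covariance of a scalar AR(1) path."""
    diag = np.full(T, (1.0 + a ** 2) / q)
    diag[0] = 1.0 / p0 + a ** 2 / q
    diag[-1] = 1.0 / q
    off = np.full(T - 1, -a / q)
    return diag, off

# Illustrative problem (sizes and parameters are assumptions).
rng = np.random.default_rng(4)
T, a, q, p0 = 200, 0.95, 0.1, 1.0
J_diag = rng.uniform(0.5, 1.5, T)                  # diagonal J
delta = rng.standard_normal(T)                     # stand-in for Delta(r)
prior_diag, prior_off = ar1_precision(T, a, q, p0)

# Solve (J + C^{-1}) theta_hat = Delta(r) without forming a dense matrix.
theta_hat = thomas_solve(prior_off, prior_diag + J_diag, prior_off, delta)
```

A banded solver from a numerical library would do the same job; the explicit loop is shown only to make the linear cost in T visible.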

72 We conclude with a few comments about these results. [sent-142, score-0.03]

73 First, note that the covariance matrix J we have computed here coincides almost exactly with what we computed previously in the Poisson case. [sent-143, score-0.094]

74 Indeed, we can make this connection much more precise: we can always choose an equivalent Poisson network with rates defined so that the $E_{r|\theta=0}[(f_i')^2/f_i]$ term in the non-Poisson network matches the $(f_i')^2/f_i$ term in the Poisson network. [sent-144, score-0.142]

75 Since J determines the information rate completely, we conclude that for any weakly-coupled network there is an equivalent Poisson network which conveys exactly the same information in the intermediate regime. [sent-145, score-0.225]

76 However, note that the sufficient statistic Δ(r) is different in the Poisson and non-Poisson settings, since the $f'/f$ term linearly reweights the observed spikes, depending on how likely they were given the history; thus the optimal Bayesian decoder incorporates non-Poisson effects explicitly. [sent-146, score-0.297]

77 For example, while we expect a LLN and CLT to continue to hold in many cases of strong, structured interneuronal coupling, computing the asymptotic mean and covariance of the sufﬁcient statistic ∆(r) may be more challenging in such cases, and new phenomena may arise. [sent-148, score-0.348]

78 Could information theory provide an ecological theory of sensory processing? [sent-151, score-0.041]

79 A statistical paradigm for neural spike train decoding applied to position prediction from ensemble ﬁring patterns of rat hippocampal place cells. [sent-183, score-0.546]

80 Population decoding of motor cortical activity using a generalized linear model with hidden states. [sent-217, score-0.222]

81 A new look at state-space models for neural data. [sent-238, score-0.055]

82 Correlations and the encoding of information in the nervous system. [sent-245, score-0.075]

83 Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. [sent-251, score-0.179]

84 Mean-ﬁeld approximations for coupled populations of generalized linear model spiking neurons with Markov refractoriness. [sent-288, score-0.295]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ki', 0.448), ('bi', 0.294), ('population', 0.257), ('fi', 0.216), ('stimulus', 0.206), ('ri', 0.204), ('decoding', 0.187), ('spike', 0.179), ('dt', 0.168), ('regime', 0.15), ('shannon', 0.147), ('statistic', 0.14), ('neuron', 0.138), ('populations', 0.117), ('loglikelihood', 0.108), ('neurons', 0.107), ('covr', 0.107), ('poisson', 0.101), ('decoder', 0.088), ('intermediate', 0.085), ('interneuronal', 0.08), ('diag', 0.077), ('encoding', 0.075), ('limiting', 0.075), ('asymptotic', 0.075), ('spiking', 0.071), ('ref', 0.071), ('orientation', 0.061), ('wji', 0.061), ('rahnama', 0.061), ('neuroscience', 0.061), ('trains', 0.059), ('stimuli', 0.057), ('jt', 0.056), ('neural', 0.055), ('refractory', 0.055), ('rad', 0.055), ('ring', 0.055), ('nonlinear', 0.054), ('asymptotically', 0.054), ('lters', 0.053), ('lln', 0.053), ('ole', 0.053), ('covariance', 0.053), ('snr', 0.049), ('train', 0.049), ('map', 0.048), ('clt', 0.047), ('tridiagonal', 0.047), ('gaussian', 0.044), ('mt', 0.043), ('limt', 0.043), ('place', 0.043), ('panels', 0.042), ('questions', 0.042), ('estimator', 0.042), ('kalman', 0.041), ('sensory', 0.041), ('sec', 0.041), ('response', 0.041), ('coding', 0.041), ('instantaneously', 0.041), ('coincides', 0.041), ('suf', 0.04), ('synaptic', 0.04), ('network', 0.039), ('bayesian', 0.039), ('ahmadian', 0.038), ('ht', 0.038), ('optimal', 0.037), ('sensitivity', 0.037), ('gi', 0.036), ('responses', 0.036), ('asymptotics', 0.035), ('activity', 0.035), ('likelihood', 0.035), ('hi', 0.035), ('coupling', 0.035), ('res', 0.034), ('dim', 0.034), ('analytical', 0.034), ('lter', 0.034), ('hippocampal', 0.033), ('tuning', 0.032), ('rate', 0.032), ('period', 0.032), ('term', 0.032), ('summarized', 0.032), ('paninski', 0.031), ('elds', 0.031), ('limit', 0.03), ('conclude', 0.03), ('informative', 0.029), ('er', 0.028), ('curvature', 0.028), ('fisher', 0.028), ('clarity', 0.028), ('transformation', 0.027), ('law', 0.027), ('fourier', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

2 0.29089019 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation conﬁrms the method’s applicability to learning both stationary and temporal spike patterns. 1

3 0.22383904 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian ﬁltering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We ﬁnd that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1

4 0.21213241 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

Author: Yan Karklin, Eero P. Simoncelli

Abstract: Efﬁcient coding provides a powerful principle for explaining early sensory coding. Most attempts to test this principle have been limited to linear, noiseless models, and when applied to natural images, have yielded oriented ﬁlters consistent with responses in primary visual cortex. Here we show that an efﬁcient coding model that incorporates biologically realistic ingredients – input and output noise, nonlinear response functions, and a metabolic cost on the ﬁring rate – predicts receptive ﬁelds and response nonlinearities similar to those observed in the retina. Speciﬁcally, we develop numerical methods for simultaneously learning the linear ﬁlters and response nonlinearities of a population of model neurons, so as to maximize information transmission subject to metabolic costs. When applied to an ensemble of natural images, the method yields ﬁlters that are center-surround and nonlinearities that are rectifying. The ﬁlters are organized into two populations, with On- and Off-centers, which independently tile the visual space. As observed in the primate retina, the Off-center neurons are more numerous and have ﬁlters with smaller spatial extent. In the absence of noise, our method reduces to a generalized version of independent components analysis, with an adapted nonlinear “contrast” function; in this case, the optimal ﬁlters are localized and oriented.

5 0.192121 24 nips-2011-Active learning of neural response functions with Gaussian processes

Author: Mijung Park, Greg Horwitz, Jonathan W. Pillow

Abstract: A sizeable literature has focused on the problem of estimating a low-dimensional feature space for a neuron’s stimulus sensitivity. However, comparatively little work has addressed the problem of estimating the nonlinear function from feature space to spike rate. Here, we use a Gaussian process (GP) prior over the inﬁnitedimensional space of nonlinear functions to obtain Bayesian estimates of the “nonlinearity” in the linear-nonlinear-Poisson (LNP) encoding model. This approach offers increased ﬂexibility, robustness, and computational tractability compared to traditional methods (e.g., parametric forms, histograms, cubic splines). We then develop a framework for optimal experimental design under the GP-Poisson model using uncertainty sampling. This involves adaptively selecting stimuli according to an information-theoretic criterion, with the goal of characterizing the nonlinearity with as little experimental data as possible. Our framework relies on a method for rapidly updating hyperparameters under a Gaussian approximation to the posterior. We apply these methods to neural data from a color-tuned simple cell in macaque V1, characterizing its nonlinear response function in the 3D space of cone contrasts. We ﬁnd that it combines cone inputs in a highly nonlinear manner. With simulated experiments, we show that optimal design substantially reduces the amount of data required to estimate these nonlinear combination rules. 1

6 0.18679602 86 nips-2011-Empirical models of spiking in neural populations

7 0.18433948 219 nips-2011-Predicting response time and error rates in visual search

8 0.17695084 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

9 0.16688529 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations

10 0.16075684 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

11 0.15876812 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

12 0.14776637 249 nips-2011-Sequence learning with hidden units in spiking neural networks

13 0.14318728 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

14 0.12279493 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

15 0.12077452 75 nips-2011-Dynamical segmentation of single trials from population neural data

16 0.11476921 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

17 0.11398854 13 nips-2011-A blind sparse deconvolution method for neural spike identification

18 0.11366085 23 nips-2011-Active dendrites: adaptation to spike-based communication

19 0.11156512 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

20 0.10760715 303 nips-2011-Video Annotation and Tracking with Active Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.236), (1, 0.121), (2, 0.371), (3, -0.054), (4, 0.117), (5, 0.055), (6, -0.084), (7, -0.021), (8, -0.035), (9, 0.043), (10, -0.029), (11, 0.084), (12, -0.046), (13, 0.04), (14, 0.023), (15, 0.08), (16, 0.01), (17, 0.058), (18, 0.028), (19, -0.041), (20, -0.043), (21, -0.093), (22, 0.111), (23, -0.007), (24, -0.047), (25, -0.085), (26, -0.156), (27, -0.027), (28, 0.033), (29, 0.004), (30, -0.037), (31, 0.024), (32, 0.006), (33, 0.049), (34, 0.078), (35, -0.008), (36, 0.077), (37, 0.059), (38, 0.018), (39, 0.122), (40, 0.079), (41, 0.022), (42, 0.061), (43, -0.087), (44, 0.018), (45, 0.023), (46, 0.084), (47, -0.003), (48, 0.038), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96450937 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

2 0.79852593 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

Author: Julie Dethier, Paul Nuyujukian, Chris Eliasmith, Terrence C. Stewart, Shauki A. Elasaad, Krishna V. Shenoy, Kwabena A. Boahen

Abstract: Motor prostheses aim to restore function to disabled patients. Despite compelling proof of concept systems, barriers to clinical translation remain. One challenge is to develop a low-power, fully-implantable system that dissipates only minimal power so as not to damage tissue. To this end, we implemented a Kalman-filter based decoder via a spiking neural network (SNN) and tested it in brain-machine interface (BMI) experiments with a rhesus monkey. The Kalman filter was trained to predict the arm’s velocity and mapped on to the SNN using the Neural Engineering Framework (NEF). A 2,000-neuron embedded Matlab SNN implementation runs in real-time and its closed-loop performance is quite comparable to that of the standard Kalman filter. The success of this closed-loop decoder holds promise for hardware SNN implementations of statistical signal processing algorithms on neuromorphic chips, which may offer power savings necessary to overcome a major obstacle to the successful clinical translation of neural motor prostheses. ∗ Present: Research Fellow F.R.S.-FNRS, Systmod Unit, University of Liege, Belgium. 1 Cortically-controlled motor prostheses: the challenge Motor prostheses aim to restore function for severely disabled patients by translating neural signals from the brain into useful control signals for prosthetic limbs or computer cursors. Several proof of concept demonstrations have shown encouraging results, but barriers to clinical translation still remain. One example is the development of a fully-implantable system that meets power dissipation constraints, but is still powerful enough to perform complex operations. A recently reported closed-loop cortically-controlled motor prosthesis is capable of producing quick, accurate, and robust computer cursor movements by decoding neural signals (threshold-crossings) from a 96-electrode array in rhesus macaque premotor/motor cortex [1]-[4].
This, and previous designs (e.g., [5]), employ versions of the Kalman filter, ubiquitous in statistical signal processing. Such a filter and its variants are the state-of-the-art decoder for brain-machine interfaces (BMIs) in humans [5] and monkeys [2]. While these recent advances are encouraging, clinical translation of such BMIs requires fully-implanted systems, which in turn impose severe power dissipation constraints. Even though it is an open, actively-debated question as to how much of the neural prosthetic system must be implanted, we note that there are no reports to date demonstrating a fully implantable 100-channel wireless transmission system, motivating performing decoding within the implanted chip. This computation is constrained by a stringent power budget: A 6 × 6 mm² implant must dissipate less than 10 mW to avoid heating the brain by more than 1 °C [6], which is believed to be important for long term cell health. With this power budget, current approaches cannot scale to higher electrode densities or to substantially more computer-intensive decode/control algorithms. The feasibility of mapping a Kalman-filter based decoder algorithm [1]-[4] on to a spiking neural network (SNN) has been explored off-line (open-loop). In these off-line tests, the SNN’s performance virtually matched that of the standard implementation [7]. These simulations provide confidence that this algorithm, and others similar to it, could be implemented using an ultra-low-power approach potentially capable of meeting the severe power constraints set by clinical translation. This neuromorphic approach uses very-large-scale integrated systems containing microelectronic analog circuits to morph neural systems into silicon chips [8, 9]. These neuromorphic circuits may yield tremendous power savings (50 nW per silicon neuron [10]) over digital circuits because they use physical operations to perform mathematical computations (analog approach).
When implemented on a chip designed using the neuromorphic approach, a 2,000-neuron SNN can consume as little as 100 µW. Demonstrating this approach's feasibility in a closed-loop system running in real-time is a key, non-incremental step in the development of a fully implantable decoding chip, and is necessary before proceeding with fabricating and implanting the chip. As noise, delay, and over-fitting play a more important role in the closed-loop setting, it is not obvious that the SNN's stellar open-loop performance will hold up. In addition, performance criteria are different in the closed-loop and open-loop settings (e.g., time per target vs. root mean squared error). Therefore, a SNN of a different size may be required to meet the desired specifications.

Here we present results and assess the performance and viability of the SNN Kalman-filter based decoder in real-time, closed-loop tests, with the monkey performing a center-out-and-back target acquisition task. To achieve closed-loop operation, we developed an embedded Matlab implementation that ran a 2,000-neuron version of the SNN in real-time on a PC. We achieved almost a 50-fold speed-up by performing part of the computation in a lower-dimensional space defined by the formal method we used to map the Kalman filter on to the SNN. This shortcut allowed us to run a larger SNN in real-time than would otherwise be possible.

2 Spiking neural network mapping of control theory algorithms

As reported in [11], a formal methodology, called the Neural Engineering Framework (NEF), has been developed to map control-theory algorithms onto a computational fabric consisting of a highly heterogeneous population of spiking neurons simply by programming the strengths of their connections.
These artificial neurons are characterized by a nonlinear multi-dimensional-vector-to-spike-rate function, $a_i(x(t))$ for the $i$th neuron, with parameters (preferred direction, maximum firing rate, and spiking threshold) drawn randomly from a wide distribution (standard deviation ≈ mean).

[Figure 1 panels. Representation: $x \to a_i(x) \to \hat{x} = \sum_i a_i(x)\phi_i^x$, with $a_i(x) = G(\alpha_i \tilde{\phi}_i^x \cdot x + J_i^{bias})$; axes: spike rate (spikes/s) vs. stimulus $x$. Transformation: $y = Ax \to b_j(A\hat{x})$, with $A\hat{x} = \sum_i a_i(x) A\phi_i^x$. Dynamics: $\dot{x} = Ax \to x = h * A'x$, with $A' = \tau A + I$.]

Figure 1: NEF's three principles. Representation. 1D tuning curves of a population of 50 leaky integrate-and-fire neurons. The neurons' tuning curves map control variables ($x$) to spike rates ($a_i(x)$); this nonlinear transformation is inverted by linear weighted decoding. $G()$ is the neurons' nonlinear current-to-spike-rate function. Transformation. SNN with populations $b_k(t)$ and $a_j(t)$ representing $y(t)$ and $x(t)$. Feedforward and recurrent weights are determined by $B'$ and $A'$, as described next. Dynamics. The system's dynamics is captured in a neurally plausible fashion by replacing integration with the synapses' spike response, $h(t)$, and replacing the matrices with $A' = \tau A + I$ and $B' = \tau B$ to compensate.

The neural engineering approach to configuring SNNs to perform arbitrary computations rests on three principles (Figure 1) [11]-[14]:

Representation is defined by nonlinear encoding of $x(t)$ as a spike rate, $a_i(x(t))$, represented by the neuron tuning curve, combined with optimal weighted linear decoding of $a_i(x(t))$ to recover an estimate of $x(t)$, $\hat{x}(t) = \sum_i a_i(x(t))\phi_i^x$, where $\phi_i^x$ are the decoding weights.

Transformation is performed by using alternate decoding weights in the decoding operation to map transformations of $x(t)$ directly into transformations of $a_i(x(t))$.
For example, $y(t) = Ax(t)$ is represented by the spike rates $b_j(A\hat{x}(t))$, where unit $j$'s input is computed directly from unit $i$'s output using $A\hat{x}(t) = \sum_i a_i(x(t)) A\phi_i^x$, an alternative linear weighting.

Dynamics brings the first two principles together and adds the time dimension to the circuit. This principle aims at reuniting the control-theory and neural levels by modifying the matrices to render the system neurally plausible, thereby permitting the synapses' spike response, $h(t)$ (i.e., impulse response), to capture the system's dynamics. For example, for $h(t) = \tau^{-1} e^{-t/\tau}$, $\dot{x} = Ax(t)$ is realized by replacing $A$ with $A' = \tau A + I$. This so-called neurally plausible matrix yields an equivalent dynamical system: $x(t) = h(t) * A'x(t)$, where convolution replaces integration.

The nonlinear encoding process, from a multi-dimensional stimulus, $x(t)$, to a one-dimensional soma current, $J_i(x(t))$, to a firing rate, $a_i(x(t))$, is specified as:

$a_i(x(t)) = G(J_i(x(t)))$. (1)

Here $G$ is the neurons' nonlinear current-to-spike-rate function, which is given by

$G(J_i(x)) = \left[\tau^{ref} - \tau^{RC} \ln\left(1 - J_{th}/J_i(x)\right)\right]^{-1}$, (2)

for the leaky integrate-and-fire (LIF) model. The LIF neuron has two behavioral regimes: sub-threshold and super-threshold. The sub-threshold regime is described by an RC circuit with time constant $\tau^{RC}$. When the sub-threshold soma voltage reaches the threshold, $V_{th}$, the neuron emits a spike $\delta(t - t_n)$. After this spike, the neuron is reset and rests for $\tau^{ref}$ seconds (absolute refractory period) before it resumes integrating. $J_{th} = V_{th}/R$ is the minimum input current that produces spiking. Ignoring the soma's RC time constant when specifying the SNN's dynamics is reasonable because the neurons cross threshold at a rate that is proportional to their input current, which thus sets the spike rate instantaneously, without any filtering [11].
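The LIF rate function of Equation 2 can be illustrated with a minimal Python sketch. This is not the authors' embedded Matlab code: the function name `lif_rate`, the normalized threshold current `J_th = 1`, and the default time constants (1 ms refractory period and 20 ms RC time constant, the values listed in Table 1) are assumptions chosen for the example.

```python
import math


def lif_rate(J, J_th=1.0, tau_ref=0.001, tau_rc=0.020):
    """Steady-state LIF firing rate (spikes/s) for a constant soma current J.

    Implements Equation 2; at or below the threshold current the neuron
    never spikes, so the rate is zero. All defaults are illustrative.
    """
    if J <= J_th:
        return 0.0
    return 1.0 / (tau_ref - tau_rc * math.log(1.0 - J_th / J))


# The rate is zero up to threshold, grows with the current,
# and can never exceed 1 / tau_ref.
for J in (0.5, 1.0, 1.5, 3.0, 100.0):
    print(f"J = {J:6.1f}  ->  rate = {lif_rate(J):7.1f} spikes/s")
```

The refractory period bounds the rate at 1/τ^ref (1000 spikes/s with these assumed values), which is why the curve saturates for large currents.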
The conversion from a multi-dimensional stimulus, $x(t)$, to a one-dimensional soma current, $J_i$, is performed by assigning to the neuron a preferred direction, $\tilde{\phi}_i^x$, in the stimulus space and taking the dot-product:

$J_i(x(t)) = \alpha_i \tilde{\phi}_i^x \cdot x(t) + J_i^{bias}$, (3)

where $\alpha_i$ is a gain or conversion factor, and $J_i^{bias}$ is a bias current that accounts for background activity. For a 1D space, $\tilde{\phi}_i^x$ is either +1 or −1 (drawn randomly), for ON and OFF neurons, respectively. The resulting tuning curves are illustrated in Figure 1, left.

The linear decoding process is characterized by the synapses' spike response, $h(t)$ (i.e., post-synaptic currents), and the decoding weights, $\phi_i^x$, which are obtained by minimizing the mean square error. A single noise term, $\eta$, takes into account all sources of noise, which have the effect of introducing uncertainty into the decoding process. Hence, the transmitted firing rate can be written as $a_i(x(t)) + \eta_i$, where $a_i(x(t))$ represents the noiseless set of tuning curves and $\eta_i$ is a random variable drawn from a zero-mean Gaussian distribution with variance $\sigma^2$. Consequently, the mean square error can be written as [11]:

$E = \frac{1}{2} \left\langle [x(t) - \hat{x}(t)]^2 \right\rangle_{x,\eta,t} = \frac{1}{2} \left\langle \left[ x(t) - \sum_i (a_i(x(t)) + \eta_i)\phi_i^x \right]^2 \right\rangle_{x,\eta,t}$, (4)

where $\langle \cdot \rangle_{x,\eta}$ denotes integration over the range of $x$ and $\eta$, the expected noise. We assume that the noise is independent and has the same variance for each neuron [11], which yields:

$E = \frac{1}{2} \left\langle \left[ x(t) - \sum_i a_i(x(t))\phi_i^x \right]^2 \right\rangle_{x,t} + \frac{1}{2} \sigma^2 \sum_i (\phi_i^x)^2$, (5)

where $\sigma^2$ is the noise variance, $\langle \eta_i \eta_j \rangle = \sigma^2 \delta_{ij}$. This expression is minimized by:

$\phi_i^x = \sum_j^N \Gamma_{ij}^{-1} \Upsilon_j$, (6)

with $\Gamma_{ij} = \langle a_i(x) a_j(x) \rangle_x + \sigma^2 \delta_{ij}$, where $\delta$ is the Kronecker delta function matrix, and $\Upsilon_j = \langle x a_j(x) \rangle_x$ [11]. One consequence of modeling noise in the neural representation is that the matrix $\Gamma$ is invertible despite the use of a highly overcomplete representation.
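The encoding of Equation 3 and the regularized least-squares decoders of Equation 6 can be sketched for a small 1D pool as follows. This is a hedged illustration in pure Python, not the implementation used in the paper: the pool size, the ranges for the gains and bias currents, the choice of noise level as 10% of the peak rate, and the helper names `lif_rate` and `solve` are all assumptions made for the example.

```python
import math
import random


def lif_rate(J, J_th=1.0, tau_ref=0.001, tau_rc=0.020):
    """LIF rate of Equation 2; zero at or below the threshold current."""
    if J <= J_th:
        return 0.0
    return 1.0 / (tau_ref - tau_rc * math.log(1.0 - J_th / J))


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


random.seed(0)
N = 20                                        # pool size (hypothetical)
xs = [i / 50.0 - 1.0 for i in range(101)]     # sample points on [-1, 1]
M = len(xs)

# Random 1D tuning curves a_i(x) = G(alpha_i * e_i * x + J_bias_i), Equations 1-3.
enc = [random.choice((-1.0, 1.0)) for _ in range(N)]    # ON / OFF neurons
gain = [random.uniform(1.0, 3.0) for _ in range(N)]     # alpha_i (assumed range)
bias = [random.uniform(0.5, 2.0) for _ in range(N)]     # J_bias_i (assumed range)
rates = [[lif_rate(gain[i] * enc[i] * x + bias[i]) for x in xs] for i in range(N)]

# Gamma_ij = <a_i a_j>_x + sigma^2 delta_ij and Upsilon_j = <x a_j>_x (Equation 6).
max_rate = max(max(r) for r in rates)
sigma2 = (0.1 * max_rate) ** 2                # assumed noise variance
Gamma = [[sum(rates[i][m] * rates[j][m] for m in range(M)) / M
          + (sigma2 if i == j else 0.0) for j in range(N)] for i in range(N)]
Upsilon = [sum(xs[m] * rates[j][m] for m in range(M)) / M for j in range(N)]
phi = solve(Gamma, Upsilon)                   # decoding weights phi_i^x

# Decode x_hat = sum_i a_i(x) phi_i and compare with decoding nothing at all.
x_hat = [sum(rates[i][m] * phi[i] for i in range(N)) for m in range(M)]
rmse = math.sqrt(sum((xs[m] - x_hat[m]) ** 2 for m in range(M)) / M)
baseline = math.sqrt(sum(x * x for x in xs) / M)
print(f"decoding RMSE = {rmse:.4f} (all-zero decoder: {baseline:.4f})")
```

The σ² term on the diagonal of Γ is what keeps the linear system well conditioned here, which mirrors the invertibility remark above.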
In a noiseless representation, $\Gamma$ is generally singular because, due to the large number of neurons, there is a high probability of having two neurons with similar tuning curves leading to two similar rows in $\Gamma$.

3 Kalman-filter based cortical decoder

In the 1960s, Kalman described a method that uses linear filtering to track the state of a dynamical system throughout time using a model of the dynamics of the system as well as noisy measurements [15]. The model dynamics gives an estimate of the state of the system at the next time step. This estimate is then corrected using the observations (i.e., measurements) at this time step. The relative weights for these two pieces of information are given by the Kalman gain, $K$ [15, 16]. Whereas the Kalman gain is updated at each iteration, the state and observation matrices (defined below), and the corresponding noise matrices, are assumed constant.

In the case of prosthetic applications, the system's state vector is the cursor's kinematics, $x_t = [vel_t^x, vel_t^y, 1]$, where the constant 1 allows for a fixed offset compensation. The measurement vector, $y_t$, is the neural spike rate (spike counts in each time step) of 192 channels of neural threshold crossings. The system's dynamics is modeled by:

$x_t = Ax_{t-1} + w_t$, (7)
$y_t = Cx_t + q_t$, (8)

where $A$ is the state matrix, $C$ is the observation matrix, and $w_t$ and $q_t$ are additive, Gaussian noise sources with $w_t \sim \mathcal{N}(0, W)$ and $q_t \sim \mathcal{N}(0, Q)$. The model parameters ($A$, $C$, $W$ and $Q$) are fit with training data by correlating the observed hand kinematics with the simultaneously measured neural signals (Figure 2). For efficient decoding, we derived the steady-state update equation by replacing the adaptive Kalman gain by its steady-state formulation: $K = (I + W C^T Q^{-1} C)^{-1} W C^T Q^{-1}$.
This yields the following estimate of the system's state:

$x_t = (I - KC)Ax_{t-1} + Ky_t = M_x^{DT} x_{t-1} + M_y^{DT} y_t$, (9)

where $M_x^{DT} = (I - KC)A$ and $M_y^{DT} = K$ are the discrete time (DT) Kalman matrices. The steady-state formulation improves efficiency with little loss in accuracy because the optimal Kalman gain rapidly converges (typically less than 100 iterations). Indeed, in neural applications under both open-loop and closed-loop conditions, the difference between the full Kalman filter and its steady-state implementation falls to within 1% in a few seconds [17]. This simplifying assumption reduces the execution time for decoding a typical neuronal firing rate signal approximately seven-fold [17], a critical speed-up for real-time applications.

[Figure 2 panels. a: neuron index vs. time (ms). b: x- and y-velocity (cm/s) vs. time (ms). c: cursor trajectories, 1 cm scale bar, trials 0034-0049.]

Figure 2: Neural and kinematic measurements (monkey J, 2011-04-16, 16 continuous trials) used to fit the standard Kalman filter model. a. The 192 cortical recordings fed as input to fit the Kalman filter's matrices (color code refers to the number of threshold crossings observed in each 50 ms bin). b. Hand x- and y-velocity measurements correlated with the neural data to obtain the Kalman filter's matrices. c. Cursor kinematics of 16 continuous trials under direct hand control.

4 Kalman filter with a spiking neural network

To implement the Kalman filter with a SNN by applying the NEF, we first convert Equation 9 from DT to continuous time (CT), and then replace the CT matrices with neurally plausible ones, which yields:

$x(t) = h(t) * \left( A'x(t) + B'y(t) \right)$, (10)

where $A' = \tau M_x^{CT} + I$, $B' = \tau M_y^{CT}$, with $M_x^{CT} = (M_x^{DT} - I)/\Delta t$ and $M_y^{CT} = M_y^{DT}/\Delta t$, the CT Kalman matrices, and $\Delta t = 50$ ms, the discrete time step; $\tau$ is the synaptic time constant.
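The chain of conversions behind Equation 10 (discrete time to continuous time to neurally plausible matrices) can be checked on a scalar toy model. The sketch below is only illustrative: the values of `A`, `C`, and `K` are hypothetical, the function name `neurally_plausible` is invented for the example, and the synaptic time constant of 20 ms is taken from the PSC time constant in Table 1.

```python
def neurally_plausible(M_x_dt, M_y_dt, dt=0.050, tau=0.020):
    """Map scalar discrete-time Kalman matrices to the NEF matrices of Equation 10."""
    M_x_ct = (M_x_dt - 1.0) / dt       # M_x^CT = (M_x^DT - I) / dt
    M_y_ct = M_y_dt / dt               # M_y^CT = M_y^DT / dt
    A_prime = tau * M_x_ct + 1.0       # A' = tau * M_x^CT + I
    B_prime = tau * M_y_ct             # B' = tau * M_y^CT
    return M_x_ct, M_y_ct, A_prime, B_prime


# Hypothetical scalar model: state matrix A, observation matrix C, steady-state gain K.
A, C, K = 0.95, 2.0, 0.1
M_x_dt = (1.0 - K * C) * A             # M_x^DT = (I - KC) A
M_y_dt = K                             # M_y^DT = K
M_x_ct, M_y_ct, A_prime, B_prime = neurally_plausible(M_x_dt, M_y_dt)

# One forward-Euler step of dx/dt = M_x^CT x + M_y^CT y over dt
# reproduces the discrete-time update of Equation 9.
x_prev, y_obs, dt, tau = 0.3, 1.2, 0.050, 0.020
x_euler = x_prev + dt * (M_x_ct * x_prev + M_y_ct * y_obs)
x_dt = M_x_dt * x_prev + M_y_dt * y_obs

print(f"A' = {A_prime:.4f}, B' = {B_prime:.4f}")
print(f"Euler step: {x_euler:.6f}, discrete update: {x_dt:.6f}")
```

For the exponential synapse $h(t) = \tau^{-1} e^{-t/\tau}$, the relation $x = h * (A'x + B'y)$ is equivalent to $\tau \dot{x} + x = A'x + B'y$, so $(A' - I)/\tau$ recovers $M_x^{CT}$ and $B'/\tau$ recovers $M_y^{CT}$.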
The $j$th neuron's input current (see Equation 3) is computed from the system's current state, $x(t)$, which is computed from estimates of the system's previous state ($\hat{x}(t) = \sum_i a_i(t)\phi_i^x$) and current input ($\hat{y}(t) = \sum_k b_k(t)\phi_k^y$) using Equation 10. This yields:

$J_j(x(t)) = \alpha_j \tilde{\phi}_j^x \cdot x(t) + J_j^{bias}$
$= \alpha_j \tilde{\phi}_j^x \cdot h(t) * \left( A'\hat{x}(t) + B'\hat{y}(t) \right) + J_j^{bias}$
$= \alpha_j \tilde{\phi}_j^x \cdot h(t) * \left( A' \sum_i a_i(t)\phi_i^x + B' \sum_k b_k(t)\phi_k^y \right) + J_j^{bias}$ (11)

This last equation can be written in a neural network form:

$J_j(x(t)) = h(t) * \left( \sum_i \omega_{ji} a_i(t) + \sum_k \omega_{jk} b_k(t) \right) + J_j^{bias}$ (12)

where $\omega_{ji} = \alpha_j \tilde{\phi}_j^x A' \phi_i^x$ and $\omega_{jk} = \alpha_j \tilde{\phi}_j^x B' \phi_k^y$ are the recurrent and feedforward weights, respectively.

5 Efficient implementation of the SNN

In this section, we describe the two distinct steps carried out when implementing the SNN: creating and running the network. The first step has no computational constraints whereas the second must be very efficient in order to be successfully deployed in the closed-loop experimental setting.

Figure 3: Computing a 1000-neuron pool's recurrent connections. a. Using connection weights requires multiplying a 1000 × 1000 matrix by a 1000 × 1 vector. b. Operating in the lower-dimensional state space requires multiplying a 1 × 1000 vector by a 1000 × 1 vector to get the decoded state, multiplying this state by a component of the $A'$ matrix to update it, and multiplying the updated state by a 1000 × 1 vector to re-encode it as firing rates, which are then used to update the soma current for every neuron.

Network creation: This step generates, for a specified number of neurons composing the network, the gain $\alpha_j$, bias current $J_j^{bias}$, preferred direction $\tilde{\phi}_j^x$, and decoding weight $\phi_j^x$ for each neuron. The preferred directions $\tilde{\phi}_j^x$ are drawn randomly from a uniform distribution over the unit sphere.
The maximum firing rate, $\max G(J_j(x))$, and the normalized x-axis intercept, $G(J_j(x)) = 0$, are drawn randomly from a uniform distribution on [200, 400] Hz and [−1, 1], respectively. From these two specifications, $\alpha_j$ and $J_j^{bias}$ are computed using Equation 2 and Equation 3. The decoding weights $\phi_j^x$ are computed by minimizing the mean square error (Equation 6).

For efficient implementation, we used two 1D integrators (i.e., two recurrent neuron pools, with each pool representing a scalar) rather than a single 3D integrator (i.e., one recurrent neuron pool, with the pool representing a 3D vector by itself) [13]. The constant 1 is fed to the 1D integrators as an input, rather than continuously integrated as part of the state vector. We also replaced the $b_k(t)$ units' spike rates (Figure 1, middle) with the 192 neural measurements (spike counts in 50 ms bins), which is equivalent to choosing $\phi_k^y$ from a standard basis (i.e., a unit vector with 1 at the $k$th position and 0 everywhere else) [7].

Network simulation: This step runs the simulation to update the soma current for every neuron, based on input spikes. The soma voltage is then updated following RC circuit dynamics. Gaussian noise is normally added at this step, the rest of the simulation being noiseless. Neurons with soma voltage above threshold generate a spike and enter their refractory period. The neuron firing rates are decoded using the linear decoding weights to get the updated state values, x- and y-velocity. These values are smoothed with a filter identical to $h(t)$, but with $\tau$ set to 5 ms instead of 20 ms to avoid introducing significant delay. Then the simulation step starts over again.

In order to ensure rapid execution of the simulation step, neuron interactions are not updated directly using the connection matrix (Equation 12), but rather indirectly with the decoding matrix $\phi_j^x$, dynamics matrix $A'$, and preferred direction matrix $\tilde{\phi}_j^x$ (Equation 11).
To see why this is more efficient, suppose we have 1000 neurons in the $a$ population for each of the state vector's two scalars. Computing the recurrent connections using connection weights requires multiplying a 1000 × 1000 matrix by a 1000-dimensional vector (Figure 3a). This requires $10^6$ multiplications and about $10^6$ sums. Decoding each scalar (i.e., $\sum_i a_i(t)\phi_i^x$), however, requires only 1000 multiplications and 1000 sums. The decoded state vector is then updated by multiplying it by the (diagonal) $A'$ matrix, another 2 products and 1 sum. The updated state vector is then encoded by multiplying it with the neurons' preferred direction vectors, another 1000 multiplications per scalar (Figure 3b). The resulting total of about 3000 operations is nearly three orders of magnitude fewer than using the connection weights to compute the identical transformation.

To measure the speedup, we simulated a 2,000-neuron network on a computer running Matlab 2011a (Intel Core i7, 2.7 GHz, Mac OS X Lion). Although the exact run-times depend on the computing hardware and software, the run-time reduction factor should remain approximately constant across platforms. For each reported result, we ran the simulation 10 times to obtain a reliable estimate of the execution time. The run-time for neuron interactions using the recurrent connection weights was 9.9 ms and dropped to 2.7 µs in the lower-dimensional space, approximately a 3,500-fold speedup. Only the recurrent interactions benefit from the speedup, the execution time for the rest of the operations remaining constant.
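The equivalence between the two routes compared above can be verified numerically on a small pool. The following pure-Python sketch is an illustration under stated assumptions, not the embedded Matlab implementation: the pool size, the random gains, decoders, and rates, and the scalar value of `A_prime` are hypothetical, and the synaptic filter $h(t)$ and bias currents are left out because they are identical in both routes.

```python
import random

random.seed(1)
N = 200                      # neurons in one scalar pool (hypothetical size)
A_prime = 0.9                # scalar entry of A' (hypothetical value)

# Precomputed product alpha_j * e_j of gain and 1D preferred direction.
gain_enc = [random.uniform(0.5, 2.0) * random.choice((-1.0, 1.0)) for _ in range(N)]
phi = [random.uniform(-1e-3, 1e-3) for _ in range(N)]    # decoding weights phi_i
a = [random.uniform(0.0, 300.0) for _ in range(N)]       # firing rates a_i(t)

# (a) Full connection weights omega_ji = alpha_j e_j A' phi_i (Equation 12):
#     an N x N matrix-vector product, N * N multiplications per update.
omega = [[gain_enc[j] * A_prime * phi[i] for i in range(N)] for j in range(N)]
drive_full = [sum(omega[j][i] * a[i] for i in range(N)) for j in range(N)]

# (b) Lower-dimensional route (Equation 11): decode, apply A', re-encode.
x_hat = sum(phi[i] * a[i] for i in range(N))             # N multiplications
x_new = A_prime * x_hat                                  # 1 multiplication
drive_low = [gain_enc[j] * x_new for j in range(N)]      # N multiplications

max_diff = max(abs(f - l) for f, l in zip(drive_full, drive_low))
mults_full = N * N
mults_low = N + 1 + N
print(f"max difference between routes: {max_diff:.2e}")
print(f"multiplications per update: {mults_full} (weights) vs {mults_low} (state space)")
```

Both routes give the same recurrent drive up to floating-point rounding, while the multiplication count grows quadratically with pool size in the first case and linearly in the second.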
The run-time for a 50 ms network simulation using the recurrent connection weights was 0.94 s and dropped to 0.0198 s in the lower-dimensional space, a 47-fold speedup. These results demonstrate the efficiency the lower-dimensional space offers, which made the closed-loop application of SNNs possible.

Table 1: Model parameters

Symbol                  Range                           Description
$\max G(J_j(x))$        200-400 Hz                      Maximum firing rate
$G(J_j(x)) = 0$         −1 to 1                         Normalized x-axis intercept
$J_j^{bias}$            Satisfies first two             Bias current
$\alpha_j$              Satisfies first two             Gain factor
$\tilde{\phi}_j^x$      $\|\tilde{\phi}_j^x\| = 1$      Preferred-direction vector
$\sigma^2$              0.1                             Gaussian noise variance
$\tau_j^{RC}$           20 ms                           RC time constant
$\tau_j^{ref}$          1 ms                            Refractory period
$\tau_j^{PSC}$          20 ms                           PSC time constant

6 Closed-loop implementation

An adult male rhesus macaque (monkey J) was trained to perform a center-out-and-back reaching task for juice rewards to one of eight targets, with a 500 ms hold time (Figure 4a) [1]. All animal protocols and procedures were approved by the Stanford Institutional Animal Care and Use Committee. Hand position was measured using a Polaris optical tracking system at 60 Hz (Northern Digital Inc.). Neural data were recorded from two 96-electrode silicon arrays (Blackrock Microsystems) implanted in the dorsal pre-motor and motor cortex. These recordings (-4.5 RMS threshold crossing applied to each electrode's signal) yielded tuned activity for the direction and speed of arm movements. As detailed in [1], a standard Kalman filter model was fit by correlating the observed hand kinematics with the simultaneously measured neural signals, while the monkey moved his arm to acquire virtual targets (Figure 2). The resulting model was used in a closed-loop system to control an on-screen cursor in real-time (Figure 4a, Decoder block). A steady-state version of this model serves as the standard against which the SNN implementation's performance is compared. We built a SNN using the NEF methodology, based on the Kalman filter parameters derived above.
This SNN was then simulated on an xPC Target (Mathworks) x86 system (Dell T3400, Intel Core 2 Duo E8600, 3.33 GHz). It ran in closed-loop, replacing the standard Kalman filter as the decoder block in Figure 4a. The parameter values listed in Table 1 were used for the SNN implementation. We ensured that the time constants $\tau_i^{RC}$, $\tau_i^{ref}$, and $\tau_i^{PSC}$ were smaller than the implementation's time step (50 ms). Noise was not explicitly added. It arose naturally from the fluctuations produced by representing a scalar with filtered spike trains, which has been shown to have effects similar to Gaussian noise [11]. For the purpose of computing the linear decoding weights (i.e., $\Gamma$), we modeled the resulting noise as Gaussian with a variance of 0.1.

A 2,000-neuron version of the SNN-based decoder was tested in a closed-loop system, the largest network our embedded Matlab implementation could run in real-time. There were 1206 trials in total, among which 301 (center-outs only) were performed with the SNN and 302 with the standard (steady-state) Kalman filter. The block structure was randomized and interleaved, so that there is no behavioral bias present in the findings. 100 trials under hand control are used as a baseline comparison. Success corresponds to a target acquisition under 1500 ms, with 500 ms hold time. Success rates were higher than 99% on all blocks for the SNN implementation and 100% for the standard Kalman filter. The average time to acquire the target was slightly slower for the SNN (Figure 5b): 711 ms vs. 661 ms for the standard Kalman filter. We believe this could be improved by using more neurons in the SNN.¹ The average distance to target (Figure 5a) and the average velocity of the cursor (Figure 5c) are very similar.

¹ Off-line, the SNN performed better as we increased the number of neurons [7].
[Figure 4 panels. a: block diagram (Neural Spikes, Decoder, Cursor Velocity). b: BMI: Kalman decoder. c: BMI: SNN decoder. Scale bars: 1 cm. Trial ranges shown: 2056-2071 and 1748-1763.]

Figure 4: Experimental setup and results. a. Data are recorded from two 96-channel silicon electrode arrays implanted in dorsal pre-motor and motor cortex of an adult male monkey performing a center-out-and-back reach task for juice rewards to one of eight targets with a 500 ms hold time. b. BMI position kinematics of 16 continuous trials for the standard Kalman filter implementation. c. BMI position kinematics of 16 continuous trials for the SNN implementation.

[Figure 5 panels. a: Mean distance to target; distance to target (cm) vs. time after target onset (ms). b: Target acquisition time histogram; percent of trials (%) vs. target acquire time (ms). c: Mean cursor velocity; cursor velocity (cm/s) vs. time after target onset (ms). Legend: Standard Kalman filter, Hand, Spiking Neural Network.]

Figure 5: SNN (red) performance compared to standard Kalman filter (blue) (hand trials are shown for reference (yellow)). The SNN achieves results similar to those of the standard Kalman filter implementation (success rates are higher than 99% on all blocks). a. Plot of distance to target vs. time after target onset for different control modalities. The thicker traces represent the average time when the cursor first enters the acceptance window until successfully entering for the 500 ms hold time. b. Histogram of target acquisition time. c. Plot of mean cursor velocity vs. time.

7 Conclusions and future work

The SNN's performance was quite comparable to that produced by a standard Kalman filter implementation. The 2,000-neuron network had success rates higher than 99% on all blocks, with mean distance to target, target acquisition time, and mean cursor velocity curves very similar to the ones obtained with the standard implementation. Future work will explore whether these results extend to additional animals.
As the Kalman filter and its variants are the state-of-the-art in cortically-controlled motor prostheses [1]-[5], these simulations provide confidence that similar levels of performance can be attained with a neuromorphic system, which can potentially overcome the power constraints set by clinical applications. Our ultimate goal is to develop an ultra-low-power neuromorphic chip for prosthetic applications on to which control theory algorithms can be mapped using the NEF. As our next step in this direction, we will begin exploring this mapping with Neurogrid, a hardware platform with sixteen programmable neuromorphic chips that can simulate up to a million spiking neurons in real-time [9]. However, bandwidth limitations prevent Neurogrid from realizing random connectivity patterns. It can only connect each neuron to thousands of others if neighboring neurons share common inputs, just as they do in the cortex. Such columnar organization may be possible with NEF-generated networks if preferred-direction vectors are assigned topographically rather than randomly. Implementing this constraint effectively is a subject of ongoing research.

Acknowledgment

This work was supported in part by the Belgian American Education Foundation (J. Dethier), Stanford NIH Medical Scientist Training Program (MSTP) and Soros Fellowship (P. Nuyujukian), DARPA Revolutionizing Prosthetics program (N66001-06-C-8005, K. V. Shenoy), and two NIH Director's Pioneer Awards (DP1-OD006409, K. V. Shenoy; DPI-OD000965, K. Boahen).

References

[1] V. Gilja, Towards clinically viable neural prosthetic systems, Ph.D. Thesis, Department of Computer Science, Stanford University, 2010, pp 19–22 and pp 57–73.
[2] V. Gilja, P. Nuyujukian, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, A high-performance continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010.
[3] P. Nuyujukian, V. Gilja, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, Generalization and robustness of a continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010.
[4] V. Gilja, C.A. Chestek, I. Diester, J.M. Henderson, K. Deisseroth, and K.V. Shenoy, Challenges and opportunities for next-generation intra-cortically based neural prostheses, IEEE Transactions on Biomedical Engineering, 2011, in press.
[5] S.P. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia, Journal of Neural Engineering, vol. 5, 2008, pp 455–476.
[6] S. Kim, P. Tathireddy, R.A. Normann, and F. Solzbacher, Thermal impact of an active 3-D microelectrode array implanted in the brain, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 15, 2007, pp 493–501.
[7] J. Dethier, V. Gilja, P. Nuyujukian, S.A. Elassaad, K.V. Shenoy, and K. Boahen, Spiking neural network decoder for brain-machine interfaces, IEEE Engineering in Medicine & Biology Society Conference on Neural Engineering, Cancun, Mexico, 2011, pp 396–399.
[8] K. Boahen, Neuromorphic microchips, Scientific American, vol. 292(5), 2005, pp 56–63.
[9] R. Silver, K. Boahen, S. Grillner, N. Kopell, and K.L. Olsen, Neurotech for neuroscience: unifying concepts, organizing principles, and emerging tools, Journal of Neuroscience, vol. 27(44), 2007, pp 11807–11819.
[10] J.V. Arthur and K. Boahen, Silicon neuron design: the dynamical systems approach, IEEE Transactions on Circuits and Systems, vol. 58(5), 2011, pp 1034–1043.
[11] C. Eliasmith and C.H. Anderson, Neural engineering: computation, representation, and dynamics in neurobiological systems, MIT Press, Cambridge, MA, 2003.
[12] C. Eliasmith, A unified approach to building and controlling spiking attractor networks, Neural Computation, vol. 17, 2005, pp 1276–1314.
[13] R. Singh and C. Eliasmith, Higher-dimensional neurons explain the tuning and dynamics of working memory cells, The Journal of Neuroscience, vol. 26(14), 2006, pp 3667–3678.
[14] C. Eliasmith, How to build a brain: from function to implementation, Synthese, vol. 159(3), 2007, pp 373–388.
[15] R.E. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME, Journal of Basic Engineering, vol. 82(Series D), 1960, pp 35–45.
[16] G. Welch and G. Bishop, An introduction to the Kalman filter, University of North Carolina at Chapel Hill, Chapel Hill, NC, vol. 95(TR 95-041), 1995, pp 1–16.
[17] W.Q. Malik, W. Truccolo, E.N. Brown, and L.R. Hochberg, Efficient decoding with steady-state Kalman filter in neural interface systems, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 19(1), 2011, pp 25–34.

3 0.74431974 86 nips-2011-Empirical models of spiking in neural populations

Author: Jakob H. Macke, Lars Buesing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-of-fit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observation models are derived from either Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts.

4 0.73407018 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

Author: Konrad Koerding, Ian Stevenson

Abstract: Synaptic plasticity underlies learning and is thus central for development, memory, and recovery from injury. However, it is often difficult to detect changes in synaptic strength in vivo, since intracellular recordings are experimentally challenging. Here we present two methods aimed at inferring changes in the coupling between pairs of neurons from extracellularly recorded spike trains. First, using a generalized bilinear model with Poisson output we estimate time-varying coupling assuming that all changes are spike-timing-dependent. This approach allows model-based estimation of STDP modification functions from pairs of spike trains. Then, using recursive point-process adaptive filtering methods we estimate more general variation in coupling strength over time. Using simulations of neurons undergoing spike-timing dependent modification, we show that the true modification function can be recovered. Using multi-electrode data from motor cortex we then illustrate the use of this technique on in vivo data.

5 0.70158702 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

6 0.68292409 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations

7 0.66452026 24 nips-2011-Active learning of neural response functions with Gaussian processes

8 0.66173744 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

9 0.66032511 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

10 0.62159967 75 nips-2011-Dynamical segmentation of single trials from population neural data

11 0.590608 219 nips-2011-Predicting response time and error rates in visual search

12 0.58567655 23 nips-2011-Active dendrites: adaptation to spike-based communication

13 0.57673538 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

14 0.56976599 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

15 0.54354739 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

16 0.53970802 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models

17 0.52943856 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

18 0.51768339 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

19 0.49833912 249 nips-2011-Sequence learning with hidden units in spiking neural networks

20 0.49307573 123 nips-2011-How biased are maximum entropy models?

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (1, 0.112), (4, 0.037), (20, 0.028), (26, 0.024), (31, 0.099), (43, 0.13), (45, 0.074), (57, 0.082), (63, 0.012), (65, 0.022), (74, 0.036), (83, 0.156), (84, 0.033), (99, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91068816 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

2 0.84108347 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefficients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors.

3 0.82872915 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

Author: Konrad Koerding, Ian Stevenson

4 0.82833421 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta

Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario.

1 INTRODUCTION

Learning-by-example promises to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problem-solving tasks that use a model to infer the source, or cause, of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9].
As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “Query-Based Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely, the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper.

∗ J. M. Agosta is no longer affiliated with Intel Corporation.

Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available). We also consider weakening the constraints by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the maximum a posteriori setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work.

2 RELATED WORK

Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5], leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM.
de Campos and Ji [4] also use constrained convex optimization; however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints.

A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours, which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts.

3 BACKGROUND

3.1 DIAGNOSTIC BAYES NETWORKS

We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; $C$ for causes, and $T$ for observables (tests). Lower case letters denote values in the domain of a variable, e.g. $c \in dom(C) = \{c, \bar{c}\}$, and bold letters denote sets of variables. A set of marginally independent binary-valued node variables $\mathbf{C}$ with distributions $\Pr(C)$ represent unobserved causes, and condition the remaining conditionally independent binary-valued test variable nodes $\mathbf{T}$.
Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions $\Pr(T|C)$ incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models.

Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1.

$$GI(C|T) = \sum_t \Pr(T = t) \Big[ \sum_c \Pr(C = c \mid T = t)\,\big(1 - \Pr(C = c \mid T = t)\big) \Big] \qquad (1)$$

We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence.

3.2 MODEL CONSISTENCY

A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notions of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice.
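The Gini impurity of Equation 1 can be evaluated directly once the prior over joint cause assignments and the test's conditional probabilities are tabulated. The following is a minimal sketch under that assumption; the function name `expected_gini` and the dictionary-based representation are illustrative choices, not taken from the paper.

```python
def expected_gini(prior_c, pr_t_given_c):
    """Expected Gini impurity GI(C|T) of Equation 1 for one binary test.

    prior_c: dict mapping each joint cause assignment c to Pr(C = c).
    pr_t_given_c: dict mapping c to Pr(T = true | C = c).
    (Both argument layouts are assumptions made for this sketch.)
    """
    gi = 0.0
    for t in (True, False):
        # Joint Pr(C = c, T = t) for every cause assignment c.
        joint = {c: p * (pr_t_given_c[c] if t else 1.0 - pr_t_given_c[c])
                 for c, p in prior_c.items()}
        pr_t = sum(joint.values())
        if pr_t == 0.0:
            continue  # this outcome never occurs, so it contributes nothing
        # Pr(T = t) * sum_c Pr(c | t) * (1 - Pr(c | t))
        gi += pr_t * sum((j / pr_t) * (1.0 - j / pr_t) for j in joint.values())
    return gi


# One binary cause with a uniform prior.
prior = {True: 0.5, False: 0.5}
informative = expected_gini(prior, {True: 1.0, False: 0.0})
uninformative = expected_gini(prior, {True: 0.5, False: 0.5})
```

A test that determines the cause exactly leaves no impurity, while a test that is independent of the cause leaves the prior impurity unchanged, which is why a lower value ranks a test higher.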
Assuming that the expert greedily chooses the most informative test $T^*$ (i.e., the test that yields the lowest Gini impurity) at each step, the model is consistent with the expert’s choices when the following constraints are satisfied:

$$GI(C|T^*) \le GI(C|T_i) \quad \forall i \qquad (2)$$

We demonstrate next how to exploit these constraints to refine the Bayes network.

4 MODEL REFINEMENT

Consider a simple diagnosis example with two possible causes $C_1$ and $C_2$ and two tests $T_1$ and $T_2$ as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterization for the conditional distributions, the number of parameters is linear in the number of parents instead of exponential.

$$\Pr(T_i = true \mid C) = 1 - (1 - \theta_0^i) \prod_{j \mid C_j = true} (1 - \theta_j^i) \qquad (3)$$

Here, $\theta_0^i = \Pr(T_i = true \mid C_j = false \;\forall j)$ is the leak probability that $T_i$ will be true when none of the causes are true, and $\theta_j^i = \Pr(T_i = true \mid C_j = true, C_k = false \;\forall k \ne j)$ is the link reliability, which indicates the independent contribution of cause $C_j$ to the probability that test $T_i$ will be true. In the rest of this section, we describe how to learn the $\theta$ parameters while respecting the constraints implied by test consistency.

Figure 1: Network with 2 causes and 2 tests.
Figure 2: Augmented network with parameters and constraints.
Figure 3: Augmented network extended to handle inaccurate feedback.

4.1 TEST CONSISTENCY CONSTRAINTS

Suppose that an expert chooses test $T_1$ instead of test $T_2$ during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the $\theta$’s) must be consistent with the constraint $GI(C|T_2) - GI(C|T_1) \ge 0$. Using the definition of Gini impurity in Eq. 1, we can rewrite the constraint for the network shown in Fig. 1 as follows:

$$\sum_{t_2} \Big( \Pr(t_2) - \sum_{c_1, c_2} \frac{(\Pr(t_2 \mid c_1, c_2) \Pr(c_1) \Pr(c_2))^2}{\Pr(t_2)} \Big) - \sum_{t_1} \Big( \Pr(t_1) - \sum_{c_1, c_2} \frac{(\Pr(t_1 \mid c_1, c_2) \Pr(c_1) \Pr(c_2))^2}{\Pr(t_1)} \Big) \ge 0 \qquad (4)$$

Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the $\theta$’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the case where $\theta_0^i = 0$ and $\theta_1^i = 0.5$ for each test $i$, which leaves $\theta_2^1$ and $\theta_2^2$ as the only free variables. The parameters’ feasible space, satisfying the constraint, consists of the two disconnected regions where the surface is positive.

4.2 AUGMENTED BAYES NETWORK

Our objective is to learn the $\theta$ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete ($T$, $C$, $V$) and continuous ($\Theta$) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints.

We encode the constraint derived from the expert feedback as a binary random variable $V$ in the Bayes network. If $V$ is true the constraint is satisfied, otherwise it is violated. Thus, if $V$ is true then $\Theta$ lies in the positive region of Fig. 4, and if $V$ is false then $\Theta$ lies in the negative region. We model the CPT for $V$ as $\Pr(V|\Theta) = \max(0, \pi)$, where $\pi = GI(C|T_2) - GI(C|T_1)$ is the left-hand side of Eq. 4. Note that the value of $GI(C|T)$ lies in the interval [0,1], so the probability $\pi$ will always be normalized.
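The Noisy-Or parameterization of Eq. 3 and the constraint likelihood $\max(0, \pi)$ can be sketched together for a small network. This is a minimal sketch, not the authors' implementation: the function names and the `(leak, links)` tuple layout are hypothetical, and $\pi$ is taken as $GI(C|T_2) - GI(C|T_1)$, the sign under which the stated constraint $GI(C|T_2) - GI(C|T_1) \ge 0$ makes $\pi$ non-negative when the expert prefers $T_1$.

```python
from itertools import product


def noisy_or(leak, links, causes):
    """Pr(T = true | C) under Eq. 3: leak theta_0 and link reliabilities theta_j."""
    p_false = 1.0 - leak
    for theta_j, c_j in zip(links, causes):
        if c_j:  # only causes that are true contribute a factor
            p_false *= 1.0 - theta_j
    return 1.0 - p_false


def gini(priors, leak, links):
    """GI(C|T) = sum_t [Pr(t) - sum_c Pr(c, t)^2 / Pr(t)] for one binary test."""
    joint_true, joint_false = [], []
    for causes in product([True, False], repeat=len(priors)):
        p_c = 1.0
        for prior, c_j in zip(priors, causes):
            p_c *= prior if c_j else 1.0 - prior
        p_t = noisy_or(leak, links, causes)
        joint_true.append(p_c * p_t)
        joint_false.append(p_c * (1.0 - p_t))
    total = 0.0
    for joint in (joint_true, joint_false):
        pr_t = sum(joint)
        if pr_t > 0.0:
            total += pr_t - sum(j * j for j in joint) / pr_t
    return total


def constraint_probability(priors, test1, test2):
    """Pr(V = true | Theta) = max(0, pi) when the expert picks test 1 over test 2.

    Each test is a hypothetical (leak, links) tuple.
    """
    pi = gini(priors, *test2) - gini(priors, *test1)
    return max(0.0, pi)


# Two causes with known priors; leak 0 and theta_1 = 0.5 as in the Fig. 4 setting.
priors = [0.5, 0.5]
prefer_t1 = constraint_probability(priors, (0.0, [0.5, 0.9]), (0.0, [0.5, 0.1]))
```

Swapping the two tests in the call asks how likely the opposite ordering is; at most one of the two orderings can receive a non-zero value for a given parameter setting.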
The intuition behind this definition of the CPT for $V$ is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each $\Theta$ parameter. Since the test variables are conditioned on the $\Theta$ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for $T_i$ (given in Eq. 3) is fully defined given the noisy-or parameters $\theta_j^i$. Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): $\Theta^* = \arg\max_\Theta \Pr(\Theta \mid V = true)$.

Algorithm 1: Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search
 1 Fix observed variables, let V = true and randomly sample feasible starting state S
 2 for i = 1 to #samples
 3   for j = 1 to #hiddenVariables
 4     acceptSample = false; k = 0
 5     repeat
 6       Sample s′ from conditional of j-th hidden variable S_j
 7       S′ = S; S′_j = s′
 8       if S_j is cause or test, then acceptSample = true
 9       elseif S′ obeys constraints V∗
10         if algo == Gibbs
11           Sample u from uniform distribution, U(0,1)
12           if u < p(S′) / (M q(S′)), where p and q are the true and proposal distributions and M > 1
13             acceptSample = true
14         elseif algo == StochasticHillClimbing
15           if likelihood(S′) > likelihood(S), then acceptSample = true
16         elseif algo == Greedy, then acceptSample = true
17       elseif algo == Greedy
18         k = k + 1
19         if k == maxIterations, then s′ = S_j; acceptSample = true
20     until acceptSample == true
21     S_j = s′

4.3 MAP ESTIMATION

Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties.

The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions:

$$t_i \sim \Pr(T_i \mid c, \theta^i) \quad \forall i \qquad (5)$$

$$c_j \sim \Pr(C_j \mid c - c_j, t, \theta) \propto \Pr(C_j) \prod_i \Pr(t_i \mid c, \theta^i) \quad \forall j \qquad (6)$$

$$\theta_j^i \sim \Pr(\Theta_j^i \mid \Theta - \Theta_j^i, t, c, v) \propto \Pr(v \mid t, \Theta) \prod_i \Pr(t_i \mid c_j, \theta^i) \quad \forall i, j \qquad (7)$$

The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the $\theta$’s is more difficult due to the factor $\Pr(v \mid \Theta, t) = \max(0, \pi)$, which is a truncated mixture of Betas. So, instead of sampling $\theta$ from its true conditional, we sample it from a proposal distribution that replaces $\max(0, \pi)$ by an un-truncated mixture of Betas equal to $\pi + a$, where $a$ is a constant that ensures that $\pi + a$ is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate).
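The idea of drawing parameters while ignoring the constraint, rejecting the draws that violate it, and keeping the feasible draw with the highest posterior score can be illustrated with a much simpler scheme than Algorithm 1. The sketch below samples directly from the Beta prior rather than running the Gibbs sampler, and scores each feasible draw by prior density times $\max(0, \pi)$. The function `map_by_rejection`, the Beta(2, 2) prior and the toy constraint are all assumptions made for illustration.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def map_by_rejection(pi, n_params, n_samples=2000, a=2.0, b=2.0, seed=0):
    """Approximate argmax of Pr(Theta | V = true), proportional to prior * max(0, pi).

    pi: callable returning the constraint value pi(theta) for a parameter list.
    Returns (best_theta, best_score), or (None, 0.0) if no draw is feasible.
    """
    rng = random.Random(seed)
    best_theta, best_score = None, 0.0
    for _ in range(n_samples):
        # Proposal: draw every parameter from its prior, ignoring the constraint.
        theta = [rng.betavariate(a, b) for _ in range(n_params)]
        weight = max(0.0, pi(theta))
        if weight == 0.0:
            continue  # reject draws that violate the constraint
        score = weight * math.prod(beta_pdf(x, a, b) for x in theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Hypothetical constraint: feasible only where the first parameter exceeds the second.
theta_map, score_map = map_by_rejection(lambda th: th[0] - th[1], n_params=2)
```

In the paper's setting, `pi` would be the difference in Gini impurity implied by the expert's choice; sampling from the prior keeps the sketch short but wastes draws when the feasible region is small, which is the cost the Gibbs, hill climbing and greedy variants are designed to reduce.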
Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable.

Figure 4: Difference in Gini impurity for the network in Fig. 1 when $\theta_2^1$ and $\theta_2^2$ are the only parameters allowed to vary.
Figure 5: Posterior over parameters computed through calculation after discretization.
Figure 6: Posterior over parameters calculated through sampling.

4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS

So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, $V$, as a probabilistic indicator of the true constraint $V^*$ as shown in Figure 3.
We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable.

5 EVALUATION AND EXPERIMENTS

5.1 EVALUATION CRITERIA

Formally, for $M^*$, the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by $\mathbf{V}^+$ the set of observed constraints and by $\mathbf{V}^*$ the set of all possible constraints that hold for $M^*$. Having only observed $\mathbf{V}^+$, our technique will consider any $M^+ \in \mathcal{M}^+$ as a possible true model, where $\mathcal{M}^+$ is the set of all models that obey $\mathbf{V}^+$. We denote by $\mathcal{M}^*$ the set of all models that are diagnostically equivalent to $M^*$ (i.e., obey $\mathbf{V}^*$ and would recommend the same steps as $M^*$) and by $M^{MAP}_{V^+}$ the particular model obtained by MAP estimation based on the constraints $\mathbf{V}^+$. Similarly, when a dataset $D$ is available, we denote by $M^{MAP}_{D}$ the model obtained by MAP estimation based on $D$ and by $M^{MAP}_{DV^+}$ the model based on $D$ and $\mathbf{V}^+$.

Ideally we would like to find the true underlying model $M^*$, hence we will report the KL divergence between the models found and $M^*$. However, other diagnostically equivalent models may recommend the same tests as $M^*$ and thus have similar constraints, so we also report test consistency with $M^*$ (i.e., the number of recommended tests that are the same).

5.2 CORRECTNESS OF MODEL REFINEMENT

Given $\mathbf{V}^*$, our technique for model adjustment is guaranteed to choose a model $M^{MAP} \in \mathcal{M}^*$ by construction. If any constraint $V^* \in \mathbf{V}^*$ is violated, the rejection sampling step of our technique would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms.

Figure 7: Mean KL-divergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints.
Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints.
Figure 9: Convergence rate comparison (negative log likelihood of the MAP estimate against elapsed time for Gibbs Sampling, Stochastic Hill Climbing and Greedy Sampling).

5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS

We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some $M^* \in \mathcal{M}^*$ that we want to learn given $\mathbf{V}^+$. We use our technique to find $M^{MAP}$. To evaluate $M^{MAP}$, we first compute the constraints $\mathbf{V}^*$ for $M^*$ to get the feasible region associated with the true model.
Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with $M^{MAP}$ (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of $M^{MAP}$ with respect to each sampled model. We expect KL-divergence to decrease as the number of constraints in $\mathbf{V}^+$ increases since the feasible region becomes smaller. Figure 7 confirms this trend and shows that $M^{MAP}_{DV^+}$ has lower mean KL-divergence than $M^{MAP}_{V^+}$, which has lower mean KL-divergence than $M^{MAP}_{D}$. The data points in $D$ are limited to the results of the diagnostic sessions needed to obtain $\mathbf{V}^+$. As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints.

We also compare the test consistency when learning from data only, constraints only or both. Given a fixed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentages for the greedy sampling technique (the results are similar for the other techniques). It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the time better than learning with only data.

Figure 9 compares the convergence rate of each technique to find the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to find parameter settings with high posterior probability.

5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS

We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure was established.

Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint.
Figure 11: KL divergence comparison as the number of constraints increases for the real world problem (KL-divergence computed on the joint over all tests, for Data Only, Constraints Only and Data+Constraints).

The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model refinement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were $\mathbf{V}^*$ and the corresponding feasible region $\mathcal{M}^*$ as if it contained models diagnostically equivalent to the unknown true model.
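The KL divergence used in these comparisons can be computed from the joint distribution that each model assigns to the test outcomes. The sketch below assumes each model has already been reduced to a dictionary from outcome tuples to probabilities; that representation, the function name `kl_divergence` and the direction of the divergence are assumptions, since the text does not specify them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions.

    p, q: dicts mapping outcomes (e.g. tuples of test results) to probabilities.
    Returns infinity when q gives zero mass to an outcome that p supports.
    """
    total = 0.0
    for x, p_x in p.items():
        if p_x == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        q_x = q.get(x, 0.0)
        if q_x == 0.0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total


# Joint over one binary test under a learned model and a sampled reference model.
learned = {(True,): 0.5, (False,): 0.5}
reference = {(True,): 0.25, (False,): 0.75}
divergence = kl_divergence(learned, reference)
```

Averaging this quantity over many sampled reference models gives a mean and standard deviation of the kind plotted against the number of constraints.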
Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from $\mathcal{M}^*$ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only.

6 CONCLUSION AND FUTURE WORK

In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: First, this is the first approach that exploits implicit constraints based on value of information assessments. Second, it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions.

Acknowledgments

This work was supported by a grant from Intel Corporation.

References

[1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010.
[2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005.
[3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005.
[4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008.
[5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995.
[6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of Twenty Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007.
[7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221–229, 1991.
[8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996.
[9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995.
[10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966.
[11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
[12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009.
[13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009.
[14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984.
[15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006.
[16] Mark A. Peot and Ross D. Shachter. Learning from what you don’t observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998.
[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990.
[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000.

5 0.81705159 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

6 0.81167823 86 nips-2011-Empirical models of spiking in neural populations

7 0.80638611 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

8 0.80570132 145 nips-2011-Learning Eigenvectors for Free

9 0.80316466 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

10 0.80211306 24 nips-2011-Active learning of neural response functions with Gaussian processes

11 0.79179657 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

12 0.7912029 219 nips-2011-Predicting response time and error rates in visual search

13 0.78987801 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

14 0.78955591 75 nips-2011-Dynamical segmentation of single trials from population neural data

15 0.78746343 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

16 0.78574145 249 nips-2011-Sequence learning with hidden units in spiking neural networks

17 0.78290659 102 nips-2011-Generalised Coupled Tensor Factorisation

18 0.7774877 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise

19 0.76440072 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

20 0.7620427 301 nips-2011-Variational Gaussian Process Dynamical Systems