nips nips2013 nips2013-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mijung Park, Jonathan W. Pillow
Abstract: The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of space-time separable (i.e., rank-1) filters. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-10K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Bayesian inference for low rank spatiotemporal neural receptive fields Jonathan W. Pillow [sent-1, score-0.267]
2 Abstract The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. [sent-6, score-0.302]
3 In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. [sent-7, score-0.098]
4 Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. [sent-9, score-0.096]
5 We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. [sent-14, score-0.464]
6 We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. [sent-15, score-0.192]
7 We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. [sent-16, score-0.248]
8 1 Introduction A neuron’s linear receptive field (RF) is a filter that maps high-dimensional sensory stimuli to a one-dimensional variable underlying the neuron’s spike rate. [sent-17, score-0.16]
9 In white noise or reverse-correlation experiments, the dimensionality of the RF is determined by the number of stimulus elements in the spatiotemporal window influencing a neuron’s probability of spiking. [sent-18, score-0.151]
10 For a stimulus movie with nx ×ny pixels per frame, the RF has nx ny nt coefficients, where nt is the (experimenter-determined) number of movie frames in the neuron’s temporal integration window. [sent-19, score-0.448]
11 A substantial literature has therefore focused on methods for regularizing RF estimates to improve accuracy in the face of limited experimental data. [sent-22, score-0.065]
12 Popular methods have involved priors to impose smallness, sparsity, smoothness, and localized structure in RF coefficients [1, 2, 3, 4, 5]. [sent-24, score-0.119]
13 Moreover, it can substantially reduce the number of RF parameters: a rank p receptive field in nx ny nt dimensions requires only p(nx ny + nt − 1) parameters, since a single space-time separable filter has nx ny spatial coefficients and nt − 1 temporal coefficients (i.e., the overall scale of each component can be absorbed into the spatial filter, removing one temporal degree of freedom). [sent-27, score-0.688]
14 When p ≪ min(nx ny, nt), as commonly occurs in experimental settings, this parametrization yields considerable savings. [sent-30, score-0.069]
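As a quick arithmetic check on this parameter count, a minimal sketch (the dimensions are borrowed from the retinal example later in the text; the rank is an assumed value):

```python
# Full-rank vs. rank-p RF parameter counts, p * (nx*ny + nt - 1) per the text.
nx, ny, nt = 10, 10, 25                   # 10 x 10 spatial pixels, 25 temporal bins
p = 2                                     # assumed rank

full_rank_params = nx * ny * nt           # 2500 coefficients
low_rank_params = p * (nx * ny + nt - 1)  # 2 * (100 + 24) = 248 coefficients
print(full_rank_params, low_rank_params)  # -> 2500 248
```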
15 In the statistics literature, the problem of estimating a low-rank matrix of regression coefficients is known as reduced rank regression [10, 11]. [sent-31, score-0.138]
16 Here we formulate a novel prior for reduced rank regression using a restriction of the matrix normal distribution [13] to the manifold of low-rank matrices. [sent-33, score-0.171]
17 Moreover, under a linear-Gaussian response model, the posteriors over RF rows and columns are conditionally Gaussian, leading to fast and efficient sampling-based inference methods. [sent-35, score-0.175]
18 We use a “localized” form for the row and column covariances in the matrix normal prior, which have hyperparameters governing smoothness and locality of RF components in space and time [5]. [sent-36, score-0.118]
19 In addition to fully Bayesian sampling-based inference, we develop a fast approximate inference method using coordinate ascent of the conditional marginal likelihoods for temporal (column) and spatial (row) hyperparameters. [sent-37, score-0.39]
20 In Sec. 2, we describe the low-rank RF model with localized priors. [sent-41, score-0.096]
21 In Sec. 3, we describe a fully Bayesian inference method using blocked-Gibbs sampling with interleaved Metropolis-Hastings steps. [sent-43, score-0.066]
22 In Sec. 4, we introduce a fast method for approximate inference using conditional empirical Bayesian hyperparameter estimates. [sent-45, score-0.209]
23 1 Hierarchical low-rank receptive field model Response model (likelihood) We begin by defining two probabilistic encoding models that will provide likelihood functions for RF inference. [sent-51, score-0.197]
24 Let yi denote the number of spikes that occur in response to a (dt × dx ) matrix stimulus Xi , where dt and dx denote the number of temporal and spatial elements in the RF, respectively. [sent-52, score-0.41]
25 Let K denote the neuron’s (dt × dx ) matrix receptive field. [sent-53, score-0.154]
26 We will consider, first, a linear Gaussian encoding model: yi | Xi ∼ N(xi⊤k + b, γ), (1) where xi = vec(Xi) and k = vec(K) denote the vectorized stimulus and vectorized RF, respectively, γ is the variance of the response noise, and b is a bias term. [sent-54, score-0.176]
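To make the model concrete, a minimal simulation sketch of the linear Gaussian encoding model in eq. 1, built on the low-rank factorization introduced just below (eq. 3); the random filters, the dimensions, and the row-major vec convention are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, dx, p = 16, 64, 2                  # temporal bins, spatial pixels, assumed rank

Kt = rng.standard_normal((dt, p))      # columns: temporal filters
Kx = rng.standard_normal((dx, p))      # columns: spatial filters
K = Kt @ Kx.T                          # (dt x dx) rank-p receptive field
k = K.ravel()                          # k = vec(K), row-major here

n, b, gamma = 250, 1.0, 0.5            # trials, bias, noise variance
X = rng.standard_normal((n, dt * dx))  # rows: vectorized stimuli x_i = vec(X_i)
y = X @ k + b + np.sqrt(gamma) * rng.standard_normal(n)   # eq. 1 responses
```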
27 2 Prior for low rank receptive field We can represent an RF of rank p using the factorization K = Kt Kx⊤, (3) where the columns of the matrix Kt ∈ Rdt×p contain temporal filters and the columns of the matrix Kx ∈ Rdx×p contain spatial filters. [sent-59, score-0.353]
28 We define a prior over rank-p matrices using a restriction of the matrix normal distribution MN(0, Cx, Ct). [sent-60, score-0.22]
29 The prior is controlled by a “column” covariance matrix Ct ∈ Rdt ×dt and “row” covariance matrix Cx ∈ Rdx ×dx , which govern the temporal and spatial RF components, respectively. [sent-62, score-0.318]
30 In the ALD prior, the covariance matrix encodes the tendency for RFs to be localized in both space-time and spatiotemporal frequency. [sent-68, score-0.212]
31 The positive definite matrices Φs and Φf are D × D and determine the size of the local region of RF support in space and spatial frequency, respectively [15]. [sent-72, score-0.077]
32 In the temporal covariance matrix Ct , the hyperparameters θt , which are directly analogous to θx , determine the localized RF structure in time and temporal frequency. [sent-73, score-0.37]
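A rough sketch of what such a localized covariance can look like, as a smoothing kernel multiplied by a Gaussian envelope; this is only in the spirit of the ALD prior of [15] (the frequency-domain factor Φf and the exact θ parametrization are omitted), and all parameter values are illustrative:

```python
import numpy as np

def localized_cov(d, center, width, length_scale, rho=1.0):
    """Illustrative 'localized' covariance over d coefficients: a
    squared-exponential smoothing kernel times a Gaussian envelope
    that shrinks coefficients far from `center`."""
    t = np.arange(d)
    env = np.exp(-0.5 * ((t - center) / width) ** 2)   # local support region
    smooth = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    return rho * np.outer(env, env) * smooth + 1e-8 * np.eye(d)  # jittered PSD

Ct = localized_cov(16, center=6.0, width=4.0, length_scale=2.0)    # "column" cov
Cx = localized_cov(64, center=32.0, width=10.0, length_scale=3.0)  # "row" cov
```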
33 3 Posterior inference using Markov Chain Monte Carlo For a complete dataset D = {X, y}, where X ∈ Rn×(dt dx) is a design matrix and y is a vector of responses, our goal is to infer the joint posterior over K and b: p(K, b|D) ∝ ∫ p(D|K, b) p(K|θt, θx) p(b|σb²) p(θt, θx, σb²) dσb² dθt dθx. [sent-75, score-0.145]
34 Blocked-Gibbs sampling is possible since the closed-form conditional priors in eq. 6 and the Gaussian likelihood yield a closed-form “conditional marginal likelihood” for θt | (kx, θx, D) and θx | (kt, θt, D), respectively. [sent-77, score-0.154] [sent-78, score-0.067]
36 The blocked-Gibbs sampler first samples (σb², θt, γ) from the conditional evidence and simultaneously samples kt from the conditional posterior. [sent-79, score-0.643]
37 Given the samples of (σb², θt, γ, b, kt), we then sample θx and kx similarly. [sent-80, score-0.828]
38 For sampling from the conditional evidence, we use the Metropolis-Hastings (MH) algorithm to sample the low-dimensional space of hyperparameters. [sent-81, score-0.131]
39 For sampling (b, kt) and kx, we use the closed-form formula (to be introduced shortly) for the mean of the conditional posterior. [sent-82, score-0.937]
40 (10) We use the MH algorithm to search over the low-dimensional hyperparameter space, with the conditional evidence (eq. 8) as the target distribution. [sent-92, score-0.182]
41 • We sample (b, kt) from the conditional posterior. [sent-94, score-0.515]
42 As in Step 1, with a uniform hyperprior on θx , the conditional evidence is the target distribution in the MH algorithm. [sent-98, score-0.213]
43 • We sample kx from the conditional posterior. [sent-99, score-0.647]
44 Algorithm 1 Fully Bayesian low-rank RF inference using blocked-Gibbs sampling. Given data D, and conditioning on the samples of the other variables, iterate the following: [sent-103, score-0.088]
45 1. Sample (b, kt, σb², θt, γ) from the conditional evidence for (θt, σb², γ) (in eq. 8) and the conditional posterior over (b, kt). [sent-104, score-0.519] [sent-105, score-0.515]
47 2. Sample (kx, θx) from the conditional evidence for θx (in eq. 11) and the conditional posterior over kx. [sent-108, score-0.182] [sent-109, score-0.647]
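A control-flow sketch of Algorithm 1; the two callables are hypothetical stand-ins for the conditional evidence (eqs. 8 and 11) and the conditionally Gaussian posteriors, so this shows the structure of the sampler rather than a working implementation:

```python
def blocked_gibbs(X, y, state, n_samples, sample_hyper, sample_filters):
    """Skeleton of Algorithm 1. `sample_hyper(state, X, y, side)` is a
    hypothetical MH step targeting the conditional evidence; `sample_filters`
    draws (b, kt) or kx from its conditionally Gaussian posterior. `state`
    holds (b, Kt, Kx, theta_t, theta_x, sigma_b2, gamma)."""
    draws = []
    for _ in range(n_samples):
        for side in ("temporal", "spatial"):    # Step 1, then Step 2
            state = sample_hyper(state, X, y, side)
            state = sample_filters(state, X, y, side)
        draws.append(dict(state))
    return draws
```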
49 4 Approximate algorithm for fast posterior inference Here we develop an alternative, approximate algorithm for fast posterior inference. [sent-112, score-0.312]
50 Instead of integrating over hyperparameters, we attempt to find point estimates that maximize the conditional marginal likelihood. [sent-113, score-0.167]
51 In our model, the evidence has no closed form; however, the conditional evidence for (θt, σb², γ) given (kx, θx) and the conditional evidence for θx given (b, kt, θt, σb², γ) are given in closed form (in eqs. 8 and 11). [sent-115, score-0.827]
52 The approximate algorithm alternates these conditional updates (eq. 16), estimating (b, kt) and θt given (kx, θx) and then θx and kx given (b, kt). It works well if the conditional evidence is tightly concentrated around its maximum. [sent-119, score-1.155]
53 Note that if the hyperparameters are fixed, the iterative updates of (b, kt ) and kx given above amount to alternating coordinate ascent of the posterior over (b, K). [sent-120, score-0.922]
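The fast method replaces the Gibbs draws with point estimates; a sketch of that loop, where both callables are again hypothetical stand-ins for the closed-form conditional evidence (eqs. 8 and 11) and the conditional posterior means:

```python
def alternating_ascent(X, y, state, maximize_evidence, posterior_mean,
                       max_iter=100, tol=1e-6):
    """Sketch of the approximate method: alternate empirical-Bayes
    maximization of the conditional evidence with conditional
    posterior-mean updates of the filters."""
    prev = float("-inf")
    for _ in range(max_iter):
        for side in ("temporal", "spatial"):
            state, evidence = maximize_evidence(state, X, y, side)
            state = posterior_mean(state, X, y, side)
        if evidence - prev < tol:   # stop when the evidence plateaus
            break
        prev = evidence
    return state
```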
54 5 Extension to Poisson likelihood When the likelihood is non-Gaussian, blocked-Gibbs sampling is not tractable, because we do not have a closed-form expression for the conditional evidence. [sent-121, score-0.288]
55 Here, we introduce a fast, approximate inference algorithm for the low-rank RF model under the LNP likelihood. [sent-122, score-0.072]
56 However, we make a Gaussian approximation to the conditional posterior over (b, kt ) given kx via the Laplace approximation. [sent-125, score-0.984]
57 We then approximate the conditional evidence for (θt, σb²) given kx at the posterior mode of (b, kt) given kx. [sent-126, score-1.568]
58 [Figure 1 panels: A, the true rank-2 RF (16 time bins × 64 space pixels) and the estimates; B, MSE for 250 and 2000 training samples under ML, full-rank, low-rank (fast), and low-rank (Gibbs).] [sent-132, score-0.079]
59 Estimates obtained by ML, full-rank ALD, low-rank approximate method, and blocked-Gibbs sampling, using 250 samples (top), and using 2000 samples (bottom), respectively. [sent-139, score-0.079]
60 At the posterior mode wt = ŵt, the approximate conditional evidence (eq. 17) is simply log p(D|θt, σb², kx) ≈ log p(D|ŵt, Mx) − (1/2) ŵt⊤ Cwt⁻¹ ŵt − (1/2) log |Cwt Σt⁻¹|, which we maximize to set θt and σb². [sent-144, score-1.045]
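A generic sketch of this Laplace-evidence computation for a filter w with Gaussian prior N(0, C); this is not the paper's implementation, and BFGS's inverse-Hessian estimate stands in for the exact posterior covariance Σt:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(neg_log_lik, w0, C):
    """Laplace approximation of the form in the text:
    log p(D) ~= log p(D|w_hat) - 0.5 w_hat' C^{-1} w_hat - 0.5 log|C Sigma^{-1}|,
    with w_hat the posterior mode and Sigma the local posterior covariance."""
    Cinv = np.linalg.inv(C)
    neg_log_post = lambda w: neg_log_lik(w) + 0.5 * w @ Cinv @ w
    res = minimize(neg_log_post, w0, method="BFGS")
    w_hat, Sigma = res.x, res.hess_inv     # BFGS curvature approximation
    _, logdet = np.linalg.slogdet(C @ np.linalg.inv(Sigma))
    return -neg_log_lik(w_hat) - 0.5 * w_hat @ Cinv @ w_hat - 0.5 * logdet
```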
61 Due to space limits, we omit the derivations of the conditional posterior for kx and the conditional evidence for θx given (b, kt). [sent-145, score-1.166]
62 1 Simulations We first tested the performance of blocked-Gibbs sampling and the fast approximate algorithm on a simulated Gaussian neuron with a rank-2 RF of 16 temporal bins and 64 spatial pixels, shown in Fig. 1. [sent-148, score-0.366]
63 We compared these methods with the maximum likelihood estimate and the full-rank ALD estimate. [sent-150, score-0.067]
64 Fig. 1 shows that the low-rank RF estimates obtained by blocked-Gibbs sampling and the approximate algorithm perform similarly, and achieve lower mean squared error than the full-rank RF estimates. [sent-152, score-0.129]
65 The low-rank RF estimates under the LNP model perform better than those under the linear Gaussian model. [sent-162, score-0.065]
66 We then tested the performance of the above methods on a simulated linear-nonlinear Poisson (LNP) neuron with the same RF and the softrect nonlinearity. [sent-163, score-0.113]
67 Fig. 2 shows that the low-rank RF [sent-166, score-0.192]
68 [Figure 3 panels: relative likelihood per test stimulus vs. rank (1 to 4), and space-by-time RF estimates for the low-rank STA, low-rank (fast), and low-rank (Gibbs) methods for V1 simple cell #1.] Figure 3: Comparison of low-rank RF estimates for V1 simple cells (using white noise flickering bars stimuli [16]). [sent-170, score-0.397]
69 A: Relative likelihood per test stimulus (left) and low-rank RF estimates for three different ranks (right). [sent-171, score-0.229]
70 Relative likelihood is the ratio of the test likelihood of rank-1 STA to that of other estimates. [sent-172, score-0.134]
71 The rank-4 estimates obtained by blocked-Gibbs sampling and the approximate method achieve the highest test likelihood for this cell. [sent-177, score-0.196]
72 estimates perform better than full-rank estimates regardless of the model, and that the low-rank RF estimates under the LNP model achieved the lowest MSE. [sent-180, score-0.195]
73 2 Application to neural data We applied our methods to estimate the RFs of V1 simple cells and retinal ganglion cells (RGCs). [sent-182, score-0.079]
74 In Fig. 4, we show the average test likelihood as a function of RF rank under the linear Gaussian model. [sent-187, score-0.144]
75 We also show the low-rank RF estimates obtained by our methods as well as the low-rank STA. [sent-188, score-0.065]
76 If the stimulus distribution is non-Gaussian, the low-rank STA will have larger bias than the low-rank ALD estimate. [sent-190, score-0.097]
77 [Figure 4 panels: A, RGC off-cell; B, RGC on-cell. Each shows relative likelihood per test stimulus vs. rank (1 to 4) and the 1st to 3rd spatial-extent and temporal-extent components for low-rank (Gibbs), low-rank (fast), and the low-rank STA.] Figure 4: Comparison of low-rank RF estimates for retinal data (using binary white noise stimuli [9]). [sent-193, score-0.212]
80 The RF consists of 10 by 10 spatial pixels and 25 temporal bins (2500 RF coefficients). [sent-194, score-0.185]
81 A: Relative likelihood per test stimulus (left), top three left singular vectors (middle), and right singular vectors (right) of the estimated RF for an off-RGC cell. [sent-195, score-0.204]
82 The samplingbased RF estimate benefits from a rank-3 representation, making use of three distinct spatial and temporal components, whereas the performance of the low-rank STA degrades above rank 1. [sent-196, score-0.24]
83 [Figure 5 panels: A, RF estimates; B, prediction error (log scale) vs. minutes of training data for the ML and rank-2 estimates under the Gaussian and LNP models.] Figure 5: RF estimates for a V1 simple cell. [sent-213, score-0.097]
85 A: RF estimates obtained by ML (left) and low-rank blocked-Gibbs sampling under the linear Gaussian model (middle), and low-rank approximate algorithm under the LNP model (right), for two different amounts of training data (30 sec. [sent-215, score-0.129]
86 The RF consists of 16 temporal and 16 spatial dimensions (256 RF coefficients). [sent-218, score-0.163]
87 The low-rank RF estimates under the LNP model achieved the lowest prediction error among all methods. [sent-220, score-0.065]
88 We computed the test likelihood of each estimate to set the RF rank and found that the rank-2 RF estimates achieved the highest test likelihood. [sent-231, score-0.209]
89 In terms of average prediction error, the low-rank RF estimates obtained by our fast approximate algorithm achieved the lowest error, while the runtime of the algorithm was significantly lower than that of full-rank inference methods. [sent-232, score-0.191]
90 We introduced a novel prior for low-rank matrices based on a restricted matrix normal distribution, which has the feature of preserving a marginally Gaussian prior over the regression coefficients. [sent-234, score-0.144]
91 We used a “localized” form to define row and column covariance matrices in the matrix normal prior, which allows the model to flexibly learn smooth and sparse structure in RF spatial and temporal components. [sent-235, score-0.28]
92 We developed two inference methods: an exact one based on MCMC with blocked-Gibbs sampling and an approximate one based on alternating evidence optimization. [sent-236, score-0.181]
93 Overall, we found that low-rank estimates achieved higher prediction accuracy with significantly lower computation time compared to full-rank estimates. [sent-238, score-0.065]
94 We believe our localized, low-rank RF model will be especially useful in high-dimensional settings, particularly in cases where the stimulus covariance matrix does not fit in memory. [sent-239, score-0.159]
95 In future work, we will develop fully Bayesian inference methods for low-rank RFs under the LNP noise model, which will allow us to quantify the accuracy of our approximate method. [sent-240, score-0.09]
96 Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. [sent-262, score-0.148]
97 Spectrotemporal structure of receptive fields in areas ai and aaf of mouse auditory cortex. [sent-295, score-0.129]
98 Gabor analysis of auditory midbrain receptive fields: Spectro-temporal and binaural composition. [sent-300, score-0.129]
99 Maximum likelihood estimation of cascade point-process neural encoding models. [sent-341, score-0.098]
100 Bayesian active learning with localized priors for fast receptive field characterization. [sent-347, score-0.253]
wordName wordTfidf (topN-words)
[('rf', 0.555), ('kx', 0.469), ('kt', 0.337), ('lnp', 0.19), ('cx', 0.156), ('ald', 0.155), ('cwt', 0.14), ('sta', 0.139), ('wt', 0.125), ('rfs', 0.114), ('conditional', 0.102), ('mx', 0.1), ('receptive', 0.099), ('stimulus', 0.097), ('localized', 0.096), ('temporal', 0.086), ('ct', 0.082), ('ml', 0.081), ('evidence', 0.08), ('spatial', 0.077), ('rank', 0.077), ('posterior', 0.076), ('likelihood', 0.067), ('estimates', 0.065), ('neuron', 0.06), ('nx', 0.058), ('poisson', 0.054), ('spatiotemporal', 0.054), ('vec', 0.051), ('cients', 0.044), ('gaussian', 0.042), ('hyperparameters', 0.04), ('coef', 0.04), ('stimuli', 0.039), ('covariance', 0.039), ('nt', 0.038), ('mt', 0.038), ('inference', 0.037), ('pillow', 0.036), ('dt', 0.036), ('fast', 0.035), ('approximate', 0.035), ('row', 0.034), ('bayesian', 0.034), ('gibbs', 0.032), ('dx', 0.032), ('minutes', 0.032), ('encoding', 0.031), ('mh', 0.031), ('retinal', 0.031), ('dkx', 0.031), ('hyperprior', 0.031), ('mxt', 0.031), ('schreiner', 0.031), ('softrect', 0.031), ('prior', 0.031), ('ny', 0.031), ('auditory', 0.03), ('sampling', 0.029), ('relative', 0.028), ('rdx', 0.027), ('dwt', 0.027), ('ickering', 0.027), ('poiss', 0.027), ('response', 0.027), ('eld', 0.027), ('separable', 0.026), ('rgc', 0.025), ('neurophysiology', 0.025), ('cells', 0.024), ('naturalistic', 0.024), ('matrix', 0.023), ('extent', 0.023), ('park', 0.023), ('closed', 0.023), ('retina', 0.023), ('movshon', 0.023), ('rust', 0.023), ('priors', 0.023), ('sensory', 0.022), ('simulated', 0.022), ('pixels', 0.022), ('samples', 0.022), ('litke', 0.022), ('hastings', 0.022), ('sahani', 0.022), ('normal', 0.021), ('xi', 0.021), ('chichilnisky', 0.02), ('christoph', 0.02), ('singular', 0.02), ('integration', 0.02), ('marginally', 0.019), ('regression', 0.019), ('responses', 0.019), ('econometrics', 0.019), ('stacking', 0.019), ('runtime', 0.019), ('sher', 0.018), ('develop', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
Author: Mijung Park, Jonathan W. Pillow
Abstract: The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of space-time separable (i.e., rank-1) filters. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-10K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. 1
2 0.1830605 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
3 0.13622871 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as m^min{−1/4, −1/2+2/p}. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1
4 0.11277734 305 nips-2013-Spectral methods for neural characterization using generalized quadratic models
Author: Il M. Park, Evan W. Archer, Nicholas Priebe, Jonathan W. Pillow
Abstract: We describe a set of fast, tractable methods for characterizing neural responses to high-dimensional sensory stimuli using a model we refer to as the generalized quadratic model (GQM). The GQM consists of a low-rank quadratic function followed by a point nonlinearity and exponential-family noise. The quadratic function characterizes the neuron’s stimulus selectivity in terms of a set linear receptive fields followed by a quadratic combination rule, and the invertible nonlinearity maps this output to the desired response range. Special cases of the GQM include the 2nd-order Volterra model [1, 2] and the elliptical Linear-Nonlinear-Poisson model [3]. Here we show that for “canonical form” GQMs, spectral decomposition of the first two response-weighted moments yields approximate maximumlikelihood estimators via a quantity called the expected log-likelihood. The resulting theory generalizes moment-based estimators such as the spike-triggered covariance, and, in the Gaussian noise case, provides closed-form estimators under a large class of non-Gaussian stimulus distributions. We show that these estimators are fast and provide highly accurate estimates with far lower computational cost than full maximum likelihood. Moreover, the GQM provides a natural framework for combining multi-dimensional stimulus sensitivity and spike-history dependencies within a single model. We show applications to both analog and spiking data using intracellular recordings of V1 membrane potential and extracellular recordings of retinal spike trains. 1
5 0.09851186 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
6 0.096153095 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
7 0.088615797 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
8 0.087664537 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
9 0.086237177 173 nips-2013-Least Informative Dimensions
10 0.082051933 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
11 0.081891686 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
12 0.081288703 237 nips-2013-Optimal integration of visual speed across different spatiotemporal frequency channels
13 0.07617943 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
14 0.071588442 201 nips-2013-Multi-Task Bayesian Optimization
15 0.06521178 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
16 0.064344883 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
17 0.061324932 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
18 0.060228698 205 nips-2013-Multisensory Encoding, Decoding, and Identification
19 0.058591589 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
20 0.056969509 137 nips-2013-High-Dimensional Gaussian Process Bandits
topicId topicWeight
[(0, 0.154), (1, 0.055), (2, -0.004), (3, -0.025), (4, -0.101), (5, 0.019), (6, 0.018), (7, 0.072), (8, 0.078), (9, 0.023), (10, -0.077), (11, 0.059), (12, -0.032), (13, -0.025), (14, -0.009), (15, 0.081), (16, -0.087), (17, -0.009), (18, -0.068), (19, -0.008), (20, -0.009), (21, 0.008), (22, -0.037), (23, -0.111), (24, 0.013), (25, -0.006), (26, -0.062), (27, 0.017), (28, -0.0), (29, -0.004), (30, 0.031), (31, -0.102), (32, 0.091), (33, -0.03), (34, -0.058), (35, -0.055), (36, 0.023), (37, -0.151), (38, -0.086), (39, -0.055), (40, -0.014), (41, -0.035), (42, -0.07), (43, -0.123), (44, -0.056), (45, 0.151), (46, -0.026), (47, 0.061), (48, -0.071), (49, -0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.93012017 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
Author: Mijung Park, Jonathan W. Pillow
Abstract: The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of space-time separable (i.e., rank-1) filters. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-10K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. 1
2 0.62033474 305 nips-2013-Spectral methods for neural characterization using generalized quadratic models
Author: Il M. Park, Evan W. Archer, Nicholas Priebe, Jonathan W. Pillow
Abstract: We describe a set of fast, tractable methods for characterizing neural responses to high-dimensional sensory stimuli using a model we refer to as the generalized quadratic model (GQM). The GQM consists of a low-rank quadratic function followed by a point nonlinearity and exponential-family noise. The quadratic function characterizes the neuron’s stimulus selectivity in terms of a set linear receptive fields followed by a quadratic combination rule, and the invertible nonlinearity maps this output to the desired response range. Special cases of the GQM include the 2nd-order Volterra model [1, 2] and the elliptical Linear-Nonlinear-Poisson model [3]. Here we show that for “canonical form” GQMs, spectral decomposition of the first two response-weighted moments yields approximate maximumlikelihood estimators via a quantity called the expected log-likelihood. The resulting theory generalizes moment-based estimators such as the spike-triggered covariance, and, in the Gaussian noise case, provides closed-form estimators under a large class of non-Gaussian stimulus distributions. We show that these estimators are fast and provide highly accurate estimates with far lower computational cost than full maximum likelihood. Moreover, the GQM provides a natural framework for combining multi-dimensional stimulus sensitivity and spike-history dependencies within a single model. We show applications to both analog and spiking data using intracellular recordings of V1 membrane potential and extracellular recordings of retinal spike trains. 1
3 0.58186364 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
4 0.5664261 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In many neural systems, information about stimulus variables is often represented in a distributed manner by means of a population code. It is generally assumed that the responses of the neural population are tuned to the stimulus statistics, and most prior work has investigated the optimal tuning characteristics of one or a small number of stimulus variables. In this work, we investigate the optimal tuning for diffeomorphic representations of high-dimensional stimuli. We analytically derive the solution that minimizes the L2 reconstruction loss. We compared our solution with other well-known criteria such as maximal mutual information. Our solution suggests that the optimal weights do not necessarily decorrelate the inputs, and the optimal nonlinearity differs from the conventional equalization solution. Results illustrating these optimal representations are shown for some input distributions that may be relevant for understanding the coding of perceptual pathways. 1
5 0.55690181 237 nips-2013-Optimal integration of visual speed across different spatiotemporal frequency channels
Author: Matjaz Jogan, Alan Stocker
Abstract: How do humans perceive the speed of a coherent motion stimulus that contains motion energy in multiple spatiotemporal frequency bands? Here we tested the idea that perceived speed is the result of an integration process that optimally combines speed information across independent spatiotemporal frequency channels. We formalized this hypothesis with a Bayesian observer model that combines the likelihood functions provided by the individual channel responses (cues). We experimentally validated the model with a 2AFC speed discrimination experiment that measured subjects’ perceived speed of drifting sinusoidal gratings with different contrasts and spatial frequencies, and of various combinations of these single gratings. We found that the perceived speeds of the combined stimuli are independent of the relative phase of the underlying grating components. The results also show that the discrimination thresholds are smaller for the combined stimuli than for the individual grating components, supporting the cue combination hypothesis. The proposed Bayesian model fits the data well, accounting for the full psychometric functions of both simple and combined stimuli. Fits are improved if we assume that the channel responses are subject to divisive normalization. Our results provide an important step toward a more complete model of visual motion perception that can predict perceived speeds for coherent motion stimuli of arbitrary spatial structure. 1
6 0.45709673 205 nips-2013-Multisensory Encoding, Decoding, and Identification
7 0.44394737 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
8 0.41038167 167 nips-2013-Learning the Local Statistics of Optical Flow
9 0.40865302 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
10 0.40711015 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
11 0.3867242 173 nips-2013-Least Informative Dimensions
12 0.38482952 284 nips-2013-Robust Spatial Filtering with Beta Divergence
13 0.37598196 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
14 0.37541363 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
15 0.37476519 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
16 0.37447515 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
17 0.37265548 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
18 0.36892834 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
19 0.3682518 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
20 0.3657257 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
topicId topicWeight
[(2, 0.013), (16, 0.046), (33, 0.139), (34, 0.165), (41, 0.034), (49, 0.032), (56, 0.053), (70, 0.045), (82, 0.215), (85, 0.028), (89, 0.079), (93, 0.047), (95, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.82860768 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
Author: Mijung Park, Jonathan W. Pillow
Abstract: The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of space-time separable (i.e., rank-1) filters. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-10K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. 1
Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek
Abstract: The Bayesian online change point detection (BOCPD) algorithm provides an efficient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model’s posterior predictives, which can only be computed online in O(1) time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efficient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution. Change point detection has been applied to many applications [5; 7]. In recent years there have been great improvements to the Bayesian approaches via the Bayesian online change point detection algorithm (BOCPD) [1; 23; 27]. Likewise, the radar tracking community has been improving in its use of feature-aided tracking [10]: methods that use auxiliary information from radar returns such as signal-to-noise ratio (SNR), which depend on radar cross sections (RCS) [21]. Older systems would often filter only noisy position (and perhaps Doppler) measurements while newer systems use more information to improve performance. We use BOCPD for modeling the RCS feature. Whereas BOCPD inference could be done exactly when finding change points in conjugate exponential family models, the physics of RCS measurements often causes them to be distributed in non-exponential family ways, often following a Rice distribution. To do inference efficiently we call upon variational Bayes (VB) to find approximate posterior (predictive) distributions. Furthermore, the nature of both BOCPD and tracking requires the use of online updating. We improve upon the existing and limited approaches to online VB [24; 13]. This paper produces contributions to, and builds upon background from, three independent areas: change point detection, variational Bayes, and radar tracking. Although the emphasis in machine learning is on filtering, a substantial part of tracking with radar data involves data association, illustrated in Figure 1. Observations of radar returns contain measurements from multiple objects (targets) in the sky. If we knew which radar return corresponded to which target, we would be presented with NT ∈ N0 independent filtering problems; Kalman filters [14] (or their nonlinear extensions) are applied to “average out” the kinematic errors in the measurements (typically positions) using the measurements associated with each target. The data association problem is to determine which measurement goes to which track. In the classical setup, once a particular measurement is associated with a certain target, that measurement is plugged into the filter for that target as if we knew with certainty it was the correct assignment. The association algorithms, in effect, find the maximum a posteriori (MAP) estimate on the measurement-to-track association. However, approaches such as the joint probabilistic data association (JPDA) filter [2] and the probability hypothesis density (PHD) filter [16] have deviated from this. To find the MAP estimate, a log likelihood of the data under each possible assignment vector a must be computed.
These are then used to construct cost matrices that reduce the assignment problem to a particular kind of optimization problem (the details of which are beyond the scope of this paper). The motivation behind feature-aided tracking is that additional features increase the probability that the MAP measurement-to-track assignment is correct. Based on physical arguments, the RCS feature (SNR) is often Rice distributed [21, Ch. 3], although in certain situations RCS is exponential or gamma distributed [26]. The parameters of the RCS distribution are determined by factors such as the shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS characteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS features may help distinguish one aircraft from another if they cross paths or come near one another, for example. RCS also helps distinguish genuine aircraft returns from clutter: a flock of birds or random electrical noise, for example. However, the parameters of the RCS distributions may also change for the same aircraft due to a change in angle or ground conditions. These must be taken into account for accurate association. Providing good predictions in light of a possible sudden change in the parameters of a time series is “right up the alley” of BOCPD and change point methods. The original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family models for time series. In this paper, we expand the set of applications of BOCPD to radar SNR data, which often has the same change point structure found in other applications, and requires online predictions. The BOCPD model is highly modular in that it looks for changes in the parameters of any underlying process model (UPM). The UPM merely needs to provide posterior predictive probabilities; the UPM can otherwise be a “black box.” The BOCPD queries the UPM for a prediction of the next data point under each possible run length, the number of points since the last change point. If (and only if, by Hipp [12]) the UPM is exponential family (with a conjugate prior), the posterior is computed by accumulating the sufficient statistics since the last potential change point. This allows for O(1) UPM updates in both computation and memory as the run length increases. We motivate the use of VB for implementing UPMs when the data within a regime is believed to follow a distribution that is not exponential family. The methods presented in this paper can be used to find variational run length posteriors for general non-exponential family UPMs in addition to the Rice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are applicable in areas outside of change point detection. [Figure 1 panels: SNR likelihoods for clutter (birds), track 1 (747), and track 2 (EMB 110), over an SNR axis from 0 to 20.] Figure 1: Illustrative example of a tracking scenario: The black lines (−) show the true tracks while the red stars (∗) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions on the states are shown as blue ellipses. The current (+) and previous (×) measurements are connected to their associated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (·). The distributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right.
To our knowledge this paper is the first to demonstrate how to compute Bayesian posterior distributions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15], which computes a MAP estimate. Other novel factors of this paper include: demonstrating the usefulness (and advantages over existing techniques) of change point detection for RCS estimation and tracking; and applying variational inference for UPMs where analytic posterior predictives are not possible. This paper provides four main technical contributions: 1) VB inference for inferring the parameters of a Rice distribution. 2) General improvements to online VB (which is then applied to updating the UPM in BOCPD). 3) A VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handling of censored measurements (particularly for a Rice distribution) in VB. This is key for processing missed detections in data association.
A posterior on θ is maintained for each run length, i.e. every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efficient. Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes rn (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be defined for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efficient O(1) updating in exponential family models by using a conjugate prior and accumulating sufficient statistics is not possible. This motivates the use of VB methods for approximating the UPM predictions. 1.2 Variational Bayes We follow the framework of VB where when computation of the exact posterior distribution p(θ|y1:n ) is intractable it is often possible to create a variational approximation q(θ) that is locally optimal in terms of the Kullback-Leibler (KL) divergence KL(q p) while constraining q to be in a certain family of distributions Q. In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n ), using either gradient based methods or standard fixed point equations. 1 The shape ν is usually assumed to be positive (∈ R+ ); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior. 3 The VB-EM Algorithm In many cases, such as the Rice UPM, the derivation of the VB fixed point equations can be simplified by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n . A model is CE if: 1) The complete data likelihood p(x1:n , y1:n |θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n , y1:n |θ). We only have to constrain the posterior q(θ, x1:n ) = q(θ)q(x1:n ) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n |θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1). Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η := Eq(θ) [η] as correct: ¯ q(xi ) ← p(xi |yi , η = η ). 2) Find the posterior of the parameters using the expected sufficient statis¯ ¯ tics S := Eq(x1:n ) [S(x1:n , y1:n )] as if they were the sufficient statistics for the complete data set: ¯ q(θ) ← p(θ|S(x1:n , y1:n ) = S). The posterior will be of the same exponential family as the prior. 1.3 Tracking In this section we review data association, which along with filtering constitutes tracking. In data association we estimate the association vectors a which map measurements to tracks. At each time NZ (n) step, n ∈ N1 , we observe NZ (n) ∈ N0 measurements, Zn = {zi,n }i=1 , which includes returns from both real targets and clutter (spurious measurements). 
Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R3, or R4 with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an(i) = j if measurement i is associated with track j > 0; an(i) = 0 if measurement i is clutter. The inverse mapping an⁻¹ maps tracks to measurements: meaning an⁻¹(an(i)) = i if an(i) ≠ 0; and an⁻¹(i) = 0 ⇔ an(j) ≠ i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, Nc = 2, and a⁻¹ = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa. In N-D data association we jointly find the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT(n) ∈ N0 total tracks as a known parameter: NT(n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc(n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc(n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n, an−1, an|Z1:n−2) is: Π_{i=1}^{NT} pi(z_{an−1⁻¹(i),n−1}, z_{an⁻¹(i),n}) × Π_{i=n−1}^{n} λ^{Nc(i)} exp(−λ)/|Zi|! Π_{j=1}^{|Zi|} p0(zj,i)^{I{ai(j)=0}}, (5) where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (filter); informally, pi(z) = pi(positions) × pi(RCS). If there is a missed detection, i.e. an⁻¹(i) = 0, we then use pi(z_{an⁻¹(i),n}) = P(R < R0) under the RCS model for track i with no contribution from the positional (kinematic) component. Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can be used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form: (ân−1, ân) = argmax_{(an−1, an)} log P(an−1, an|Z1:n) = argmax_{(an−1, an)} log p(Zn−1:n, an−1, an|Z1:n−2), (6) which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efficiently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20]. If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0. This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3. 2 Online Variational UPMs We cover the four technical challenges for implementing non-exponential family UPMs in an efficient and online manner. We drop the index of the data point i when it is clear from context. 2.1 Variational Posterior for a Rice Distribution The Rice distribution has the property that x ∼ N(ν, σ²), y ∼ N(0, σ²) ⟹ R = √(x² + y²) ∼ Rice(ν, σ). (7)
For simplicity we perform inference on R² rather than R, and transform accordingly:

  x ~ N(ν, σ²) , R² − x² ~ Gamma(1/2, τ/2) , τ := 1/σ² ∈ R⁺
  ⟹ p(R², x) = p(R²|x) p(x) = Gamma(R² − x² | 1/2, τ/2) N(x | ν, σ²) .  (8)

The complete likelihood (8) is the product of two exponential family models and is itself exponential family, parameterized with base measure h and partition factor g:

  η = [ντ, −τ/2] , S = [x, R²] , h(R², x) = (2π √(R² − x²))^{-1} , g(ν, τ) = √τ exp(−ν²τ/2) .

By inspection, the natural parameters η and sufficient statistics S are the same as those of a Gaussian with unknown mean and variance. We therefore place a normal-gamma prior on (ν, τ), the conjugate prior for the complete data likelihood, which allows us to apply the VB-EM algorithm. We use y_i := R_i² as the VB observation, not R_i as in (3); in (5), the final component z_{·,·}(end) of each measurement is the RCS R.

VB M-Step

The posterior updates to the parameters given the expected sufficient statistics are:

  x̄ := (1/n) Σ_{i=1}^n E[x_i] , μ_n = (λ₀μ₀ + Σ_{i=1}^n E[x_i]) / (λ₀ + n) , λ_n = λ₀ + n , α_n = α₀ + n ,  (9)
  β_n = β₀ + ½ Σ_{i=1}^n (E[x_i] − x̄)² + ½ (nλ₀/(λ₀ + n)) (x̄ − μ₀)² + ½ Σ_{i=1}^n (R_i² − E[x_i]²) .  (10)

These are the conjugate updates for an observation from a Gaussian and a gamma that share an (inverse) scale τ.

VB E-Step

We must next compute the expected sufficient statistics S̄. The expectation E[R_i²|R_i²] = R_i² is trivial, leaving E[x_i|R_i]. Recall that the joint distribution of (x, y) is bivariate normal; conditioned on the radius R, the angle ω follows a von Mises (VM) distribution. Therefore,

  ω := arccos(x/R) ~ VM(0, κ) , κ = R E[ντ] ⟹ E[x] = R E[cos ω] = R I₁(κ)/I₀(κ) ,  (11)

where computing κ constitutes the VB E-step and we have used the trigonometric moment of ω [18]. This completes the computations required for the VB updates of the Rice posterior.

Variational Lower Bound

For completeness, and to assess convergence, we derive the VB lower bound L(q). Using the standard formula [4], L(q) = E_q[log p(y_{1:n}, x_{1:n}, θ)] + H[q], we obtain

  L(q) = Σ_{i=1}^n [ E[log(τ/2)] − ½ E[τ] R_i² + (E[ντ] − κ_i/R_i) E[x_i] − ½ E[ν²τ] + log I₀(κ_i) ] − KL(q ‖ p) ,  (12)

where the p in the KL term is the prior on (ν, τ), which is easy to compute since q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates.

2.2 Online Variational Inference

Section 2.1 gives an efficient way to compute the variational posterior for a Rice distribution on a fixed data set. However, as is apparent from (1), we need online predictions from the UPM: we must be able to update the posterior one data point at a time. When the UPM is exponential family and the posterior can be computed exactly, we simply use the posterior from the previous step as the prior. But since we only compute a variational approximation to the posterior, using the previous posterior as the prior does not give the same answer as recomputing the posterior from batch. This leaves two obvious options: 1) recompute the posterior from batch at every update, at O(n) cost; or 2) use the previous posterior as the prior, at O(1) cost but reduced accuracy.

The difference between the options is encapsulated in the expected sufficient statistics: the batch solution uses S̄ = Σ_{i=1}^n E_{q(x_i|y_{1:n})}[S(x_i, y_i)], whereas naive online updating retains old expectations, effectively using S̄ = Σ_{i=1}^n E_{q(x_i|y_{1:i})}[S(x_i, y_i)]. We get the best of both worlds if we can adjust those old estimates over time.
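Before describing that adjustment, here is a compact sketch of the batch VB-EM updates of Section 2.1, i.e. equations (9)-(11). This is our illustration, not the authors' implementation; the hyperparameter defaults are arbitrary choices for demonstration (note that μ₀ = 0 would be a degenerate fixed point, since it forces κ = 0).

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions

def vb_em_rice(R, mu0=1.0, lam0=1.0, alpha0=1.0, beta0=1.0, n_iter=50):
    """Batch VB-EM for Rice(nu, sigma) data under a normal-gamma prior on
    (nu, tau), following updates (9)-(11). Returns (mu_n, lam_n, alpha_n,
    beta_n); posterior means are E[nu] = mu_n and E[tau] = alpha_n / beta_n."""
    R = np.asarray(R, dtype=float)
    n = R.size
    mu_n, lam_n, alpha_n, beta_n = mu0, lam0, alpha0, beta0
    for _ in range(n_iter):
        # E-step (11): kappa_i = R_i E[nu tau]; E[x_i] = R_i I1(kappa)/I0(kappa).
        E_nu_tau = mu_n * alpha_n / beta_n          # E[nu tau] = mu_n E[tau]
        kappa = R * E_nu_tau
        Ex = R * i1e(kappa) / i0e(kappa)            # scaling factors cancel
        # M-step (9)-(10): conjugate normal-gamma updates from expected stats.
        xbar = Ex.mean()
        mu_n = (lam0 * mu0 + Ex.sum()) / (lam0 + n)
        lam_n = lam0 + n
        alpha_n = alpha0 + n
        beta_n = (beta0
                  + 0.5 * np.sum((Ex - xbar) ** 2)
                  + 0.5 * n * lam0 / (lam0 + n) * (xbar - mu0) ** 2
                  + 0.5 * np.sum(R ** 2 - Ex ** 2))
    return mu_n, lam_n, alpha_n, beta_n
```

The scaled Bessel functions i1e/i0e are used because their exponential scaling cancels in the ratio, avoiding overflow for large κ; this mirrors the trigonometric-moment identity in (11).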
We can in fact make this adjustment by projecting the expected sufficient statistics into a "feature space" in terms of the expected natural parameters. For some function f,

  q(x_i) = p(x_i|y_i, η = η̄) ⟹ E_{q(x_i|y_{1:n})}[S(x_i, y_i)] = f(y_i, η̄) .  (13)

If f is piecewise continuous, it can be represented with an inner product [8, Sec. 2.1.6]:

  f(y_i, η̄) = φ(η̄)ᵀψ(y_i) ⟹ S̄ = Σ_{i=1}^n φ(η̄)ᵀψ(y_i) = φ(η̄)ᵀ Σ_{i=1}^n ψ(y_i) ,  (14)

where an infinite-dimensional φ and ψ may be required for an exact representation, but a finite inner product provides a good approximation. In the Rice distribution case we use (11):

  f(y_i, η̄) = E[x_i] = R_i Ĩ(R_i E[ντ]) = R_i Ĩ((R_i/μ₀) · μ₀E[ντ]) , Ĩ(·) := I₁(·)/I₀(·) ,  (15)

where recall that y_i = R_i² and η̄₁ = E[ντ]. We can easily represent f with an inner product if we can represent Ĩ as one: Ĩ(uv) = φ(u)ᵀψ(v). We use unitless features φ_i(u) = Ĩ(c_i u), with c_{1:G} a log-linear grid from 10⁻² to 10³ and G = 50, and a lookup table for ψ(v) trained to match Ĩ using non-negative least squares, which yields a sparse table. Online updating for VB posteriors was also developed in [24; 13]; those methods introduce forgetting factors to discard contributions from old data points that might harm accuracy. Because our VB predictions are embedded in a change point method, they are automatically phased out when the posterior predictions become inaccurate, making forgetting factors unnecessary.

2.3 Censored Data

As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely compute the expected sufficient statistics given the censored measurement, E[S|R < R₀]. The expected sufficient statistic from (11) becomes

  E[x|R < R₀] = ∫₀^{R₀} E[x|R] p(R) dR / RiceCDF(R₀|ν, τ) = ν (1 − Q₂(ν/σ, R₀/σ)) / (1 − Q₁(ν/σ, R₀/σ)) ,

where Q_M is the Marcum Q function [17] of order M. Analogous updates of E[S|R < R₀] are possible for exponential or gamma UPMs; they are relatively easy to derive and are not shown.

2.4 Variational Run Length Posteriors: Predictive Log Likelihoods

Both updating the BOCPD run length posterior (1) and computing the marginal predictive log likelihood of the next point (2) require the UPM's posterior predictive log likelihood, log p(y_{n+1}|r_n, y^{(r)}). The marginal posterior predictive from (2) is used in data association (6) and in benchmarking BOCPD against other methods. However, the exact posterior predictive distribution, obtained by integrating the Rice likelihood against the VB posterior, is difficult to compute. We can break the BOCPD update (1) into a time update and a measurement update; the measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(r_n|y_{1:n}):

  p(r_n|y_{1:n+1}) ∝ p(y_{n+1}|r_n, y^{(r)}) p(r_n|y_{1:n}) .  (16)

Following the BMC results in Bishop [4, Sec. 10.1.4], we obtain a variational posterior on the run length by using the variational lower bound for each run length, L_i(q) ≤ log p(y_{n+1}|r_n = i, y^{(r)}), computed via (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q has the form

  q(r_n, θ, x) = q_UPM(θ, x|r_n) q(r_n) ⟹ q(r_n = i) = exp(L_i(q)) p(r_n = i|y_{1:n}) / exp(L(q)) ,  (17)

where q_UPM contains whatever constraints were used to compute L_i(q). The normalizer of q(r_n) serves as a joint VB lower bound: L(q) = log Σ_i exp(L_i(q)) p(r_n = i|y_{1:n}) ≤ log p(y_{n+1}|y_{1:n}).
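The measurement update (16)-(17) is a few lines of numerically stable code once the per-run-length bounds L_i(q) are available from (12). The following is a minimal sketch under the stated factorization, with array names of our choosing; `log_prior_r` is the hazard-propagated run length prior p(r_n | y_{1:n}) in log space.

```python
import numpy as np
from scipy.special import logsumexp

def run_length_measurement_update(log_prior_r, L):
    """Variational measurement update (16)-(17).

    log_prior_r : log p(r_n = i | y_{1:n}) over the retained run lengths.
    L           : per-run-length variational lower bounds L_i(q) from (12),
                  used as proxies for log p(y_{n+1} | r_n = i, y^(r)).
    Returns the variational run length posterior q(r_n) and the joint
    bound L(q) <= log p(y_{n+1} | y_{1:n})."""
    log_joint = L + log_prior_r        # log [ exp(L_i) p(r_n = i | y_{1:n}) ]
    L_joint = logsumexp(log_joint)     # the normalizer, a joint VB lower bound
    q_r = np.exp(log_joint - L_joint)  # normalized q(r_n = i) as in (17)
    return q_r, L_joint
```

In a full BOCPD loop one would alternate this with the hazard-based time update and use `L_joint` wherever the marginal predictive log likelihood is needed, e.g. in the data association score of (5).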
Note that the conditional factorization in (17) differs from the typical independence constraint on q. Furthermore, we cast the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment, and constrain the assignment posterior to be a point mass. In the 2D assignment case, for example,

  q(a_n, X_{1:N_T}) = q(X_{1:N_T}|a_n) q(a_n) = q(X_{1:N_T}|a_n) I{a_n = â_n} ,  (18)

where each track's X_i represents all the latent variables used to compute the variational lower bound on log p(z_{j,n}|a_n(j) = i); in the BOCPD case, X_i := {r_n, x, θ}. The resulting VB fixed point equations find the posterior on the latent variables X_i by taking â_n as the true assignment and solving the VB problem of (17); the assignment â_n is found by applying (6), with the joint BOCPD lower bound L(q) serving as a proxy for the BOCPD predictive log likelihood component of log p_i in (5).

Figure 2: Left: KL from naive updating, Sato's method [24], and improved online VB (◦) to the batch VB posterior vs. sample size n, using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue), with τ = 1. Middle: the RMSE (dB scale) of the estimate of the mean RCS E[R_n] for an exponential RCS model; the curves are BOCPD (blue), IMM (black), identity (magenta), α-filter (green), and median filter (red). Right: same as the middle panel but for the Rice RCS case. The dashed lines are 95% confidence intervals. [Panels: (a) Online Updating, KL (nats) vs. sample size; (b) Exponential RCS, RCS RMSE (dBsm) vs. time; (c) Rice RCS, RCS RMSE (dBsm) vs. time.]

3 Results

3.1 Improved Online Solution

We first demonstrate the accuracy of the online VB approximation (Section 2.2) on a Rice estimation example; here we test only the VB posterior, with no change point detection applied. Figure 2(a) compares naive online updating, Sato's method [24], and our improved online updating in KL(online ‖ batch) of the posteriors, for three different true parameters ν, as the sample size n increases. Each performance curve is the KL divergence between the online approximation and the batch VB solution (i.e. restarting VB from scratch at every new data point) as a function of sample size. The error of our method stays around a modest 10⁻² nats, while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]; Sato's method tends to settle at around 1 nat of approximation error. The annealing schedule (i.e. forgetting factors) recommended in [24] performed worse than naive updating, so we did a grid search over annealing exponents and report results for the best-performing schedule, n^{−0.52}. By contrast, our method requires no tuning of an annealing schedule.

3.2 RCS Estimation Benchmarking

We now compare BOCPD with other methods for RCS estimation. We use the same experimental example as Slocumb and Klusman III [25], which estimates the RCS with an augmented interacting multiple model (IMM) based method; we also compare against the α-filter and median filter used in [25]. As a reference point, we include the "identity filter," an unbiased filter that uses only y_n to estimate the mean RCS E[R_n] at time step n. We extend this example to the Rice RCS case in addition to the exponential RCS case.
The bias correction constants in the IMM were adjusted for the Rice distribution case following [25, Sec. 3.4]. The results for the exponential distributions used in [25] and for the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR in multiples of ±10 dB, which is exactly what the example presents (a sequence of 20, 10, 30, and 10 dB). The authors of [25] note that the IMM reaches an RMSE "floor" at 2 dB, whereas BOCPD continues to drop, reaching as low as 0.56 dB. The RMSE from BOCPD also does not spike nearly as high as the other methods after a change in E[R_n]. The α-filter and median filter perform worse than both the IMM and BOCPD. The RMSE values and confidence intervals are computed from 5000 runs of the experiment.

Figure 3: Left: average relative improvements (%) in SIAP metrics: position accuracy (red), velocity accuracy (green), and spurious tracks (blue ◦), across difficulty levels. Right: LHR; true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). The standard tracker produces spurious tracks over east London and near Ipswich. Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies). [Panels: (a) SIAP Metrics, Improvement (%) vs. difficulty; (b) Heathrow (LHR), Northing (km) vs. Easting (km).]

3.3 Flightradar24 Tracking Problem

Finally, we used real flight trajectories from flightradar24 and plugged them into our 3D tracking algorithm, comparing tracking performance between our BOCPD model and the relatively standard constant-probability-of-detection (no RCS) setup [2, Sec. 3.5]. We use the single integrated air picture (SIAP) metrics [6], a standard set of metrics for comparing tracking systems, to demonstrate the improved performance. We broke the data into 30 regions during a one-hour period (in Sept. 2012), sampled every 5 s, each within a 200 km by 200 km area centered on one of the world's 30 busiest airports [22]. Commercial airport traffic is typically very orderly and does not allow aircraft to fly close to one another or cross paths; feature-aided tracking is most necessary in scenarios with a more chaotic air situation. We therefore took random subsets of 10 flight paths and randomly shifted their start times to create scenarios of greater interest. The resulting SIAP metric improvements are shown in Figure 3(a), organized by a difficulty metric: the number of times in a scenario that any two aircraft come within ~400 m of each other. The biggest improvements are seen at difficulties above three, where positional accuracy increases by 30%. Significant improvements are also seen in velocity accuracy (11%) and in the frequency of spurious tracks (6%), and significant performance gains appear at all difficulty levels considered. The larger improvements at level three compared to level five may be due to some level-five scenarios that are not resolvable simply through more sophisticated models. Figure 3(b) demonstrates how our RCS methods prevent the creation of spurious tracks around London Heathrow.
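For concreteness, the difficulty metric above can be computed as follows. The text admits two readings (count of close-pair events vs. count of time steps with a close pair); this sketch takes the latter, and the array layout is our assumption, not the authors' specification.

```python
import numpy as np

def difficulty(tracks, threshold_m=400.0):
    """Scenario difficulty as in Figure 3(a): the number of time steps at
    which any two aircraft come within ~400 m of each other.

    tracks : assumed array of shape (n_aircraft, n_times, 3), positions in
             metres; names and shapes are illustrative."""
    n_aircraft, n_times, _ = tracks.shape
    count = 0
    for t in range(n_times):
        pos = tracks[:, t, :]
        # Pairwise Euclidean distances between all aircraft at this time step.
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        iu = np.triu_indices(n_aircraft, k=1)  # upper triangle: distinct pairs
        count += int(np.any(d[iu] < threshold_m))
    return count
```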
4 Conclusions

We have demonstrated that sophisticated recent developments in machine learning, such as BOCPD, combined with the modern inference method of VB, can produce demonstrable improvements in the much more mature field of radar tracking. We first closed a "hole" in the literature in Section 2.1 by deriving variational inference for the parameters of a Rice distribution, a model with inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that these variational UPMs can be used for non-exponential-family models in BOCPD without sacrificing its modular or online nature. The improvements in online VB extend to UPMs beyond the Rice distribution, and more generally beyond change point detection. Using the variational lower bound from the UPM yields a principled variational approximation to the run length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, in sharp contrast to the tracking literature. More algorithms from the tracking literature can likely be cast in machine learning frameworks such as VB, and improved upon from there.

References

[1] Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK.
[2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing.
[3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464.
[4] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[5] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301–314.
[6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC.
[7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92(438):739–747.
[8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience.
[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
[10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In Information Fusion, 2006 9th International Conference on, pages 1–8. IEEE.
[11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605.
[12] Hipp, C. (1974). Sufficient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292.
[13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808.
[14] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.
[15] Lauwers, L., Barbé, K., Van Moer, W., and Pintelon, R. (2009). Estimating the parameters of a Rice distribution: A Bayesian approach.
In Instrumentation and Measurement Technology Conference, 2009. I2MTC'09. IEEE, pages 114–117. IEEE.
[16] Mahler, R. (2003). Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. AES, 39(4):1152–1178.
[17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA.
[18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York.
[19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.
[20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics.
[21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Pub.
[22] Rogers, S. (2012). The world's top 100 airports: listed, ranked and mapped. The Guardian.
[23] Saatçi, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress.
[24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
[25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics.
[26] Swerling, P. (1954). Probability of detection for fluctuating targets. Technical Report RM-1217, Rand Corporation.
[27] Turner, R. (2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK.
3 0.71506137 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stöckl, Jan Grewe, Jan Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
4 0.71371931 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
Author: David Pfau, Eftychios A. Pnevmatikakis, Liam Paninski
Abstract: Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method extends PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing. 1
5 0.70916408 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
Author: Ben Shababo, Brooks Paige, Ari Pakman, Liam Paninski
Abstract: With the advent of modern stimulation techniques in neuroscience, the opportunity arises to map neuron to neuron connectivity. In this work, we develop a method for efficiently inferring posterior distributions over synaptic strengths in neural microcircuits. The input to our algorithm is data from experiments in which action potentials from putative presynaptic neurons can be evoked while a subthreshold recording is made from a single postsynaptic neuron. We present a realistic statistical model which accounts for the main sources of variability in this experiment and allows for significant prior information about the connectivity and neuronal cell types to be incorporated if available. Due to the technical challenges and sparsity of these systems, it is important to focus experimental time stimulating the neurons whose synaptic strength is most ambiguous, therefore we also develop an online optimal design algorithm for choosing which neurons to stimulate at each trial. 1
6 0.70524156 86 nips-2013-Demixing odors - fast inference in olfaction
7 0.70331842 201 nips-2013-Multi-Task Bayesian Optimization
8 0.70222116 148 nips-2013-Latent Maximum Margin Clustering
9 0.70175433 183 nips-2013-Mapping paradigm ontologies to and from the brain
10 0.70091736 350 nips-2013-Wavelets on Graphs via Deep Learning
12 0.69984692 143 nips-2013-Integrated Non-Factorized Variational Inference
13 0.69877946 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
14 0.69749683 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.69726586 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
16 0.69660944 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
17 0.69634521 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
18 0.69550478 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
19 0.69535142 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
20 0.69489157 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models