nips nips2011 nips2011-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. [sent-3, score-0.273]
2 The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. [sent-4, score-0.164]
3 Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. [sent-5, score-0.224]
4 We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. [sent-7, score-0.356]
5 We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. [sent-8, score-0.63]
6 We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. [sent-10, score-0.197]
7 In fMRI data or electrophysiological data from awake behaving humans and animals, for instance, the multivariate data may be the voxels of brain activity or the firing rates of a population of neurons, and the parameters may be sensory stimuli, behavioral choices, or simply the passage of time. [sent-12, score-0.23]
8 Such data sets can be analyzed with principal component analysis (PCA) and related dimensionality reduction methods [4, 2]. [sent-14, score-0.194]
9 On the other hand, dimensionality reduction methods that can take parameters into account, such as canonical correlation analysis (CCA) or partial least squares (PLS) [1, 5], impose a specific model of how the data depend on the parameters (e. [sent-17, score-0.127]
10 We illustrate these issues with neural recordings collected from the prefrontal cortex (PFC) of monkeys performing a two-frequency discrimination task [9, 3, 7]. [sent-20, score-0.135]
11 In this experiment a monkey received two mechanical vibrations with frequencies f1 and f2 on its fingertip, delayed by three seconds. [sent-21, score-0.079]
12 The monkey then had to make a binary decision d depending on whether f1 > f2 . [sent-22, score-0.131]
13 The firing rates of three neurons (out of a total of 842) are plotted in Fig. [sent-24, score-0.124]
14 The responses of the neurons mix information about the different task parameters, a common observation for data sets of recordings in higher-order brain areas, and a problem that complicates interpretation of the data. [sent-26, score-0.192]
15 Here we address this problem by modifying PCA such that the principal components depend on individual task parameters while still capturing as much variance as possible. [sent-27, score-0.385]
16 Previous work has addressed the question of how to demix data depending on two [7] or several parameters [8], but did not allow components that capture nonlinear mixtures of parameters. [sent-28, score-0.305]
17 2 Principal component analysis and the demixing problem. The firing rates of the neurons in our dataset depend on three external parameters: the time t, the stimulus s = f1, and the decision d of the monkey. [sent-30, score-0.531]
18 We omit the second frequency f2 since this parameter is highly correlated with f1 and d (the monkey makes errors in < 10% of the trials). [sent-31, score-0.079]
19 Each sample of firing rates in the population, yn , is therefore tagged with parameter values (tn , sn , dn ). [sent-32, score-0.196]
20 For notational simplicity, we will assume that each data point is associated with a unique set of parameter values so that the parameter values themselves can serve as indices for the data points yn . [sent-33, score-0.117]
21 In turn, we drop the index n, and simply write ytsd . [sent-34, score-0.377]
22 The main aim of PCA is to find a new coordinate system in which the data can be represented in a more succinct and compact fashion. [sent-35, score-0.103]
23 The covariance matrix of the firing rates summarizes the second-order statistics of the data set, C = ⟨y_tsd y_tsd^T⟩_tsd (1), and has size D × D where D is the number of neurons in the data set (we will assume the data are centered throughout the paper). [sent-36, score-0.976]
24 Given the covariance matrix, we can compute the firing rate variance that falls along arbitrary directions in state space. [sent-38, score-0.143]
25 For instance, the variance captured by a coordinate axis given by a normalized vector w is simply L = w^T C w (2). [sent-39, score-0.157]
26 The first principal component corresponds to the axis that captures most of the variance of the data, and thereby maximizes the function L subject to the normalization constraint w^T w = 1. [sent-40, score-0.357]
27 The second principal component maximizes variance in the orthogonal subspace and so on [4, 2]. [sent-41, score-0.317]
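As a concrete illustration of Eqs. (1) and (2), the following numpy sketch (illustrative code, not taken from the paper; data shapes and names are assumptions) computes the covariance of a centered data matrix, the variance captured along an arbitrary direction w, and the leading principal components as the top eigenvectors of C.

```python
import numpy as np

def pca_directions(Y, q):
    """Y: (N, D) centered data matrix. Returns the top-q principal axes
    (columns of W) and the covariance matrix C of Eq. (1)."""
    N, D = Y.shape
    C = Y.T @ Y / N                       # Eq. (1): D x D covariance
    evals, evecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]       # sort by captured variance
    return evecs[:, order[:q]], C

def variance_along(w, C):
    """Eq. (2): variance captured by the normalized direction w."""
    w = w / np.linalg.norm(w)
    return float(w @ C @ w)

# toy usage on random centered data
rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 10))
Y -= Y.mean(axis=0)
W, C = pca_directions(Y, q=3)
print(variance_along(W[:, 0], C))         # variance of the first component
```

The eigendecomposition route is equivalent to greedily maximizing L = w^T C w under the unit-norm and orthogonality constraints described above.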
28 PCA succeeds nicely in summarizing the population response for our data set: the first ten principal components capture more than 90% of the variance of the data. [sent-42, score-0.407]
29 Whether firing rates have changed due to the first stimulus frequency s = f1 , due to the passage of time, t, or due to the decision, d, they will enter equally into the computation of the covariance matrix and therefore do not influence the choice of the coordinate system constructed by PCA. [sent-44, score-0.242]
30 To clarify this observation, we will segregate the data ytsd into pieces capturing the variability caused by different parameters. [sent-45, score-0.443]
31 The rainbow colors indicate different stimulus frequencies f1 , black and gray indicate the decision of the monkey during the interval [3. [sent-51, score-0.221]
32 (Bottom row) Relative contribution of time (blue), stimulus (light blue), decision (green), and non-linear mixtures (yellow) to the total variance for a sample of 14 neurons (left), the top 14 principal components (middle), and naive demixing (right). [sent-54, score-0.807]
33 In yts , we subtract all variation due to t or s individually, leaving only variation that depends on combined changes of (t, s). [sent-56, score-0.089]
34 These marginalized averages are orthogonal, so that ⟨ȳ_φ ȳ_φ'^T⟩ = 0 for all φ, φ' ⊆ S with φ ≠ φ' (6). [sent-57, score-0.264]
35 At the same time, their sum reconstructs the original data, y_tsd = ȳ_t + ȳ_s + ȳ_d + ȳ_ts + ȳ_td + ȳ_sd + ȳ_tsd (7). [sent-58, score-1.195]
36 The latter two properties allow us to segregate the covariance matrix of y_tsd into ‘marginalized covariance matrices’ that capture the variance in a subset of parameters φ ⊆ S, C = C_t + C_s + C_d + C_ts + C_td + C_sd + C_tsd, with C_φ = ⟨ȳ_φ ȳ_φ^T⟩. [sent-59, score-0.7]
37 For a given component w, the marginalized covariance matrices allow us to calculate the variance x_φ of w conditioned on φ ⊆ S as x_φ² = w^T C_φ w, so that the total variance is given by L = Σ_φ x_φ² =: ||x||². [sent-62, score-0.476]
38 Using this segregation, we are able to examine the distribution of variance in the PCA components and the original data. [sent-63, score-0.193]
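The marginalization described above translates directly into array operations. The sketch below (illustrative code for a complete, balanced design, not the authors' implementation) builds the marginalized averages so that they sum back to y_tsd as in Eq. (7), forms the marginalized covariances C_φ, and evaluates the variance profile x_φ² = w^T C_φ w of a given component.

```python
import numpy as np

def marginalize(Y):
    """Y: (T, S, K, D) population response y_tsd for T time bins, S stimuli,
    K decisions and D neurons. Returns the marginalized averages ybar_phi."""
    Y = Y - Y.mean(axis=(0, 1, 2), keepdims=True)      # center over all conditions
    m = {}
    m['t'] = Y.mean(axis=(1, 2), keepdims=True)        # depends on t only
    m['s'] = Y.mean(axis=(0, 2), keepdims=True)
    m['d'] = Y.mean(axis=(0, 1), keepdims=True)
    m['ts'] = Y.mean(axis=2, keepdims=True) - m['t'] - m['s']
    m['td'] = Y.mean(axis=1, keepdims=True) - m['t'] - m['d']
    m['sd'] = Y.mean(axis=0, keepdims=True) - m['s'] - m['d']
    m['tsd'] = Y - sum(np.broadcast_to(v, Y.shape) for v in m.values())
    return m                                            # pieces sum to Y, Eq. (7)

def marginalized_covariances(Y):
    """C_phi = <ybar_phi ybar_phi^T>; for a balanced design they sum to C."""
    T, S, K, D = Y.shape
    covs = {}
    for phi, ybar in marginalize(Y).items():
        flat = np.broadcast_to(ybar, Y.shape).reshape(-1, D)
        covs[phi] = flat.T @ flat / (T * S * K)
    return covs

def variance_profile(w, covs):
    """x_phi^2 = w^T C_phi w for each parameter subset phi; sums to w^T C w."""
    w = w / np.linalg.norm(w)
    return {phi: float(w @ C_phi @ w) for phi, C_phi in covs.items()}
```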
39 The left plot shows that individual neurons carry varying degrees of information about the different task parameters, reaffirming the heterogeneity of neural responses. [sent-66, score-0.119]
40 To improve visualization of the data and to facilitate the interpretation of individual components, we would prefer components that depend on only a single parameter, or, more generally, that depend on the smallest number of parameters possible. [sent-68, score-0.227]
41 At the same time, we would want to keep the attractive properties of PCA in which every component captures as much variance as possible about the data. [sent-69, score-0.15]
42 Naively, we could simply combine eigenvectors from the marginalized covariance matrices. [sent-70, score-0.268]
43 For example, consider the first Q eigenvectors of each marginalized covariance matrix. [sent-71, score-0.268]
44 Whether a solution falls into the center or along the axis does not matter, as long as it captures a maximum of overall variance. [sent-74, score-0.077]
45 The dPCA objective functions (with parameters λ = 1 and λ = 4) prefer solutions along the axes over solutions in the center, even if the solutions along the axes capture less overall variance. [sent-75, score-0.284]
46 We can then apply orthogonalization to these eigenvectors and choose the Q coordinates that capture the most variance. [sent-76, score-0.095]
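As a reference point, the naive-demixing baseline just described can be sketched as follows (illustrative code; QR is used here as a simple stand-in for the orthogonalization step): pool the leading eigenvectors of every marginalized covariance matrix, orthogonalize the pool, and keep the Q directions that capture the most overall variance.

```python
import numpy as np

def naive_demixing(covs, C, q_per_marginal=3, Q=10):
    """covs: dict of D x D marginalized covariances C_phi; C: full covariance.
    Returns Q orthonormal directions chosen by pooled-eigenvector heuristics."""
    pool = []
    for C_phi in covs.values():
        evals, evecs = np.linalg.eigh(C_phi)
        top = np.argsort(evals)[::-1][:q_per_marginal]
        pool.append(evecs[:, top])
    V = np.concatenate(pool, axis=1)
    U, _ = np.linalg.qr(V)                        # orthonormal columns
    scores = np.array([u @ C @ u for u in U.T])   # variance along each direction
    keep = np.argsort(scores)[::-1][:Q]
    return U[:, keep]
```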
47 While the parameter dependence of the components is sparser than in PCA, there is a strong bias towards time, and variance induced by the decision of the monkey is squeezed out. [sent-79, score-0.324]
48 As a further drawback, naive demixing covers only 84. [sent-80, score-0.175]
49 3 Demixed principal component analysis (dPCA): Loss function. With respect to the segregated covariances, the PCA objective function, Eq. [sent-84, score-0.272]
50 This function is illustrated in Fig. 2 (left), and shows that PCA will maximize variance, no matter whether this variance comes about through a single marginalized variance, or through mixtures thereof. [sent-86, score-0.392]
51 Consequently, we need to modify this objective function such that solutions w that do not mix variances—thereby falling along one of the axes in x-space—are favored over solutions w that fall into the center in x-space. [sent-87, score-0.214]
52 Hence, we seek an objective function L = L(x) that grows monotonically with any xφ such that more variance is better, just as in PCA, and that grows faster along the axes than in the center so that mixtures of variances get punished. [sent-88, score-0.331]
53 Here, solutions w that lead to mixtures of variances are penalized relative to solutions that do not mix variances. [sent-92, score-0.212]
54 LdPCA = x 2 2 Note that the objective function is a function of the coordinate axis w, and the aim is to maximize LdPCA with respect to w. [sent-93, score-0.191]
55 Extending to several components w1, . . . , wQ is straightforward by maximizing L in steps for every component and ensuring orthonormality by means of symmetric orthogonalization [6] after each step. [sent-97, score-0.095]
56 We call the resulting algorithm demixed principal component analysis (dPCA), since it essentially can be seen as a generalization of standard PCA. [sent-98, score-0.327]
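The symmetric orthogonalization step invoked above has a compact closed form, W_orth = W (W^T W)^{-1/2}; the sketch below (generic code, not the authors' implementation) computes it via the thin SVD.

```python
import numpy as np

def symmetric_orthogonalization(W):
    """Loewdin orthogonalization: the closest matrix with orthonormal columns,
    W_orth = W (W^T W)^{-1/2}, computed from the thin SVD of W."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```

Unlike Gram-Schmidt, this treats all columns symmetrically, so no single component is privileged when orthonormality is re-imposed after each greedy update.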
57 4 Probabilistic principal component analysis with orthogonality constraint. We introduced dPCA by means of a modification of the objective function of PCA. [sent-99, score-0.319]
58 However, we aim for a superior algorithm by framing dPCA in a probabilistic framework. [sent-102, score-0.095]
59 Since the probabilistic treatment of dPCA requires two modifications over the conventional expectation-maximization (EM) algorithm for probabilistic PCA (PPCA), we here review PPCA [11, 10], and show how to introduce an explicit orthogonality constraint on the mixing matrix. [sent-105, score-0.193]
60 In PPCA, the observed data y are linear combinations of latent variables z, y = Wz + ε_y (9), where ε_y ∼ N(0, σ² I_D) is isotropic Gaussian noise with variance σ² and W ∈ R^{D×Q} is the mixing matrix. [sent-106, score-0.172]
61 The latent variables are assumed to follow a zero-mean, unit-covariance Gaussian prior, p(z) = N (z|0, IQ ). [sent-108, score-0.083]
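As a quick illustration of the generative model in Eq. (9) with the unit-covariance prior on z, the following snippet samples a synthetic data set; all sizes and the noise level are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D, Q, N, sigma = 20, 3, 1000, 0.5          # illustrative dimensions and noise level

W = rng.standard_normal((D, Q))            # mixing matrix W
Z = rng.standard_normal((N, Q))            # latents, z_n ~ N(0, I_Q)
E = sigma * rng.standard_normal((N, D))    # isotropic Gaussian noise
Y = Z @ W.T + E                            # Eq. (9): y_n = W z_n + noise
```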
62 We write Y = {yn} for the observed data, n = 1, . . . , N, and Z = {zn} for the corresponding values of the latent variables. [sent-113, score-0.083]
63 Our aim is to maximize the likelihood of the data, p(Y) = Π_n p(y_n), with respect to the parameters W and σ. [sent-114, score-0.119]
64 The posterior distribution p(Z|Y) is again Gaussian and given by p(Z|Y) = Π_{n=1..N} N(z_n | M^{-1} W^T y_n, σ² M^{-1}) (10), with M = W^T W + σ² I_Q. [sent-119, score-0.397]
65 Mean and covariance can be read off the arguments, and we note in particular that E[z_n z_n^T] = σ² M^{-1} + E[z_n] E[z_n]^T. [sent-120, score-0.334]
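Eq. (10), together with the second-moment identity just noted, gives the complete E-step; below is a numpy sketch under illustrative conventions (rows of Y are data points), not the authors' implementation.

```python
import numpy as np

def ppca_e_step(Y, W, sigma2):
    """Posterior moments of the latents, Eq. (10). Y: (N, D) data,
    W: (D, Q) mixing matrix. Returns E[z_n] (N, Q) and E[z_n z_n^T] (N, Q, Q)."""
    Q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(Q)          # M = W^T W + sigma^2 I_Q
    Minv = np.linalg.inv(M)                   # symmetric, so no transpose needed
    Ez = Y @ W @ Minv                         # rows: M^{-1} W^T y_n
    Ezz = sigma2 * Minv + Ez[:, :, None] * Ez[:, None, :]
    return Ez, Ezz
```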
66 We can then take the expectation of the complete-data log likelihood with respect to this posterior distribution, so that E[ln p(Y, Z | W, σ²)] = − Σ_{n=1..N} [ (D/2) ln 2πσ² + (1/2σ²) ||y_n||² − (1/σ²) E[z_n]^T W^T y_n + (1/2σ²) Tr(E[z_n z_n^T] W^T W) + (Q/2) ln(2π) + (1/2) Tr(E[z_n z_n^T]) ] (11). [sent-121, score-1.636]
67 For σ, we obtain (σ*)² = (1/ND) Σ_{n=1..N} [ ||y_n||² − 2 E[z_n]^T W^T y_n + Tr(E[z_n z_n^T] W^T W) ] (12). [sent-125, score-0.794]
68 For W, we need to deviate from the conventional PPCA algorithm, since the development of probabilistic dPCA requires an explicit orthogonality constraint on W, which had so far not been included in PPCA. [sent-126, score-0.142]
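The closed-form update of Eq. (12) then reads, using the posterior moments returned by the E-step sketch above (again illustrative code only):

```python
import numpy as np

def update_sigma2(Y, W, Ez, Ezz):
    """Eq. (12): closed-form M-step update for the noise variance."""
    N, D = Y.shape
    total = (np.sum(Y ** 2)
             - 2.0 * np.sum(Ez * (Y @ W))               # sum_n 2 E[z_n]^T W^T y_n
             + np.einsum('nij,ij->', Ezz, W.T @ W))     # sum_n Tr(E[z_n z_n^T] W^T W)
    return total / (N * D)
```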
69 To impose this constraint, we factorize W into an orthogonal and a diagonal matrix, W = UΓ with U^T U = I_Q (13), where U ∈ R^{D×Q} has orthogonal columns of unit length and Γ ∈ R^{Q×Q} is diagonal. [sent-127, score-0.106]
70 Here, the data y are projected on a subspace z of latent variables. [sent-136, score-0.083]
71 Each latent variable zi depends on a set of parameters θj ∈ S. [sent-137, score-0.159]
72 To ease interpretation of the latent variables zi , we impose a sparse mapping between the parameters and the latent variables. [sent-138, score-0.315]
73 5 Probabilistic demixed principal component analysis. We described a PPCA EM-algorithm with an explicit constraint on the orthogonality of the columns of W. [sent-141, score-0.418]
74 So far, variance due to different parameters in the data set is completely mixed in the latent variables z. [sent-142, score-0.232]
75 The essential idea of dPCA is to demix these parameter dependencies by sparsifying the mapping from parameters to latent variables (see Fig. [sent-143, score-0.243]
76 Since we do not want to impose the nature of this mapping (which is to remain non-parametric), we suggest a model in which each latent variable zi is segregated into (and replaced by) a set of R latent variables {zφ,i }, each of which depends on a subset φ ⊆ S of parameters. [sent-145, score-0.294]
77 The priors over the latent variables are specified as p(z_φ) = N(z_φ | 0, diag Λ_φ) (20), where Λ_φ is a row in Λ ∈ R^{R×Q}, the matrix of variances for all latent variables. [sent-149, score-0.211]
78 The covariance of the sum of the latent variables shall again be the identity matrix, Σ_φ diag Λ_φ = I_Q. [sent-150, score-0.217]
79 As before, we will use the EM-algorithm to maximize the model evidence p(Y) with respect to the parameters Λ, W, σ. [sent-152, score-0.075]
80 However, we additionally impose that each column Λ_i of Λ shall be sparse, thereby ensuring that the diversity of parameter dependencies of the latent variables z_i = Σ_φ z_{φ,i} is reduced. [sent-153, score-0.266]
81 Due to the implicit parameter dependencies of the latent variables, the sets of variables Z_φ = {z_φ^n} can only depend on the respective marginalized averages of the data. [sent-158, score-0.406]
82 For three parameters, the marginalized averages of y_n were specified in Eq. [sent-160, score-0.347]
83 In turn, the posterior of Z_φ takes the form p(Z_φ | Ȳ_φ) = Π_{n=1..N} N(z_φ^n | M_φ^{-1} W^T ȳ_φ^n, σ² M_φ^{-1}) (24), where M_φ = W^T W + σ² diag(Λ_φ)^{-1}. [sent-165, score-0.36]
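Under this model the E-step mirrors the PPCA E-step, except that each group of latent variables Z_φ sees its marginalized average ȳ_φ^n and its own prior variances Λ_φ, as in Eq. (24). A hedged sketch follows, with Λ stored as a dict of length-Q vectors (our convention, not the paper's).

```python
import numpy as np

def dpca_e_step(Ybar, W, sigma2, Lambda):
    """Ybar: dict phi -> (N, D) marginalized averages ybar_phi^n.
    Lambda: dict phi -> (Q,) prior variances of the latents z_phi.
    Returns posterior means and second moments per marginalization, Eq. (24)."""
    Ez, Ezz = {}, {}
    for phi, Yb in Ybar.items():
        prec = np.diag(1.0 / np.maximum(Lambda[phi], 1e-12))  # diag(Lambda_phi)^{-1}
        M_phi = W.T @ W + sigma2 * prec
        Minv = np.linalg.inv(M_phi)
        Ez[phi] = Yb @ W @ Minv                # rows: M_phi^{-1} W^T ybar_phi^n
        Ezz[phi] = sigma2 * Minv + Ez[phi][:, :, None] * Ez[phi][:, None, :]
    return Ez, Ezz
```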
84 In analogy to Eq. (11), E[ln p(Y, Z | W, σ²)] = − Σ_{n=1..N} [ (D/2) ln 2πσ² + (1/2σ²) ||y_n||² + Σ_{φ⊆S} ( (Q/2) ln(2π) + (1/2σ²) Tr(E[z_φ^n z_φ^{nT}] W^T W) − (1/σ²) E[z_φ^n]^T W^T ȳ_φ^n + (1/2) ln det diag(Λ_φ) + (1/2) Tr(E[z_φ^n z_φ^{nT}] diag(Λ_φ)^{-1}) ) ] (26). [sent-167, score-2.17]
85 Eq. (26) shows that the maximum-likelihood estimates of W = UΓ and of σ² are unchanged (this can be seen by substituting z for the sum of marginalized averages, z = Σ_φ z_φ, so that E[z] = Σ_φ E[z_φ] and E[z z^T] = Σ_φ E[z_φ z_φ^T]). [sent-171, score-0.183]
86 Second, since we aim for components depending only on a small subset of parameters, we have to introduce another constraint to promote sparsity of Λi . [sent-175, score-0.183]
87 On the right, the firing rates of six dPCA components are displayed in three columns, separated into components with the highest variance in time (left), in the decision (middle) and in the stimulus (right). [sent-188, score-0.474]
88 6 Experimental results. The results of the dPCA algorithm applied to the electrophysiological data from the PFC are shown in Fig. [sent-193, score-0.08]
89 With 90% of the total variance in the first fourteen components, dPCA captures an amount of variance comparable to PCA (91. [sent-195, score-0.236]
90 The distribution of variances in the dPCA components is shown in Fig. [sent-197, score-0.149]
91 Note that, compared with the distribution in the PCA components (Fig. [sent-199, score-0.104]
92 1, bottom, center), the dPCA components clearly separate the different sources of variability. [sent-200, score-0.104]
93 More specifically, the neural population is dominated by components that only depend on time (blue), yet also features separate components for the monkey’s decision (green) and the perception of the stimulus (light blue). [sent-201, score-0.43]
94 The components of dPCA, of which the six most prominent are displayed in Fig. [sent-202, score-0.104]
95 4, right, therefore reflect and separate the parameter dependencies of the data, even though these dependencies were completely intermingled on the single neuron level (compare Fig. [sent-203, score-0.158]
96 Our study was motivated by the specific problems related to electrophysiological data sets. [sent-206, score-0.08]
97 The main aim of our method—demixing parameter dependencies of high-dimensional data sets—may be useful in other contexts as well. [sent-207, score-0.108]
98 Furthermore, the general aim of demixing dependencies could likely be extended to other methods (such as ICA) as well. [sent-209, score-0.283]
99 Ultimately, we see dPCA as a particular data visualization technique that will prove useful if a demixing of parameter dependencies aids in understanding data. [sent-210, score-0.239]
100 Timing and neural encoding of somatosensory parametric working memory in macaque prefrontal cortex. [sent-231, score-0.067]
wordName wordTfidf (topN-words)
[('dpca', 0.532), ('ytsd', 0.377), ('zn', 0.28), ('marginalized', 0.183), ('demixing', 0.175), ('pca', 0.169), ('demixed', 0.133), ('principal', 0.133), ('ring', 0.119), ('iq', 0.117), ('yn', 0.117), ('ppca', 0.111), ('components', 0.104), ('ln', 0.094), ('stimulus', 0.09), ('variance', 0.089), ('neurons', 0.089), ('yts', 0.089), ('latent', 0.083), ('diag', 0.08), ('electrophysiological', 0.08), ('ys', 0.08), ('yd', 0.08), ('monkey', 0.079), ('mixtures', 0.075), ('prefrontal', 0.067), ('demix', 0.066), ('pfc', 0.066), ('romo', 0.066), ('segregate', 0.066), ('ysd', 0.066), ('ytd', 0.066), ('dependencies', 0.064), ('id', 0.062), ('component', 0.061), ('yt', 0.06), ('fourteen', 0.058), ('pls', 0.058), ('orthogonality', 0.056), ('covariance', 0.054), ('wz', 0.054), ('decision', 0.052), ('population', 0.051), ('probabilistic', 0.051), ('axes', 0.05), ('averages', 0.047), ('zi', 0.046), ('variances', 0.045), ('maximize', 0.045), ('brody', 0.044), ('champalimaud', 0.044), ('ldpca', 0.044), ('machens', 0.044), ('segregated', 0.044), ('skewd', 0.044), ('tagged', 0.044), ('tsd', 0.044), ('aim', 0.044), ('blue', 0.043), ('axis', 0.039), ('lisbon', 0.039), ('center', 0.038), ('impose', 0.038), ('maximization', 0.036), ('tr', 0.036), ('recordings', 0.036), ('mexico', 0.036), ('portugal', 0.036), ('yellow', 0.035), ('diversity', 0.035), ('constraint', 0.035), ('interpretation', 0.035), ('rates', 0.035), ('hz', 0.034), ('sd', 0.034), ('objective', 0.034), ('orthogonal', 0.034), ('orthogonalization', 0.034), ('passage', 0.034), ('mix', 0.032), ('monkeys', 0.032), ('cca', 0.032), ('unitary', 0.032), ('eigenvectors', 0.031), ('fmri', 0.03), ('heterogeneity', 0.03), ('normale', 0.03), ('rieure', 0.03), ('succinct', 0.03), ('solutions', 0.03), ('labels', 0.03), ('completely', 0.03), ('light', 0.03), ('capture', 0.03), ('parameters', 0.03), ('depend', 0.029), ('rq', 0.029), ('neuroscience', 0.029), ('coordinate', 0.029), ('bottom', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
2 0.14379653 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
3 0.08458586 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
Author: Yan Karklin, Eero P. Simoncelli
Abstract: Efficient coding provides a powerful principle for explaining early sensory coding. Most attempts to test this principle have been limited to linear, noiseless models, and when applied to natural images, have yielded oriented filters consistent with responses in primary visual cortex. Here we show that an efficient coding model that incorporates biologically realistic ingredients – input and output noise, nonlinear response functions, and a metabolic cost on the firing rate – predicts receptive fields and response nonlinearities similar to those observed in the retina. Specifically, we develop numerical methods for simultaneously learning the linear filters and response nonlinearities of a population of model neurons, so as to maximize information transmission subject to metabolic costs. When applied to an ensemble of natural images, the method yields filters that are center-surround and nonlinearities that are rectifying. The filters are organized into two populations, with On- and Off-centers, which independently tile the visual space. As observed in the primate retina, the Off-center neurons are more numerous and have filters with smaller spatial extent. In the absence of noise, our method reduces to a generalized version of independent components analysis, with an adapted nonlinear “contrast” function; in this case, the optimal filters are localized and oriented.
4 0.083956935 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
5 0.082257457 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
Author: Kamiar R. Rad, Liam Paninski
Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.
6 0.082239196 302 nips-2011-Variational Learning for Recurrent Spiking Networks
7 0.080122478 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
8 0.0783692 86 nips-2011-Empirical models of spiking in neural populations
9 0.076821618 275 nips-2011-Structured Learning for Cell Tracking
10 0.075861119 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
11 0.074196547 75 nips-2011-Dynamical segmentation of single trials from population neural data
12 0.07190498 260 nips-2011-Sparse Features for PCA-Like Linear Regression
13 0.071045429 219 nips-2011-Predicting response time and error rates in visual search
14 0.070414014 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
15 0.066793486 301 nips-2011-Variational Gaussian Process Dynamical Systems
16 0.066591784 249 nips-2011-Sequence learning with hidden units in spiking neural networks
17 0.064343497 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
18 0.063216917 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
19 0.060696438 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis
20 0.060096476 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
topicId topicWeight
[(0, 0.168), (1, 0.044), (2, 0.121), (3, -0.064), (4, 0.002), (5, -0.006), (6, 0.044), (7, 0.01), (8, 0.026), (9, 0.043), (10, 0.036), (11, -0.014), (12, 0.026), (13, -0.021), (14, 0.045), (15, -0.047), (16, -0.108), (17, -0.036), (18, 0.039), (19, -0.037), (20, -0.032), (21, -0.072), (22, 0.052), (23, 0.076), (24, -0.013), (25, 0.026), (26, -0.009), (27, 0.096), (28, -0.028), (29, 0.008), (30, -0.094), (31, -0.058), (32, 0.068), (33, -0.05), (34, 0.062), (35, 0.094), (36, 0.024), (37, -0.028), (38, -0.158), (39, -0.045), (40, 0.136), (41, 0.011), (42, -0.035), (43, -0.042), (44, 0.062), (45, 0.015), (46, 0.029), (47, -0.07), (48, -0.096), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.93421465 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
2 0.59290838 75 nips-2011-Dynamical segmentation of single trials from population neural data
Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani
Abstract: Simultaneous recordings of many neurons embedded within a recurrentlyconnected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1
3 0.57649177 86 nips-2011-Empirical models of spiking in neural populations
Author: Jakob H. Macke, Lars Buesing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani
Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-offit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observations models are either derived from a Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts. 1
4 0.57609761 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
5 0.54561943 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
Author: Emin Orhan, Robert A. Jacobs
Abstract: Extensive evidence suggests that items are not encoded independently in visual short-term memory (VSTM). However, previous research has not quantitatively considered how the encoding of an item influences the encoding of other items. Here, we model the dependencies among VSTM representations using a multivariate Gaussian distribution with a stimulus-dependent mean and covariance matrix. We report the results of an experiment designed to determine the specific form of the stimulus-dependence of the mean and the covariance matrix. We find that the magnitude of the covariance between the representations of two items is a monotonically decreasing function of the difference between the items’ feature values, similar to a Gaussian process with a distance-dependent, stationary kernel function. We further show that this type of covariance function can be explained as a natural consequence of encoding multiple stimuli in a population of neurons with correlated responses. 1
6 0.54444396 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
7 0.51455808 260 nips-2011-Sparse Features for PCA-Like Linear Regression
8 0.50180227 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
9 0.48717567 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
10 0.47261915 301 nips-2011-Variational Gaussian Process Dynamical Systems
11 0.45621356 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm
12 0.43985513 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
13 0.43461916 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
14 0.43362364 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
15 0.41108033 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
16 0.41063747 288 nips-2011-Thinning Measurement Models and Questionnaire Design
17 0.40988928 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
18 0.40860617 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
19 0.3919915 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
20 0.38825074 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
topicId topicWeight
[(0, 0.015), (4, 0.035), (20, 0.026), (26, 0.034), (31, 0.097), (33, 0.011), (43, 0.061), (45, 0.107), (57, 0.044), (65, 0.016), (74, 0.34), (81, 0.043), (83, 0.055), (99, 0.032)]
simIndex simValue paperId paperTitle
1 0.98467565 218 nips-2011-Predicting Dynamic Difficulty
Author: Olana Missura, Thomas Gärtner
Abstract: Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither ‘too easy’ and bores the player, nor ‘too difficult’ and overburdens the player. The contributions of this paper are (i) the formulation of difficulty adjustment as an online learning problem on partially ordered sets, (ii) an exponential update algorithm for dynamic difficulty adjustment, (iii) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and (iv) an empirical investigation of the algorithm when playing against adversaries. 1
2 0.98108774 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
3 0.97407556 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
4 0.91523153 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
Author: Viren Jain, Srinivas C. Turaga, K Briggman, Moritz N. Helmstaedter, Winfried Denk, H. S. Seung
Abstract: An agglomerative clustering algorithm merges the most similar pair of clusters at every iteration. The function that evaluates similarity is traditionally handdesigned, but there has been recent interest in supervised or semisupervised settings in which ground-truth clustered data is available for training. Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state of the boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement. 1
same-paper 5 0.88701862 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
6 0.72494507 276 nips-2011-Structured sparse coding via lateral inhibition
7 0.71043485 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
8 0.70090145 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games
9 0.688959 158 nips-2011-Learning unbelievable probabilities
10 0.68678266 285 nips-2011-The Kernel Beta Process
11 0.68611491 258 nips-2011-Sparse Bayesian Multi-Task Learning
12 0.68363327 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
13 0.68085819 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
14 0.67344642 265 nips-2011-Sparse recovery by thresholded non-negative least squares
15 0.67088598 186 nips-2011-Noise Thresholds for Spectral Clustering
16 0.66254508 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
17 0.66163367 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
18 0.65993398 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
19 0.6566031 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
20 0.65425217 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression