nips nips2012 nips2012-365 knowledge-graph by maker-knowledge-mining

365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding


Source: pdf

Author: Philip Sterne, Joerg Bornschein, Abdul-saboor Sheikh, Joerg Luecke, Jacquelyn A. Shelton

Abstract: Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing low-level image components. This paper proposes a novel multiple-cause generative model of low-level image statistics that generalizes the standard SC model in two crucial points: (1) it uses a spike-and-slab prior distribution for a more realistic representation of component absence/intensity, and (2) the model uses the highly nonlinear combination rule of maximal causes analysis (MCA) instead of a linear combination. The major challenge is parameter optimization because a model with either (1) or (2) results in strongly multimodal posteriors. We show for the first time that a model combining both improvements can be trained efficiently while retaining the rich structure of the posteriors. We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding Jacquelyn A. [sent-2, score-0.494]

2 Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing low-level image components. [sent-8, score-0.379]

3 We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. [sent-12, score-0.271]

4 Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. [sent-14, score-0.312]

5 In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. [sent-15, score-0.223]

6 Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. [sent-17, score-0.329]

7 Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations. [sent-19, score-0.357]

8 It was first introduced as a model for the encoding of visual data in the primary visual cortex of mammals [1] and became the standard model to describe coding in simple cells. [sent-21, score-0.311]

9 More formally, sparse coding assumes that each observation y = (y_1, ..., y_D) is associated with a (continuous or discrete) sparse latent variable s = (s_1, ..., s_H), where sparsity implies that most of the components s_h in s are zero or close to zero. [sent-23, score-0.262]

12 Typically, p(y | s, Θ) is modelled as a Gaussian with a mean µ defined as µ = Σ_h s_h W_h, i.e. a linear combination of the generative fields. [sent-31, score-0.679]

13 The sparse coding generative model has remained essentially the same since its introduction, with most work focusing on efficient inference of optimal model parameters Θ (e. [sent-35, score-0.355]

14 First, it has been pointed out that visual components – such as edges – are either present or absent and this is poorly modelled with a Laplace prior because it lacks exact zeros. [sent-39, score-0.207]

15 Second, it has been pointed out that image components do not linearly superimpose to generate images, contrary to the standard sparse coding assumption. [sent-43, score-0.392]

16 Alternatively, various nonlinear combinations of visual components have been investigated [9, 10, 11, 12]. [sent-44, score-0.148]

17 Either modification (spike-and-slab prior or nonlinearities) leads to multimodal posteriors, making parameter optimization difficult. [sent-45, score-0.175]

18 For linear sparse coding with a spike-and-slab prior the challenge for learning has been overcome by applying factored variational EM approaches [13, 5] or sampling [6]. [sent-47, score-0.431]

19 Similarly, models with a nonlinear superposition of components could be trained efficiently by applying a truncated variational EM approach [14, 12], but only by avoiding the analytical intractability introduced by a continuous prior distribution. [sent-48, score-0.406]

20 In this work we propose a sparse coding model that for the first time combines both of these improvements – a spike-and-slab distribution and nonlinear combination of components – in order to form a more realistic model of images. [sent-49, score-0.36]

21 We address the optimization of our model by using a combined approximate inference approach with preselection of latents (for truncated variational EM [14]) in combination with Gibbs sampling [15]. [sent-50, score-0.379]

22 Second, using natural image patches we show the model yields results consistent with in vivo recordings and that the model passes a consistency check which standard SC does not. [sent-52, score-0.235]

23 The columns of the matrix W = (W_dh) are the generative fields W_h, one associated with each latent variable s_h. [sent-58, score-0.8]

24 We will be interested in working with the posterior over the latents, given by p(s | y, θ) = p(y | s, θ) p(s | θ) / ∫ p(y | s', θ) p(s' | θ) ds'.  (5) [sent-60, score-0.187]

25 Figure 1: Generation according to different sparse coding generative models using the same generative fields — panel columns: standard SC (sum), spike-and-slab SC (linear, sum), spike-and-slab SC (non-linear, max). [sent-61, score-0.48]

26 B Examples of patches generated according to three generative models all using the fields in A. [sent-63, score-0.146]

27 Top row: standard linear sparse coding with Laplace prior. [sent-64, score-0.262]

28 C A natural image with two patches highlighted (magnifications show their preprocessed form). [sent-69, score-0.194]

29 D Linear and nonlinear superposition of two single components for comparison with the actual superposition in C. [sent-70, score-0.24]

30 As in standard sparse coding, the model assumes independent latents and given the latent variables, the observations are distributed according to a Gaussian distribution. [sent-71, score-0.258]

31 Unlike standard sparse coding, the latent variables are not distributed according to a Laplace prior and the generative fields (or basis functions) are not combined linearly. [sent-72, score-0.365]

32 Fig. 1 illustrates the model differences between a Laplace prior and a spike-and-slab prior and the differences between linear and nonlinear superposition. [sent-74, score-0.277]
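
To make the difference from linear SC concrete, the following is a minimal sketch of drawing one patch from a spike-and-slab / max (MCA-style) generative model. The function name, parameter names and default values (pi, mu_s, sigma_s, sigma_y) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sample_patch(W, pi=0.025, mu_s=1.0, sigma_s=0.3, sigma_y=0.1, rng=None):
    """Draw one observation from a spike-and-slab prior combined through a pointwise max.

    W : (H, D) array of generative fields W_h (one row per latent variable).
    """
    rng = rng or np.random.default_rng()
    H, _ = W.shape
    active = rng.random(H) < pi                              # spike: which causes are present
    s = np.where(active, rng.normal(mu_s, sigma_s, H), 0.0)  # slab: intensities of active causes
    mean = np.max(s[:, None] * W, axis=0)                    # nonlinear combination: max_h s_h * W_h
    return rng.normal(mean, sigma_y)                         # Gaussian observation noise
```

Replacing the max over h with a sum recovers the linear spike-and-slab variant used for comparison in Fig. 1.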

33 As can be observed, standard sparse coding results in strong interference when basis functions overlap. [sent-75, score-0.324]

34 For spike-and-slab sparse coding most components are exactly zero but interference between them remains strong because of their linear superposition. [sent-76, score-0.346]

35 Combining a spike-and-slab prior with nonlinear composition allows minimal interference between the bases and ensures that latents can be exactly zero, which creates very multimodal posteriors since data must be explained by either one cause or another. [sent-77, score-0.433]

36 Linear and nonlinear superposition of two basis functions resembling single components is shown in Fig. 1D. [sent-80, score-0.192]

37 In this paper we use expectation maximization (EM) to estimate the model parameters Θ, and we use sampling after latent preselection [15] to represent the posterior distribution over the latent space. [sent-83, score-0.343]

38 As an example we obtain the following formula for the estimate of image noise: σ̂² = 1/(N D K) Σ_n Σ_d Σ_k ( max_h ( W_{dh} s_h^{(n,k)} ) − y_d^{(n)} )²,  (6) where we average over all N observed data points, D observed dimensions, and K Gibbs samples. [sent-87, score-0.308]

39 As such we will use the following notation: σ̂² = ⟨ ( W_{dh} s_h − y_d^{(n)} )² ⟩*,  (7) where we maximize over h and average over n and d. [sent-89, score-0.902]
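
A sketch of how the Eq. (6) noise update could be computed from stored Gibbs samples; the array layout and function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_sigma2(W, samples, Y):
    """Noise estimate in the spirit of Eq. (6): mean squared residual between the
    max-combined prediction and the data, averaged over N data points, D observed
    dimensions and K Gibbs samples.

    W       : (H, D) generative fields.
    samples : (N, K, H) latent Gibbs samples s^(n, k).
    Y       : (N, D) observed data points.
    """
    N, K, _ = samples.shape
    D = W.shape[1]
    pred = np.max(samples[..., None] * W[None, None], axis=2)  # (N, K, D): max_h W_dh * s_h
    resid = pred - Y[:, None, :]
    return float(np.sum(resid ** 2) / (N * D * K))
```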

40 As discussed however, the posterior distribution of a model with a spike-and-slab prior in both the linear and nonlinear cases is strongly multimodal and such posteriors are difficult to infer and represent. [sent-94, score-0.347]

41 Specifically we do exact Gibbs sampling from the posterior after we have preselected the most relevant set of latent states using a truncated variational form of EM. [sent-100, score-0.214]

42 As such, we will first describe the sampling step and preselection only later. [sent-102, score-0.154]

43 Previous work has used Gibbs sampling in combination with spike-and-slab models [17], and for increased efficiency in sparse Bayesian inference [18]. [sent-105, score-0.141]

44 C Log prior, which consists of an overall Gaussian and the Dirac peak at s_h = 0. [sent-113, score-0.679]

45 D Log posterior, the sum of the functions in A, B, and C; it consists of D + 1 pieces plus the Dirac peak at s_h = 0. [sent-114, score-0.734]

46 F CDF for s_h, from which we do inverse transform sampling. [sent-116, score-0.679]

47 Eq. 13 gives the left piece of the function when s_h < P_d and the right piece when s_h ≥ P_d. [sent-118, score-1.42]

48 s_h, because the data is explained by another cause when s_h < P_d, and the right piece is a truncated Gaussian when considered as a PDF of s_h (see Fig. [sent-122, score-2.091]

49 The Gaussian slab of the prior is taken into account by adding its 2nd degree polynomial to all the pieces m_i(s_h), which also ensures that every piece is a Gaussian. [sent-130, score-0.304]

50 Next, the Bernoulli component of the prior is accounted for by introducing the appropriate step into the CDF at sh = 0 (see Fig. [sent-132, score-0.791]

51 Once the CDF is constructed, we simulate each s_h from the exact conditional distribution s_h ∼ p(s_h | s_{\h}, y, θ) by inverse transform sampling. [sent-134, score-0.679]
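
A simplified numerical sketch of this sampling step: inverse-transform sampling from a one-dimensional conditional that mixes a point mass at zero with a continuous part. The paper builds the continuous CDF exactly from D + 1 Gaussian pieces; the grid-based version below is only a stand-in for that exact construction.

```python
import numpy as np

def sample_conditional(log_density, p_spike, lo=-5.0, hi=5.0, n=2001, rng=None):
    """Inverse-transform sample from a mixture of a point mass at 0 (weight p_spike)
    and a continuous part given by an unnormalized, vectorized log_density on [lo, hi]."""
    rng = rng or np.random.default_rng()
    xs = np.linspace(lo, hi, n)
    logq = log_density(xs)
    q = np.exp(logq - logq.max())          # stabilize before exponentiating
    cdf = np.cumsum(q)
    cdf /= cdf[-1]                         # grid approximation of the continuous CDF
    u = rng.random()
    if u < p_spike:                        # the Bernoulli step of the CDF at s_h = 0
        return 0.0
    u = (u - p_spike) / (1.0 - p_spike)    # remaining mass belongs to the continuous part
    return float(np.interp(u, cdf, xs))    # invert the CDF
```

Calling this once per latent h, with log_density set to the current piecewise log-posterior of s_h given the other latents, yields one sweep of the Gibbs sampler.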

52 We define K_n as K_n = {s | s_h = 0 for all h ∉ I}, where I contains the indices of the latents estimated to be most relevant for y^(n). [sent-143, score-0.793]

53 To obtain these latent indices we use a selection function of the form S_h(y^(n)) = ‖W_h − y^(n)‖_2^2 / ‖W_h‖_2^2  (17) to select the H' < H highest-scoring latents for I. [sent-144, score-0.172]
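
A sketch of the preselection step under the assumption that the selection function scores each latent by a normalized squared distance between its field and the data point; the exact form and sign convention of Eq. (17) may differ.

```python
import numpy as np

def preselect(W, y, H_prime):
    """Return the indices I of the H' latents whose fields best match the data point y.

    W : (H, D) generative fields, y : (D,) data point.
    The score (negative normalized squared distance) is an assumed stand-in for Eq. (17).
    """
    scores = -np.sum((W - y[None, :]) ** 2, axis=1) / (np.sum(W ** 2, axis=1) + 1e-12)
    return np.argsort(scores)[-H_prime:]   # the H' highest-scoring latents
```

All other latents are then clamped to zero, which defines the truncated state space K_n explored by the sampler.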

54 Second, we apply our model to natural image patches and compare with in vivo recordings from various sources. [sent-163, score-0.235]

55 We applied our model to N = 50,000 image patches of 16 × 16 pixels. [sent-180, score-0.168]

56 The patches were extracted from the Van Hateren natural image database [20] and subsequently preprocessed using pseudo-whitening [1]. [sent-181, score-0.194]

57 We split the image patches into a positive and a negative channel to ensure y_d ≥ 0: each image patch ỹ of size D̃ = 16 × 16 is converted into a datapoint y of size D = 2 D̃ by assigning y_d = [ỹ_d]_+ and y_{D̃+d} = [−ỹ_d]_+, where [x]_+ = x for x > 0 and [x]_+ = 0 otherwise. [sent-182, score-0.73]

58 In a final step, as a form of local contrast normalization, we scaled each image patch so that 0 ≤ y_d ≤ 10. [sent-184, score-0.308]
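
A sketch of the two preprocessing steps just described (positive/negative channel split and rescaling); the pseudo-whitening [1] is assumed to have been applied to the patch beforehand, and the function name is ours.

```python
import numpy as np

def to_nonnegative_datapoint(patch, max_val=10.0):
    """Split a whitened patch y~ (e.g. 16 x 16) into positive and negative channels,
    y_d = [y~_d]_+ and y_{D~+d} = [-y~_d]_+, then rescale so that 0 <= y_d <= max_val."""
    y_tilde = np.asarray(patch, dtype=float).ravel()
    y = np.concatenate([np.clip(y_tilde, 0.0, None), np.clip(-y_tilde, 0.0, None)])
    peak = y.max()
    return y * (max_val / peak) if peak > 0 else y
```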

59 6.2, which means that an average of roughly six latent variables were active in every image patch. [sent-194, score-0.143]

60 Figure 5: Results after training our model on N = 50,000 image patches of size 16 × 16 using H = 500 latent units. [sent-250, score-0.226]

61 The fraction of globular fields measured in vivo is shown for comparison. [sent-254, score-0.309]

62 D Visualization of the prior inferred by our model: on average πH = 6.2. [sent-260, score-0.153]

63 The bimodal pattern closely resembles the prior activation inferred in D. [sent-263, score-0.264]

64 perform reverse correlation on the learned generative fields and fit the resulting estimated receptive fields with Gabor wavelets and DoGs (see Supp. [sent-264, score-0.18]
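
A generic sketch of estimating a receptive field by reverse correlation (response-weighted averaging of white-noise stimuli); the authors' exact procedure and the subsequent Gabor/DoG fitting are described in their supplement, so everything below is a stand-in.

```python
import numpy as np

def reverse_correlation(response_fn, shape=(16, 16), n_stimuli=20000, rng=None):
    """Estimate a receptive field as the response-weighted average of white-noise stimuli.

    response_fn(stimulus) is assumed to return a non-negative scalar response.
    """
    rng = rng or np.random.default_rng()
    rf = np.zeros(shape)
    total = 0.0
    for _ in range(n_stimuli):
        stim = rng.standard_normal(shape)
        r = response_fn(stim)
        rf += r * stim
        total += r
    return rf / max(total, 1e-12)
```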

65 Notably, the proportion of globular fields predicted by the model (Fig. [sent-269, score-0.242]

66 Fig. 5D-E compares the optimal prior distribution with the average posterior distribution for several latent variables (with their associated generative fields shown in insets). [sent-272, score-0.306]

67 lim_{N→∞} (1/N) Σ_n p(s | y^(n), Θ) = p(s | Θ).  (18) Our model satisfies this condition; the average posterior over these fields closely resembles the optimal prior, which is a test that standard sparse coding fails (see [17] for a discussion). [sent-276, score-0.335]
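
A sketch of this consistency check: histogram a latent's values under the prior and under the posterior averaged over data points and compare the two, in the spirit of Eq. (18). Inputs are assumed to be Monte Carlo samples, e.g. from the Gibbs chains.

```python
import numpy as np

def average_posterior_vs_prior(prior_samples, posterior_samples, bins=50):
    """Return shared bin edges and normalized histograms of the prior and of the
    data-averaged posterior for one latent, so the two can be compared directly.

    prior_samples     : (M,) samples of s_h drawn from the prior.
    posterior_samples : (N, K) Gibbs samples of s_h, one row per data point.
    """
    post = np.asarray(posterior_samples, dtype=float).ravel()
    prior = np.asarray(prior_samples, dtype=float).ravel()
    edges = np.linspace(min(prior.min(), post.min()), max(prior.max(), post.max()), bins + 1)
    p_prior, _ = np.histogram(prior, bins=edges, density=True)
    p_post, _ = np.histogram(post, bins=edges, density=True)
    return edges, p_prior, p_post
```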

68 We also apply our model to the task of image inpainting and image denoising. [sent-278, score-0.17]

69 In this work, we defined and studied a sparse coding model that, for the first time, combines a spike-and-slab prior with a nonlinear combination of dictionary elements. [sent-283, score-0.474]

70 To address the optimization of our model, we designed an exact piecewise Gibbs sampling method combined with a variational method based on preselection of latent dimensions. [sent-284, score-0.294]

71 The learning algorithm derived for the model enables the efficient inference of all model parameters including sparsity and prior parameters. [sent-286, score-0.176]

72 The spike-and-slab prior used in this study can parameterize prior distributions which are symmetric and unimodal (spike on top of the Gaussian) as well as strongly bimodal distributions with the Gaussian mean being significantly different from zero. [sent-287, score-0.357]

73 However, inferring the correct prior distribution requires sophisticated inference and learning schemes. [sent-288, score-0.142]

74 Standard sparse coding with MAP-based approximation only optimizes the basis functions [25, 4]. [sent-289, score-0.285]

75 Namely, the prior shape remains fixed except for its weighting factor (the regularization parameter) which is typically only inferred indirectly (if at all) using cross-validation. [sent-290, score-0.153]

76 Very few sparse coding approaches infer prior parameters directly. [sent-291, score-0.374]

77 The MoG prior can model multimodality but in numerical experiments on image patches the mixture components were observed to converge to a monomodal prior – which may be caused by the assumed linear superposition or by the Gibbs sampler not mixing sufficiently. [sent-293, score-0.531]

78 When the MoG prior was fixed to be trimodal, no instructive generative fields were observed [17]. [sent-294, score-0.175]

79 Another example of sparse coding with prior inference is a more recent approach which uses a parameterized student-t distribution as prior and applies sampling to infer the sparsity [26]. [sent-295, score-0.575]

80 The work in [27] uses a trimodal prior for image patches but shape and sparsity remain fixed, i.e. they are not inferred from the data. [sent-297, score-0.346]

81 In contrast, we have shown in this study that the prior shape and sparsity level can be inferred from image data. [sent-300, score-0.272]

82 The resulting prior is strongly bimodal and control experiments confirm a high consistency of the prior with the average posterior (Fig. [sent-301, score-0.43]

83 Standard sparse coding approaches typically fail in such controls which may be taken as early evidence for bimodal or multimodal priors being more optimal (see [17]). [sent-303, score-0.436]

84 Together with a bimodal prior, our model infers Gabor and difference-of-Gaussian (DoG) functions as the optimal basis functions for the used image patches. [sent-304, score-0.219]

85 While Gabors are the standard outcome of sparse coding, DoGs have not been predicted by sparse coding until very recently. [sent-305, score-0.348]

86 A number of studies have since shown that globular fields can emerge in applications of computational models to image patches [28, 27, 29, 30, 31, 12, 32]. [sent-307, score-0.436]

87 One study [29] has shown that globular fields can be obtained with standard sparse coding by choosing specific values for overcompleteness and sparsity (i.e., prior shape and sparsity are not inferred from data). [sent-308, score-0.563]

89 The studies [27, 31, 32] assume a restricted set of values for latent variables and yield a relatively high proportion of globular fields, suggesting that the emergence of globular fields is due to hard constraints on the latents. [sent-311, score-0.594]

90 On the other hand, the studies [28, 30, 12] suggest that globular fields are a consequence of occlusion nonlinearities. [sent-312, score-0.29]

91 Our study argues in favor of the occlusion interpretation for the emergence of globular fields because the model studied here shows that high percentages of globular fields emerge with a prior that is (a) inferred from data and (b) allows for a continuous distribution of latent values. [sent-313, score-0.782]

92 In summary, the main results obtained by applying the novel model to preprocessed images are: (1) the observation that a bimodal prior is preferred over a unimodal one for optimal image coding, and (2) that high percentages of globular fields are predicted. [sent-314, score-0.615]

93 The sparse bimodal prior is consistent with sparse and positive neural activity for the encoding of image components in V1, and the high percentage of globular fields is consistent with recent in vivo recordings of simple cells. [sent-315, score-0.869]

94 Emergence of simple-cell receptive field properties by learning a sparse code for natural images. [sent-323, score-0.173]

95 Spike and slab variational inference for multi-task and multiple kernel learning. [sent-346, score-0.141]

96 Closed-form EM for sparse coding and its application to source separation. [sent-361, score-0.262]

97 Expectation truncation and the benefits of preselection in training generative models. [sent-394, score-0.192]

98 Sparse coding with an overcomplete basis set: A strategy employed by V1? [sent-464, score-0.199]

99 A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. [sent-475, score-0.214]

100 Non-parametric Bayesian dictionary learning for sparse image representations. [sent-518, score-0.218]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sh', 0.679), ('globular', 0.242), ('yd', 0.223), ('wdh', 0.21), ('coding', 0.176), ('elds', 0.159), ('preselection', 0.129), ('latents', 0.114), ('prior', 0.112), ('bimodal', 0.111), ('pd', 0.106), ('pr', 0.106), ('frankfurt', 0.097), ('wh', 0.09), ('receptive', 0.087), ('sparse', 0.086), ('image', 0.085), ('kn', 0.084), ('sc', 0.084), ('patches', 0.083), ('slab', 0.079), ('bornschein', 0.079), ('gibbs', 0.074), ('posterior', 0.073), ('superposition', 0.071), ('vivo', 0.067), ('dog', 0.067), ('emsteps', 0.065), ('generative', 0.063), ('multimodal', 0.063), ('latent', 0.058), ('bh', 0.056), ('pieces', 0.055), ('gabor', 0.055), ('nonlinear', 0.053), ('em', 0.053), ('cdf', 0.051), ('visual', 0.05), ('cke', 0.048), ('mog', 0.048), ('zh', 0.048), ('dictionary', 0.047), ('components', 0.045), ('intractability', 0.045), ('rg', 0.043), ('cells', 0.042), ('inferred', 0.041), ('sheikh', 0.039), ('percentages', 0.039), ('interference', 0.039), ('laplace', 0.037), ('lncs', 0.037), ('olshausen', 0.036), ('spike', 0.036), ('gt', 0.036), ('shelton', 0.035), ('encoding', 0.035), ('sparsity', 0.034), ('variational', 0.032), ('mca', 0.032), ('niell', 0.032), ('realistically', 0.032), ('ringach', 0.032), ('sterne', 0.032), ('trimodal', 0.032), ('whd', 0.032), ('datapoint', 0.031), ('berkes', 0.031), ('piece', 0.031), ('wavelets', 0.03), ('inference', 0.03), ('ds', 0.029), ('usrey', 0.029), ('gabors', 0.029), ('puertas', 0.029), ('cortical', 0.028), ('cause', 0.028), ('mi', 0.027), ('piecewise', 0.027), ('truncated', 0.026), ('neurones', 0.026), ('preprocessed', 0.026), ('emergence', 0.026), ('studies', 0.026), ('sampling', 0.025), ('overcompleteness', 0.025), ('goodfellow', 0.025), ('hateren', 0.025), ('posteriors', 0.024), ('multimodality', 0.023), ('courville', 0.023), ('nx', 0.023), ('tackles', 0.023), ('dogs', 0.023), ('basis', 0.023), ('shapes', 0.023), ('combined', 0.023), ('analytical', 0.022), ('strongly', 0.022), ('occlusion', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

Author: Philip Sterne, Joerg Bornschein, Abdul-saboor Sheikh, Joerg Luecke, Jacquelyn A. Shelton

Abstract: Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing lowlevel image components. This paper proposes a novel multiple-cause generative model of low-level image statistics that generalizes the standard SC model in two crucial points: (1) it uses a spike-and-slab prior distribution for a more realistic representation of component absence/intensity, and (2) the model uses the highly nonlinear combination rule of maximal causes analysis (MCA) instead of a linear combination. The major challenge is parameter optimization because a model with either (1) or (2) results in strongly multimodal posteriors. We show for the first time that a model combining both improvements can be trained efficiently while retaining the rich structure of the posteriors. We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations. 1

2 0.14439751 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams

Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1

3 0.1130484 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

4 0.10468809 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

5 0.095145583 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

6 0.075555004 23 nips-2012-A lattice filter model of the visual pathway

7 0.074589603 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

8 0.074387617 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

9 0.074364699 163 nips-2012-Isotropic Hashing

10 0.073177129 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.072607711 62 nips-2012-Burn-in, bias, and the rationality of anchoring

12 0.072607711 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

13 0.071899027 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

14 0.070640959 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

15 0.070575885 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

16 0.067561232 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

17 0.067036919 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

18 0.063218735 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

19 0.058473043 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

20 0.056103017 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, 0.051), (2, -0.108), (3, 0.061), (4, -0.058), (5, 0.024), (6, 0.003), (7, -0.006), (8, 0.07), (9, -0.017), (10, 0.004), (11, -0.012), (12, -0.037), (13, 0.043), (14, -0.03), (15, -0.027), (16, 0.022), (17, -0.03), (18, -0.038), (19, -0.019), (20, 0.026), (21, 0.062), (22, -0.027), (23, 0.061), (24, 0.008), (25, 0.082), (26, -0.009), (27, -0.005), (28, -0.02), (29, 0.076), (30, -0.082), (31, -0.034), (32, 0.035), (33, -0.075), (34, 0.03), (35, -0.067), (36, -0.014), (37, -0.065), (38, 0.082), (39, -0.047), (40, -0.0), (41, 0.062), (42, 0.057), (43, -0.086), (44, -0.06), (45, 0.114), (46, -0.048), (47, -0.04), (48, 0.015), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93190485 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

Author: Philip Sterne, Joerg Bornschein, Abdul-saboor Sheikh, Joerg Luecke, Jacquelyn A. Shelton

Abstract: Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing lowlevel image components. This paper proposes a novel multiple-cause generative model of low-level image statistics that generalizes the standard SC model in two crucial points: (1) it uses a spike-and-slab prior distribution for a more realistic representation of component absence/intensity, and (2) the model uses the highly nonlinear combination rule of maximal causes analysis (MCA) instead of a linear combination. The major challenge is parameter optimization because a model with either (1) or (2) results in strongly multimodal posteriors. We show for the first time that a model combining both improvements can be trained efficiently while retaining the rich structure of the posteriors. We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations. 1

2 0.66355866 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

Author: Daniel Zoran, Yair Weiss

Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of number of mixture components including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models.

3 0.62987727 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

4 0.60730112 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

Author: Hiroki Terashima, Masato Okada

Abstract: The computational modelling of the primary auditory cortex (A1) has been less fruitful than that of the primary visual cortex (V1) due to the less organized properties of A1. Greater disorder has recently been demonstrated for the tonotopy of A1 that has traditionally been considered to be as ordered as the retinotopy of V1. This disorder appears to be incongruous, given the uniformity of the neocortex; however, we hypothesized that both A1 and V1 would adopt an efficient coding strategy and that the disorder in A1 reflects natural sound statistics. To provide a computational model of the tonotopic disorder in A1, we used a model that was originally proposed for the smooth V1 map. In contrast to natural images, natural sounds exhibit distant correlations, which were learned and reflected in the disordered map. The auditory model predicted harmonic relationships among neighbouring A1 cells; furthermore, the same mechanism used to model V1 complex cells reproduced nonlinear responses similar to the pitch selectivity. These results contribute to the understanding of the sensory cortices of different modalities in a novel and integrated manner.

5 0.60052216 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

Author: Xue-xin Wei, Alan Stocker

Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference.
However, we agree that finding appropriate constraints is generally difficult and that prior beliefs and likelihood functions have been often selected on the basis of mathematical convenience. Here, we propose that the efficient coding hypothesis [19] offers a joint constraint on the prior and likelihood function in neural implementations of Bayesian inference. Efficient coding provides a normative description of how neurons encode sensory information, and suggests a direct link between measured perceptual discriminability, neural tuning characteristics, and environmental statistics [11]. We show how this link can be extended to a full Bayesian account of perception that includes perceptual biases. We validate our model framework against behavioral as well as neural data characterizing the perception of visual orientation. We demonstrate that we can account not only for the reported perceptual biases away from the cardinal orientations, but also for the specific response characteristics of orientation-tuned neurons in primary visual cortex. Our work is a novel proposal of how two important normative hypotheses in perception science, namely efficient (en)coding and Bayesian decoding, might be linked. 2 Encoding-decoding framework We consider perception as an inference process that takes place along the simplified neural encodingdecoding cascade illustrated in Fig. 11 . 2.1 Efficient encoding Efficient encoding proposes that the tuning characteristics of a neural population are adapted to the prior distribution p(θ) of the sensory variable such that the population optimally represents the sensory variable [19]. Different definitions of “optimally” are possible, and may lead to different results. Here, we assume an efficient representation that maximizes the mutual information between the sensory variable and the population response. With this definition and an upper limit on the total firing activity, the square-root of the Fisher Information must be proportional to the prior distribution [12, 21]. In order to constrain the tuning curves of individual neurons in the population we also impose a homogeneity constraint, requiring that there exists a one-to-one mapping F (θ) that transforms the ˜ physical space with units θ to a homogeneous space with units θ = F (θ) in which the stimulus distribution becomes uniform. This defines the mapping as θ F (θ) = p(χ)dχ , (1) −∞ which is the cumulative of the prior distribution p(θ). We then assume a neural population with identical tuning curves that evenly tiles the stimulus range in this homogeneous space. The population provides an efficient representation of the sensory variable θ according to the above constraints [11]. ˜ The tuning curves in the physical space are obtained by applying the inverse mapping F −1 (θ). Fig. 2 1 In the context of this paper, we consider ‘inferring’, ‘decoding’, and ‘estimating’ as synonymous. 2 stimulus distribution d samples # a Fisher information discriminability and average firing rates and b firing rate [ Hz] efficient encoding F likelihood function F -1 likelihood c symmetric asymmetric homogeneous space physical space Figure 2: Efficient encoding constrains the likelihood function. a) Prior distribution p(θ) derived from stimulus statistics. b) Efficient coding defines the shape of the tuning curves in the physical space by transforming a set of homogeneous neurons using a mapping F −1 that is the inverse of the cumulative of the prior p(θ) (see Eq. (1)). 
c) As a result, the likelihood shape is constrained by the prior distribution showing heavier tails on the side of lower prior density. d) Fisher information, discrimination threshold, and average firing rates are all uniform in the homogeneous space. illustrates the applied efficient encoding scheme, the mapping, and the concept of the homogeneous space for the example of a symmetric, exponentially decaying prior distribution p(θ). The key idea here is that by assuming efficient encoding, the prior (i.e. the stimulus distribution in the world) directly constrains the likelihood function. In particular, the shape of the likelihood is determined by the cumulative distribution of the prior. As a result, the likelihood is generally asymmetric, as shown in Fig. 2, exhibiting heavier tails on the side of the prior with lower density. 2.2 Bayesian decoding Let us consider a population of N sensory neurons that efficiently represents a stimulus variable θ as described above. A stimulus θ0 elicits a specific population response that is characterized by the vector R = [r1 , r2 , ..., rN ] where ri is the spike-count of the ith neuron over a given time-window τ . Under the assumption that the variability in the individual firing rates is governed by a Poisson process, we can write the likelihood function over θ as N p(R|θ) = (τ fi (θ))ri −τ fi (θ) e , ri ! i=1 (2) ˆ with fi (θ) describing the tuning curve of neuron i. We then define a Bayesian decoder θLSE as the estimator that minimizes the expected squared-error between the estimate and the true stimulus value, thus θp(R|θ)p(θ)dθ ˆ θLSE (R) = , (3) p(R|θ)p(θ)dθ where we use Bayes’ rule to appropriately combine the sensory evidence with the stimulus prior p(θ). 3 Bayesian estimates can be biased away from prior peaks Bayesian models of perception typically predict perceptual biases toward the peaks of the prior density, a characteristic often considered a hallmark of Bayesian inference. This originates from the 3 a b prior attraction prior prior attraction likelihood repulsion! likelihood c prior prior repulsive bias likelihood likelihood mean posterior mean posterior mean Figure 3: Bayesian estimates biased away from the prior. a) If the likelihood function is symmetric, then the estimate (posterior mean) is, on average, shifted away from the actual value of the sensory variable θ0 towards the prior peak. b) Efficient encoding typically leads to an asymmetric likelihood function whose normalized mean is away from the peak of the prior (relative to θ0 ). The estimate is determined by a combination of prior attraction and shifted likelihood mean, and can exhibit an overall repulsive bias. c) If p(θ0 ) < 0 and the likelihood is relatively narrow, then (1/p(θ)2 ) > 0 (blue line) and the estimate is biased away from the prior peak (see Eq. (6)). common approach of choosing a parametric description of the likelihood function that is computationally convenient (e.g. Gaussian). As a consequence, likelihood functions are typically assumed to be symmetric (but see [23, 24]), leaving the bias of the Bayesian estimator to be mainly determined by the shape of the prior density, i.e. leading to biases toward the peak of the prior (Fig. 3a). In our model framework, the shape of the likelihood function is constrained by the stimulus prior via efficient neural encoding, and is generally not symmetric for non-flat priors. It has a heavier tail on the side with lower prior density (Fig. 3b). 
The intuition is that due to the efficient allocation of neural resources, the side with smaller prior density will be encoded less accurately, leading to a broader likelihood function on that side. The likelihood asymmetry pulls the Bayes’ least-squares estimate away from the peak of the prior while at the same time the prior pulls it toward its peak. Thus, the resulting estimation bias is the combination of these two counter-acting forces - and both are determined by the prior! 3.1 General derivation of the estimation bias In the following, we will formally derive the mean estimation bias b(θ) of the proposed encodingdecoding framework. Specifically, we will study the conditions for which the bias is repulsive i.e. away from the peak of the prior density. ˆ We first re-write the estimator θLSE (3) by replacing θ with the inverse of its mapping to the homo−1 ˜ geneous space, i.e., θ = F (θ). The motivation for this is that the likelihood in the homogeneous space is symmetric (Fig. 2). Given a value θ0 and the elicited population response R, we can write the estimator as ˜ ˜ ˜ ˜ θp(R|θ)p(θ)dθ F −1 (θ)p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) ˆ θLSE (R) = = . ˜ ˜ ˜ p(R|θ)p(θ)dθ p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) Calculating the derivative of the inverse function and noting that F is the cumulative of the prior density, we get 1 1 1 ˜ ˜ ˜ ˜ ˜ ˜ dθ = dθ. dF −1 (θ) = (F −1 (θ)) dθ = dθ = −1 (θ)) ˜ F (θ) p(θ) p(F ˆ Hence, we can simplify θLSE (R) as ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)p(R|F −1 (θ))dθ . ˜ ˜ p(R|F −1 (θ))dθ With ˜ K(R, θ) = ˜ p(R|F −1 (θ)) ˜ ˜ p(R|F −1 (θ))dθ 4 we can further simplify the notation and get ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)K(R, θ)dθ . (4) ˆ ˜ In order to get the expected value of the estimate, θLSE (θ), we marginalize (4) over the population response space S, ˆ ˜ ˜ ˜ ˜ θLSE (θ) = p(R)F −1 (θ)K(R, θ)dθdR S = F −1 ˜ (θ)( ˜ ˜ p(R)K(R, θ)dR)dθ = ˜ ˜ ˜ F −1 (θ)L(θ)dθ, S where we define ˜ L(θ) = ˜ p(R)K(R, θ)dR. S ˜ ˜ ˜ It follows that L(θ)dθ = 1. Due to the symmetry in this space, it can be shown that L(θ) is ˜0 . Intuitively, L(θ) can be thought as the normalized ˜ symmetric around the true stimulus value θ average likelihood in the homogeneous space. We can then compute the expected bias at θ0 as b(θ0 ) = ˜ ˜ ˜ ˜ F −1 (θ)L(θ)dθ − F −1 (θ0 ) (5) ˜ This is expression is general where F −1 (θ) is defined as the inverse of the cumulative of an arbitrary ˜ prior density p(θ) (see Eq. (1)) and the dispersion of L(θ) is determined by the internal noise level. ˜ ˜ Assuming the prior density to be smooth, we expand F −1 in a neighborhood (θ0 − h, θ0 + h) that is larger than the support of the likelihood function. Using Taylor’s theorem with mean-value forms of the remainder, we get 1 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ F −1 (θ) = F −1 (θ0 ) + F −1 (θ0 ) (θ − θ0 ) + F −1 (θx ) (θ − θ0 )2 , 2 ˜ ˜ ˜ with θx lying between θ0 and θ. By applying this expression to (5), we find ˜ θ0 +h b(θ0 ) = = 1 2 ˜ θ0 −h 1 −1 ˜ ˜ ˜ ˜ ˜ 1 F (θx )θ (θ − θ0 )2 L(θ)dθ = ˜ 2 2 ˜ θ0 +h −( ˜ θ0 −h p(θx )θ ˜ ˜ 2 ˜ ˜ 1 )(θ − θ0 ) L(θ)dθ = p(θx )3 4 ˜ θ0 +h 1 ˜ − θ0 )2 L(θ)dθ ˜ ˜ ˜ ( ) ˜(θ ˜ p(F −1 (θx )) θ ( 1 ˜ ˜ ˜ ˜ ) (θ − θ0 )2 L(θ)dθ. p(θx )2 θ ˜ θ0 −h ˜ θ0 +h ˜ θ0 −h In general, there is no simple rule to judge the sign of b(θ0 ). However, if the prior is monotonic ˜ ˜ on the interval F −1 ((θ0 − h, θ0 + h)), then the sign of ( p(θ1 )2 ) is always the same as the sign of x 1 1 ( p(θ0 )2 ) . 
Also, if the likelihood is sufficiently narrow we can approximate ( p(θ1 )2 ) by ( p(θ0 )2 ) , x and therefore approximate the bias as b(θ0 ) ≈ C( 1 ) , p(θ0 )2 (6) where C is a positive constant. The result is quite surprising because it states that as long as the prior is monotonic over the support of the likelihood function, the expected estimation bias is always away from the peaks of the prior! 3.2 Internal (neural) versus external (stimulus) noise The above derivation of estimation bias is based on the assumption that all uncertainty about the sensory variable is caused by neural response variability. This level of internal noise depends on the response magnitude, and thus can be modulated e.g. by changing stimulus contrast. This contrastcontrolled noise modulation is commonly exploited in perceptual studies (e.g. [18]). Internal noise will always lead to repulsive biases in our framework if the prior is monotonic. If internal noise is low, the likelihood is narrow and thus the bias is small. Increasing internal noise leads to increasingly 5 larger biases up to the point where the likelihood becomes wide enough such that monotonicity of the prior over the support of the likelihood is potentially violated. Stimulus noise is another way to modulate the noise level in perception (e.g. random-dot motion stimuli). Such external noise, however, has a different effect on the shape of the likelihood function as compared to internal noise. It modifies the likelihood function (2) by convolving it with the noise kernel. External noise is frequently chosen as additive and symmetric (e.g. zero-mean Gaussian). It is straightforward to prove that such symmetric external noise does not lead to a change in the mean of the likelihood, and thus does not alter the repulsive effect induced by its asymmetry. However, by increasing the overall width of the likelihood, the attractive influence of the prior increases, resulting in an estimate that is closer to the prior peak than without external noise2 . 4 Perception of visual orientation We tested our framework by modelling the perception of visual orientation. Our choice was based on the fact that i) we have pretty good estimates of the prior distribution of local orientations in natural images, ii) tuning characteristics of orientation selective neurons in visual cortex are wellstudied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with reported data in the literature. 4.1 Efficient neural model population for visual orientation Previous studies measured the statistics of the local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations as shown in Fig. 4a [16, 20]. We assumed that the visual system’s prior belief over orientation p(θ) follows this distribution and approximate it formally as p(θ) ∝ 2 − | sin(θ)| (black line in Fig. 4b) . (7) Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von-Mises distribution in the homogeneous space on top of a constant spontaneous firing rate (5 Hz). 
We then ˜ applied the inverse transformation F −1 (θ) to all these tuning curves to get the corresponding tuning curves in the physical space (Fig. 4b - red curves), where F (θ) is the cumulative of the prior (7). The concentration parameter for the von-Mises tuning curves was set to κ ≈ 1.6 in the homogeneous space in order to match the measured average tuning width (∼ 32 deg) of neurons in area V1 of the macaque [9]. 4.2 Predicted tuning characteristics of neurons in primary visual cortex The orientation tuning characteristics of our model population well match neurophysiological data of neurons in primary visual cortex (V1). Efficient encoding predicts that the distribution of neurons’ preferred orientation follows the prior, with more neurons tuned to cardinal than oblique orientations by a factor of approximately 1.5. A similar ratio has been found for neurons in area V1 of monkey/cat [9, 10]. Also, the tuning widths of the model neurons vary between 25-42 deg depending on their preferred tuning (see Fig. 4c), matching the measured tuning width ratio of 0.6 between neurons tuned to the cardinal versus oblique orientations [9]. An important prediction of our model is that most of the tuning curves should be asymmetric. Such asymmetries have indeed been reported for the orientation tuning of neurons in area V1 [6, 7, 8]. We computed the asymmetry index for our model population as defined in previous studies [6, 7], and plotted it as a function of the preferred tuning of each neuron (Fig. 4d). The overall asymmetry index in our model population is 1.24 ± 0.11, which approximately matches the measured values for neurons in area V1 of the cat (1.26 ± 0.06) [6]. It also predicts that neurons tuned to the cardinal and oblique orientations should show less symmetry than those tuned to orientations in between. Finally, 2 Note, that these predictions are likely to change if the external noise is not symmetric. 6 a b 25 firing rate(Hz) 0 orientation(deg) asymmetry vs. tuning width 1.0 2.0 90 2.0 e asymmetry 1.0 0 asymmetry index 50 30 width (deg) 10 90 preferred tuning(deg) -90 0 d 0 0 90 asymmetry index 0 orientation(deg) tuning width -90 0 0 probability 0 -90 c efficient representation 0.01 0.01 image statistics -90 0 90 preferred tuning(deg) 25 30 35 40 tuning width (deg) Figure 4: Tuning characteristics of model neurons. a) Distribution of local orientations in natural images, replotted from [16]. b) Prior used in the model (black) and predicted tuning curves according to efficient coding (red). c) Tuning width as a function of preferred orientation. d) Tuning curves of cardinal and oblique neurons are more symmetric than those tuned to orientations in between. e) Both narrowly and broadly tuned neurons neurons show less asymmetry than neurons with tuning widths in between. neurons with tuning widths at the lower and upper end of the range are predicted to exhibit less asymmetry than those neurons whose widths lie in between these extremes (illustrated in Fig. 4e). These last two predictions have not been tested yet. 4.3 Predicted perceptual biases Our model framework also provides specific predictions for the expected perceptual biases. Humans show systematic biases in perceived orientation of visual stimuli such as e.g. arrays of Gabor patches (Fig. 5a,d). Two types of biases can be distinguished: First, perceived orientations show an absolute bias away from the cardinal orientations, thus away from the peaks of the orientation prior [2, 3]. 
We refer to these biases as absolute because they are typically measured by adjusting a noise-free reference until it matches the orientation of the test stimulus. Interestingly, these repulsive absolute biases become larger as the external stimulus noise becomes smaller (see Fig. 5b). Second, the relative bias between the perceived overall orientations of a high-noise and a low-noise stimulus is toward the cardinal orientations, as shown in Fig. 5c, and thus toward the peak of the prior distribution [3, 16]. The predicted perceptual biases of our model are shown in Fig. 5e,f. We computed the likelihood function according to (2) and used the prior in (7). External noise was modeled by convolving the stimulus likelihood function with a Gaussian (different widths for different noise levels). The predictions match both the reported absolute biases away from, and the relative biases toward, the cardinal orientations. Note that our model framework correctly accounts for the fact that less external noise leads to larger absolute biases (see also the discussion in section 3.2).

Figure 5: Biases in perceived orientation: human data vs. model prediction. a,d) Low- and high-noise orientation stimuli of the type used in [3, 16]. b) Humans show absolute biases in perceived orientation that are away from the cardinal orientations. Data replotted from [2] (pink squares) and [3] (green (black) triangles: bias for low (high) external noise). c) Relative bias between stimuli with different external noise level (high minus low). Data replotted from [3] (blue triangles) and [16] (red circles). e,f) Model predictions for absolute and relative bias.
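To make this external-noise manipulation concrete, the sketch below (again our own construction, not the paper's code) convolves the toy observer's likelihood with a symmetric Gaussian kernel before applying the prior, and compares the average estimate of a 20 deg stimulus for a low- and a high-external-noise condition. The specific noise widths are arbitrary; the expectation from Sections 3.2 and 4.3 is that the absolute (repulsive) bias shrinks as external noise grows, so the relative bias (high minus low) points back toward the cardinal orientation.

```python
import numpy as np

theta = np.linspace(-np.pi / 2, np.pi / 2, 1001)
dt = theta[1] - theta[0]
prior = 2.0 - np.abs(np.sin(theta))                # prior of Eq. (7), unnormalized
prior /= prior.sum() * dt
F = np.cumsum(prior) * dt
F /= F[-1]

def mean_estimate(theta0, sigma_int=0.03, sigma_ext_deg=0.0, n_mc=500, seed=1):
    """Average BLS estimate under internal noise sigma_int (homogeneous units)
    and symmetric external stimulus noise of width sigma_ext_deg (degrees)."""
    rng = np.random.default_rng(seed)
    kernel = None
    if sigma_ext_deg > 0.0:
        s_ext = np.deg2rad(sigma_ext_deg)
        kernel = np.exp(-0.5 * (theta / s_ext) ** 2)   # symmetric noise kernel
    f0 = np.interp(theta0, theta, F)
    est = np.empty(n_mc)
    for i, m in enumerate(rng.normal(f0, sigma_int, size=n_mc)):
        like = np.exp(-0.5 * ((m - F) / sigma_int) ** 2)
        if kernel is not None:                         # external noise widens the
            like = np.convolve(like, kernel, "same")   # likelihood symmetrically
        post = like * prior
        post /= post.sum() * dt
        est[i] = (theta * post).sum() * dt
    return est.mean()

t0 = np.deg2rad(20.0)
low = mean_estimate(t0, sigma_ext_deg=2.0)
high = mean_estimate(t0, sigma_ext_deg=15.0)
print("absolute bias, low / high external noise (deg): %+.2f / %+.2f"
      % (np.rad2deg(low - t0), np.rad2deg(high - t0)))
print("relative bias, high minus low (deg): %+.2f" % np.rad2deg(high - low))
```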
5 Discussion

We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases directly follow from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support for our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11]. We have demonstrated here that such asymmetries can be crucial in explaining perceptual data, even though the resulting estimates appear “anti-Bayesian” at first sight (see also models of sensory adaptation [23]).

Note that we do not provide a neural implementation of the Bayesian inference step. However, we and others have proposed various neural decoding schemes that can approximate Bayes' least-squares estimation using efficient coding [26, 25, 22]. It is also worth pointing out that our estimator is set to minimize total squared error, and that other choices of the loss function (e.g. a MAP estimator) could lead to different predictions. Our framework is general and should be directly applicable to other modalities. In particular, it might provide a new explanation for perceptual biases that are hard to reconcile with traditional Bayesian approaches [5].

Acknowledgments

We thank M. Jogan and A. Tank for helpful comments on the manuscript. This work was partially supported by grant ONR N000141110744.

References

[1] M. Jones and B. C. Love. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34:169–231, 2011.
[2] D. P. Andrews. Perception of contours in the central fovea. Nature, 205:1218–1220, 1965.
[3] A. Tomassini, M. J. Morgan, and J. A. Solomon. Orientation uncertainty reduces perceived obliquity. Vision Res, 50:541–547, 2010.
[4] W. S. Geisler and D. Kersten. Illusions, perception and Bayes. Nature Neuroscience, 5(6):508–510, 2002.
[5] M. O. Ernst. Perceptual learning: inverting the size-weight illusion. Current Biology, 19:R23–R25, 2009.
[6] G. H. Henry, B. Dreher, and P. O. Bishop. Orientation specificity of cells in cat striate cortex. J Neurophysiol, 37(6):1394–1409, 1974.
[7] D. Rose and C. Blakemore. An analysis of orientation selectivity in the cat's visual cortex. Exp Brain Res, 20(1):1–17, 1974.
[8] N. V. Swindale. Orientation tuning curves: empirical description and estimation of parameters. Biol Cybern, 78(1):45–56, 1998.
[9] R. L. De Valois, E. W. Yund, and N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res, 22:531–544, 1982.
[10] B. Li, M. R. Peterson, and R. D. Freeman. The oblique effect: a neural basis in the visual cortex. J Neurophysiol, 90:204–217, 2003.
[11] D. Ganguli and E. P. Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Advances in Neural Information Processing Systems 23, pages 658–666, 2011.
[12] M. D. McDonnell and N. G. Stocks. Maximally Informative Stimuli and Tuning Curves for Sigmoidal Rate-Coding Neurons and Populations. Phys Rev Lett, 101(5):058103, 2008.
[13] H. Helmholtz. Treatise on Physiological Optics (transl.). Thoemmes Press, Bristol, U.K., 2000. Original publication 1867.
[14] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, June 2002.
[15] D. C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996.
[16] A. R. Girshick, M. S. Landy, and E. P. Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat Neurosci, 14(7):926–932, Jul 2011.
[17] M. Jazayeri and M. N. Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):914–916, 2010.
[18] A. A. Stocker and E. P. Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, pages 578–585, April 2006.
[19] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[20] D. M. Coppola, H. R. Purves, A. N. McCoy, and D. Purves. The distribution of oriented contours in the real world. Proc Natl Acad Sci U S A, 95(7):4002–4006, 1998.
[21] N. Brunel and J.-P. Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10(7):1731–1757, 1998.
[22] X.-X. Wei and A. A. Stocker. Bayesian inference with efficient neural population codes. In Lecture Notes in Computer Science, Artificial Neural Networks and Machine Learning – ICANN 2012, Lausanne, Switzerland, volume 7552, pages 523–530, 2012.
[23] A. A. Stocker and E. P. Simoncelli. Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1291–1298. MIT Press, Cambridge, MA, 2006. Oral presentation.
[24] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):1–24, 2007.
[25] D. Ganguli. Efficient coding and Bayesian inference with neural populations. PhD thesis, Center for Neural Science, New York University, New York, NY, September 2012.
[26] B. Fischer. Bayesian estimates from heterogeneous population codes. In Proc. IEEE Intl. Joint Conf. on Neural Networks. IEEE, 2010.

6 0.59240228 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

7 0.5716325 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.54543322 195 nips-2012-Learning visual motion in recurrent neural networks

9 0.53366244 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

10 0.51290119 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

11 0.51103657 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

12 0.50739717 192 nips-2012-Learning the Dependency Structure of Latent Factors

13 0.50446898 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

14 0.49982467 294 nips-2012-Repulsive Mixtures

15 0.49944699 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

16 0.49891734 37 nips-2012-Affine Independent Variational Inference

17 0.4839429 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

18 0.472675 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

19 0.46527269 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

20 0.46513712 65 nips-2012-Cardinality Restricted Boltzmann Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.044), (17, 0.02), (21, 0.052), (38, 0.119), (39, 0.015), (42, 0.016), (54, 0.02), (55, 0.03), (74, 0.063), (76, 0.101), (77, 0.014), (80, 0.09), (92, 0.073), (95, 0.252)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78222066 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding

Author: Philip Sterne, Joerg Bornschein, Abdul-saboor Sheikh, Joerg Luecke, Jacquelyn A. Shelton

Abstract: Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing lowlevel image components. This paper proposes a novel multiple-cause generative model of low-level image statistics that generalizes the standard SC model in two crucial points: (1) it uses a spike-and-slab prior distribution for a more realistic representation of component absence/intensity, and (2) the model uses the highly nonlinear combination rule of maximal causes analysis (MCA) instead of a linear combination. The major challenge is parameter optimization because a model with either (1) or (2) results in strongly multimodal posteriors. We show for the first time that a model combining both improvements can be trained efficiently while retaining the rich structure of the posteriors. We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations. 1

2 0.7344498 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning

Author: Liping Liu, Thomas G. Dietterich

Abstract: In the superset label learning problem (SLL), each training instance provides a set of candidate labels of which one is the true label of the instance. As in ordinary regression, the candidate label set is a noisy version of the true label. In this work, we solve the problem by maximizing the likelihood of the candidate label sets of training instances. We propose a probabilistic model, the Logistic StickBreaking Conditional Multinomial Model (LSB-CMM), to do the job. The LSBCMM is derived from the logistic stick-breaking process. It first maps data points to mixture components and then assigns to each mixture component a label drawn from a component-specific multinomial distribution. The mixture components can capture underlying structure in the data, which is very useful when the model is weakly supervised. This advantage comes at little cost, since the model introduces few additional parameters. Experimental tests on several real-world problems with superset labels show results that are competitive or superior to the state of the art. The discovered underlying structures also provide improved explanations of the classification predictions. 1

3 0.73319191 260 nips-2012-Online Sum-Product Computation Over Trees

Author: Mark Herbster, Stephen Pasteris, Fabio Vitale

Abstract: We consider the problem of performing efficient sum-product computations in an online setting over a tree. A natural application of our methods is to compute the marginal distribution at a vertex in a tree-structured Markov random field. Belief propagation can be used to solve this problem, but requires time linear in the size of the tree, and is therefore too slow in an online setting where we are continuously receiving new data and computing individual marginals. With our method we aim to update the data and compute marginals in time that is no more than logarithmic in the size of the tree, and is often significantly less. We accomplish this via a hierarchical covering structure that caches previous local sum-product computations. Our contribution is three-fold: we i) give a linear time algorithm to find an optimal hierarchical cover of a tree; ii) give a sum-productlike algorithm to efficiently compute marginals with respect to this cover; and iii) apply “i” and “ii” to find an efficient algorithm with a regret bound for the online allocation problem in a multi-task setting. 1

4 0.72382605 26 nips-2012-A nonparametric variable clustering model

Author: Konstantina Palla, Zoubin Ghahramani, David A. Knowles

Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover blockdiagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems. 1

5 0.62908238 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. 
None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. 
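As a toy illustration of the log-linear code in Eq. (1) (our own example, not taken from the paper): for independent Poisson neurons whose Gaussian tuning curves tile the stimulus space, the sum of the tuning curves is roughly constant, the kernel is h_i(s) = log f_i(s), and the log posterior under a flat prior is linear in the spike counts r. All tuning-curve shapes and constants below are assumptions.

```python
import numpy as np

s = np.linspace(-10.0, 10.0, 401)                      # stimulus grid
centers = np.linspace(-12.0, 12.0, 25)                 # tiling population
f = 2.0 + 40.0 * np.exp(-0.5 * ((s[None, :] - centers[:, None]) / 2.0) ** 2)
# For tiling curves, sum_i f_i(s) is approximately constant, so the Poisson
# log likelihood reduces to h(s) . r with h_i(s) = log f_i(s), up to a constant.
h = np.log(f)

rng = np.random.default_rng(0)
s_true = 1.5
r = rng.poisson(f[:, np.argmin(np.abs(s - s_true))])   # one population response

log_post = r @ h                                       # Eq. (1), up to -log Z(r)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (s[1] - s[0])
print("posterior mean %.2f for true stimulus %.2f"
      % ((s * post).sum() * (s[1] - s[0]), s_true))
```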
We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. 
If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 
80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). 
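A generative sampler for this Gamma-Poisson formulation can be sketched as follows (our own rendering). The Gamma distribution is taken with shape α0 and scale C, which matches the statement that C sets the expected number of spikes per trial; that parameterization, and every numeric constant, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, D = 8, 100, 500                      # latent causes, neurons, trials
alpha0, eta0, C = 0.5, 1.0, 50.0           # all constants assumed

beta = rng.dirichlet(np.full(J, eta0), size=K)      # beta[k, j]: pattern of cause k
c = rng.gamma(alpha0, C, size=(D, K))               # un-normalized intensities c[d, k]
N = rng.poisson(c[:, :, None] * beta[None, :, :])   # N[d, k, j]: per-cause spike counts
R = N.sum(axis=1)                                   # R[d, j]: observed spikes of neuron j

print("mean spikes per trial: %.1f (roughly K * alpha0 * C = %.1f)"
      % (R.sum(axis=1).mean(), K * alpha0 * C))
```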
This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . 
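The online updates of Algorithm 2 can be rendered compactly as below (our own transcription, using SciPy's digamma for ψ). The learning rate dt, a fixed inner-iteration count in place of a convergence test, and the initialization are assumptions; the update follows the listing as printed, without the dataset-size rescaling used in some online-LDA variants.

```python
import numpy as np
from scipy.special import digamma

def online_vb(R, K, alpha0=0.5, eta0=1.0, dt=0.01, inner_iters=50, seed=0):
    """Online VB for the Gamma-Poisson model; R is a (D, J) count array."""
    D, J = R.shape
    rng = np.random.default_rng(seed)
    eta = eta0 + rng.gamma(1.0, 1.0, size=(J, K))       # eta[j, k]
    for d in range(D):
        p = np.full((J, K), 1.0 / K)                    # responsibilities p[j, k]
        for _ in range(inner_iters):                    # fixed count in lieu of a test
            alpha = alpha0 + R[d] @ p                   # alpha_k = alpha0 + sum_j R_dj p_jk
            w = np.exp(digamma(eta) - digamma(eta.sum(axis=0))[None, :]) \
                * np.exp(digamma(alpha))[None, :]
            p = w / w.sum(axis=1, keepdims=True)        # normalize over causes k
        eta = (1.0 - dt) * eta + dt * (eta0 + R[d][:, None] * p)
    return eta

# e.g. eta = online_vb(R, K=8) with the R array from the sampler sketched above
```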
Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . 
, T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 
6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. 
Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

6 0.62834567 65 nips-2012-Cardinality Restricted Boltzmann Machines

7 0.62499815 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

8 0.62338561 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

9 0.62221003 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

10 0.61946493 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

11 0.61591953 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

12 0.61524439 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

13 0.61441559 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

14 0.61314797 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.61291736 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

16 0.61124164 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

17 0.6110816 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

18 0.61063647 193 nips-2012-Learning to Align from Scratch

19 0.61038256 168 nips-2012-Kernel Latent SVM for Visual Recognition

20 0.60947305 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs