nips nips2009 nips2009-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke
Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex.
Reference: text
sentIndex sentText sentNum sentScore
1 Bayesian estimation of orientation preference maps. Sebastian Gerwinn, MPI for Biological Cybernetics and University of Tübingen, Computational Vision and Neuroscience, Spemannstrasse 41, 72076 Tübingen, sgerwinn@tuebingen.de [sent-1, score-0.746]
2 Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. [sent-12, score-1.21]
3 Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. [sent-13, score-0.589]
4 In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. [sent-14, score-0.908]
5 We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. [sent-15, score-0.758]
6 The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. [sent-16, score-0.566]
7 By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. [sent-17, score-0.842]
8 We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. [sent-19, score-0.288]
9 The most prominent example of such a topographic organization is the layout of neurons according to their preferred orientation, the orientation preference map (OPM) [1, 2, e.g.]. [sent-21, score-0.792]
10 The statistical structure of OPMs [3, 4] and other topographic maps has been the focus of extensive research, as have been the relationships between different maps [5]. [sent-24, score-0.54]
11 Orientation preference maps can be measured using optical imaging of intrinsic signals, voltage sensitive dye imaging, functional magnetic resonance imaging [6], or 2-photon calcium imaging [2, 7]. [sent-25, score-1.422]
12 Therefore, statistical pre-processing of the data is required in order to extract topographic maps from the raw experimental data. [sent-29, score-0.363]
13 Here, we propose to use Gaussian process methods [8] for estimating topographic maps from noisy imaging data. [sent-30, score-0.544]
14 In the case of OPMs, this amounts to estimating the preferred orientation at each pixel by vector averaging the different stimulus orientations weighted according to the evoked responses. [sent-33, score-0.537]
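To make the classical vector-averaging step concrete, here is a minimal Python/NumPy sketch; the array shapes, variable names, and the toy data at the end are illustrative assumptions rather than the paper's actual pipeline.

```python
import numpy as np

def vector_average_opm(responses, thetas):
    """Classical vector-averaging estimate of an orientation preference map.

    responses : (N_trials, n_pixels) evoked response of each pixel on each trial
    thetas    : (N_trials,) stimulus orientation of each trial, in radians
    Returns a complex map m of shape (n_pixels,): np.angle(m) / 2 is the
    preferred orientation and np.abs(m) a measure of orientation selectivity.
    """
    weights = np.exp(2j * thetas)              # cos(2*theta) + i*sin(2*theta)
    return weights @ responses / len(thetas)   # per-pixel weighted vector average

# toy usage with random data, for illustration only
rng = np.random.default_rng(0)
thetas = np.tile(np.arange(8) * np.pi / 8, 10)    # 8 orientations, 10 repeats
responses = rng.normal(size=(thetas.size, 100))   # 100 pixels of pure noise
m_hat = vector_average_opm(responses, thetas)
preferred = np.angle(m_hat) / 2                   # in (-pi/2, pi/2]
```

In practice the raw responses would first be preprocessed (baseline subtraction, normalization) as described later in the text.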
15 One disadvantage of this approach is that the frequency characteristics of the bandpass filters are free parameters which are often set ad-hoc, and may have a substantial impact on the statistics of the obtained map [9, 10]. [sent-35, score-0.28]
16 Rather, they attempt to find 'good' maps by searching for filters which are maximally discriminative between different stimulus conditions. [sent-40, score-0.363]
17 Here, we attempt to combine the strengths of the classical and discriminative models by combining prior knowledge about maps with flexible noise models into a common probabilistic model. [sent-42, score-0.417]
18 We encode prior knowledge about the statistical structure of OPMs in the covariance function of a Gaussian Process prior over maps. [sent-43, score-0.319]
19 Compared to previously proposed methods for analyzing multivariate imaging data, the GP approach has a number of advantages: • Optimal smoothing: The mean of the posterior distribution can be interpreted as an optimally smoothed map. [sent-45, score-0.49]
20 • Non-isotropic and correlated noise: In contrast to the standard smoothing approach, noise with correlations across pixels as well as non-constant variances can be modelled. [sent-49, score-0.417]
21 • Interpolations: The model returns an estimate of the preferred orientation at any location, not only at those at which measurements were obtained. [sent-50, score-0.304]
22 This can be used, e.g., for artifact removal, or for inferring maps from multi-electrode recordings. [sent-53, score-0.26]
23 • Model-based uncertainty estimates: The posterior variance at each pixel can be used to compute a point-wise error bar at that location [9, 11]. [sent-55, score-0.351]
24 By sampling from the posterior (using the full posterior covariance), we can also get error bars on topological or global properties of the map, such as pinwheel counts or locations. [sent-56, score-0.539]
25 Methods (Encoding Model): We model an imaging experiment, where at each of N trials the activity at n pixels is measured. [sent-62, score-0.353]
26 The response r_i(x) at trial i to a stimulus parameterised by v_i is given by r_i(x) = Σ_{k=1}^{d} v_{ki} m_k(x) + ε_i(x) = v_i^⊤ m(x) + ε_i(x), (1) i.e. [sent-63, score-0.518]
27 the mean response at each pixel is modelled to be a linear function of some stimulus parameters vki . [sent-65, score-0.265]
28 We refer to the coefficients mk (x) as feature maps, as they indicate the selectivity of pixel x to stimulus feature k. [sent-68, score-0.254]
29 In the specific case of modelling an orientation preference map, we have d = 2 and v_i = (cos(2θ_i), sin(2θ_i))^⊤. [sent-69, score-0.408]
30 Then, the argument of the complex number m(x) = m_1(x) + i m_2(x) encodes the preferred orientation at location x (arg m(x) equals twice the preferred orientation), whereas the absolute value of m(x) is a measure of its selectivity. [sent-70, score-0.34]
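As a sketch of the encoding model in Eq. (1), the snippet below simulates responses r_i(x) = v_i^⊤ m(x) + ε_i(x) from a given complex map, with v_i = (cos 2θ_i, sin 2θ_i)^⊤; the i.i.d. Gaussian pixel noise used here is a deliberate simplification of the correlated noise covariance discussed below, and the function name and defaults are illustrative.

```python
import numpy as np

def simulate_responses(m_true, thetas, noise_std=1.0, seed=0):
    """Simulate trials from the linear encoding model r_i = v_i^T m + eps_i.

    m_true : complex (n_pixels,); real and imaginary parts are the two
             feature maps m_1(x), m_2(x)
    thetas : (N_trials,) stimulus orientations in radians
    Noise is i.i.d. Gaussian per pixel, a simplification of the paper's
    general noise covariance.
    """
    rng = np.random.default_rng(seed)
    V = np.stack([np.cos(2 * thetas), np.sin(2 * thetas)], axis=1)  # (N, 2)
    M = np.stack([m_true.real, m_true.imag], axis=0)                # (2, n)
    return V @ M + noise_std * rng.normal(size=(thetas.size, m_true.size))
```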
31 We assume that the noise residuals ε are normally distributed with covariance Σ_ε, and a Gaussian prior with covariance K_m for the feature map vector m. [sent-72, score-0.674]
32 if different feature maps are statistically independent a priori, and the stimuli are uncorrelated on average. [sent-75, score-0.286]
33 Hence, inference for different maps 'de-couples', and we do not have to store the full joint covariance over all d maps. [sent-78, score-0.451]
34 Choosing a prior: We need to specify the covariance function K(m(x), m(x′)) of the prior distribution over maps. [sent-80, score-0.319]
35 As cortical maps, and in particular orientation preference maps, have been studied extensively in the past [5], we actually have prior knowledge (rather than just prior assumptions) to guide the choice of a prior. [sent-81, score-0.514]
36 It is known that orientation preference maps are smooth [2] and that they have a semi-periodic structure of regularly spaced columns. [sent-82, score-0.546]
37 Hence, filtering white noise with appropriately chosen filters [18] yields maps which visually look like measured OPMs (see Fig. [sent-83, score-0.364]
38 This will result in a prior which is uncorrelated across the different map components. [sent-91, score-0.338]
39 B) Radial component of prior covariance function and of covariance of raw data. [sent-104, score-0.479]
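A sketch of the construction mentioned above (OPM-like samples obtained by filtering white noise): complex Gaussian white noise is multiplied in the Fourier domain by an annular bandpass filter. The column wavelength and filter bandwidth below are arbitrary illustrative choices, not the covariance function actually fitted in the paper.

```python
import numpy as np

def sample_opm_like_map(ny=128, nx=128, wavelength=20.0, rel_bandwidth=0.2, seed=0):
    """Draw an OPM-like complex map by bandpass-filtering complex white noise."""
    rng = np.random.default_rng(seed)
    white = rng.normal(size=(ny, nx)) + 1j * rng.normal(size=(ny, nx))
    fy = np.fft.fftfreq(ny)[:, None]
    fx = np.fft.fftfreq(nx)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    k0 = 1.0 / wavelength                      # preferred column spacing
    bandpass = np.exp(-(freq - k0) ** 2 / (2 * (rel_bandwidth * k0) ** 2))
    return np.fft.ifft2(np.fft.fft2(white) * bandpass)

m_sample = sample_opm_like_map()               # np.angle(m_sample)/2: orientations
```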
40 Approximate inference: The formulas for the posterior mean and covariance involve covariance matrices over all pixels. [sent-108, score-0.486]
41 On a map of size nx × ny , there are n = nx × ny pixels, so we would have to store and compute with matrices of size n × n, which would limit this approach to maps of relatively small size. [sent-109, score-0.578]
42 In this case, the posterior covariance can be calculated without ever having to store (or even invert) the full prior covariance: Σ_post = I_d ⊗ [K_c − β^{-1} K_c (Σ_ε^{-1} − Σ_ε^{-1} G (β I_q + G^⊤ Σ_ε^{-1} G)^{-1} G^⊤ Σ_ε^{-1}) K_c], (5) where β = 2/N. [sent-115, score-0.452]
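Reading the bracketed term in the reconstructed Eq. (5) as the Woodbury expansion of (Σ_ε + β^{-1} G G^⊤)^{-1}, the sketch below computes the per-pixel posterior variance of one map component under two simplifying assumptions that are mine, not necessarily the paper's: the prior covariance is supplied in low-rank form K_c ≈ G G^⊤ (G is n × q), and the noise covariance is diagonal. Only q × q systems are solved, never n × n, and the posterior mean can be computed with the same low-rank algebra.

```python
import numpy as np

def lowrank_posterior_var(G, sigma_eps_diag, N):
    """Per-pixel posterior variance of one map component, in the spirit of Eq. (5).

    Assumes K_c ~= G @ G.T (n x q low-rank prior) and diagonal noise with
    entries sigma_eps_diag; beta = 2 / N as in the text.
    """
    beta = 2.0 / N
    SG = G / sigma_eps_diag[:, None]           # Sigma_eps^{-1} G            (n, q)
    C = G.T @ SG                               # G^T Sigma_eps^{-1} G        (q, q)
    P = beta * np.eye(G.shape[1]) + C          # beta I_q + C
    M = C - C @ np.linalg.solve(P, C)          # G^T [bracket in Eq. (5)] G
    prior_var = np.sum(G * G, axis=1)                          # diag(K_c)
    return prior_var - np.sum((G @ M) * G, axis=1) / beta      # diag of posterior
```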
43 We restrict the form of the noise covariance either to be diagonal (i.e. independent across pixels) or to have a low-rank plus diagonal structure. [sent-116, score-0.263]
44 In other words, the functional form of the covariance matrix is assumed to be the same as in factor analysis models [22, 23]: The low rank term G models correlation across pixels, whereas the diagonal matrix D models independent noise. [sent-120, score-0.362]
45 We use this model to regularize the noise covariance, ensuring that it has full rank even when the number of data points is less than the number of pixels [22]. [sent-121, score-0.656]
46 The matrices G and D can be fit using expectation maximization without ever having to calculate the full noise covariance across all pixels. [sent-122, score-0.294]
47 We initialize the noise covariance by calculating the noise variances for each stimulus condition, and averaging this initial estimate across stimulus conditions. [sent-123, score-0.66]
48 We iterate between calculating the posterior mean (using the current estimate of Σ_ε), and obtaining a point estimate of the most likely noise covariance given the mean [24]. [sent-124, score-0.391]
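One way to organise the alternation just described is sketched below; it assumes a balanced stimulus design (so that V^⊤V ≈ (N/2) I_2, which is where β = 2/N comes from) and only re-estimates a diagonal noise variance from the residuals, whereas the paper additionally fits a low-rank correlated-noise term by expectation maximization.

```python
import numpy as np

def alternate_mean_and_noise(R, V, K_c, n_iter=5, jitter=1e-6):
    """Alternate posterior-mean updates with a point estimate of the noise.

    R : (N, n) responses, V : (N, 2) stimulus vectors, K_c : (n, n) prior
    covariance of each map component.  Diagonal noise only (a simplification).
    """
    N, n = R.shape
    beta = 2.0 / N
    noise_var = R.var(axis=0) + jitter               # initial noise estimate
    for _ in range(n_iter):
        m_hat = (2.0 / N) * (V.T @ R)                # raw vector-averaged maps (2, n)
        A = K_c + beta * np.diag(noise_var)
        M = (K_c @ np.linalg.solve(A, m_hat.T)).T    # posterior mean maps (2, n)
        residuals = R - V @ M
        noise_var = residuals.var(axis=0) + jitter   # point estimate given the mean
    return M, noise_var
```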
49 Figure 2: Illustration on synthetic data (curves compare GP with correlations, GP without correlations, and optimized smoothing as a function of the number of stimulus presentations): A) Ground truth map used to generate the data. [sent-132, score-0.469]
50 E) Reconstruction by smoothing with fixed Gaussian filter, filter-width optimized by maximizing correlation with ground truth. [sent-137, score-0.294]
51 Results (Illustration on synthetic data): To illustrate the ability of our method to recover maps from noisy recordings, we generated a synthetic map (a sample from the prior distribution, the 'true map', see Fig. [sent-140, score-0.544]
52 We reconstructed the map using our GP method (low rank approximation of rank q = 1600, noise correlations of rank q = 5) on data sets of different sizes (N = 8 ∗ (2, 5, 10, 20, 30, 40, 80)). [sent-143, score-0.531]
53 We use the posterior variances to also calculate a pointwise 95% confidence interval on the preferred orientation at each location, shown in Fig. [sent-145, score-0.475]
54 As expected, the confidence intervals are biggest near pinwheels, where the orientation selectivity of pixels is low, and therefore the preferred orientation is not well defined. [sent-147, score-0.613]
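A pointwise interval of this kind can be obtained by drawing per-pixel samples of (m_1, m_2) from the marginal posterior and looking at the spread of the resulting orientations; the sketch below assumes equal, independent marginal variances for the two components at each pixel and uses a simple circular quantile, which is only one of several reasonable constructions.

```python
import numpy as np

def orientation_interval(mu, var, level=0.95, n_samples=2000, seed=0):
    """Half-width (radians) of a pointwise credible interval on the
    preferred orientation around np.angle(mu) / 2.

    mu  : complex (n,) posterior mean of m(x);  var : (n,) marginal variance,
    assumed equal for both map components and independent across them.
    """
    rng = np.random.default_rng(seed)
    noise = (rng.normal(size=(n_samples, mu.size))
             + 1j * rng.normal(size=(n_samples, mu.size)))
    theta = np.angle(mu[None, :] + np.sqrt(var)[None, :] * noise) / 2
    d = theta - np.angle(mu)[None, :] / 2
    d = (d + np.pi / 2) % np.pi - np.pi / 2       # orientations are pi-periodic
    return np.quantile(np.abs(d), level, axis=0)
```

Intervals computed this way grow where np.abs(mu) is small relative to the posterior spread, consistent with the observation above about pixels near pinwheels.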
55 To evaluate the model, we quantified its reconstruction performance by computing the correlation coefficient between the posterior mean and the true map, each represented as a long vector with 2n elements. [sent-148, score-0.288]
56 We compared the GP map against a map obtained by filtering the raw map (Fig. [sent-149, score-0.789]
57 We can see that the GP map converges to the true map more quickly than the smoothed map (Fig. [sent-153, score-0.836]
58 For example, using 16 stimulus presentations, the smoothed map has a correlation with the ground truth of 0. [sent-155, score-0.618]
59 For the simple smoothing method, about 120 presentations would be required to achieve this performance level. [sent-158, score-0.27]
60 Application to data from ferret visual cortex: To see how well the method works on real data, we used it to analyze data from an intrinsic signal optical imaging experiment. [sent-164, score-0.622]
61 The central portion of the visuotopic map in visual areas V1 and V2 of an anesthetized ferret was imaged with red light while square wave gratings (spatial frequency 0. [sent-165, score-0.448]
62 The response r_i was taken to be the average activity in a 5 second window relative to baseline. Each response vector r_i was normalized to have mean 0 and standard deviation 1; no spatial filtering was performed. [sent-171, score-0.325]
63 The large data set with a total of 800 stimulus presentations made it possible to quantify the performance of our model by comparing it to unsmoothed maps. [sent-173, score-0.297]
64 However, the GP method itself is designed to also work robustly on smaller data sets, and we are primarily interested in its performance in estimating maps using only few stimulus presentations. [sent-175, score-0.363]
65 For real measured data, we do not know the ground truth against which to evaluate the performance of our model. [sent-177, score-0.596]
66 Therefore, we used 5% of the data for estimating the map, and compared this map with the (unsmoothed) map estimated on the other 95% of data, which served as our proxy for ground truth. [sent-178, score-0.572]
67 As above, we compared the GP map against one obtained by smoothing with a Gaussian kernel, where the width of the smoothing kernel was chosen by maximizing its correlation with (our proxy for) the ground truth. [sent-179, score-0.844]
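The smoothing baseline can be sketched as below: the raw (vector-averaged) complex map is smoothed component-wise with scipy's gaussian_filter for a grid of kernel widths, and the width whose result correlates best with the proxy map is kept. The width grid and the correlation measure over the stacked real and imaginary parts (the same 2n-element-vector correlation used above) are assumptions consistent with, but not copied from, the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correlation(a, b):
    """Correlation of two complex maps, stacked as 2n-element real vectors."""
    av = np.concatenate([a.real.ravel(), a.imag.ravel()])
    bv = np.concatenate([b.real.ravel(), b.imag.ravel()])
    return np.corrcoef(av, bv)[0, 1]

def best_smoothed_map(raw_map, proxy_map, sigmas=np.linspace(0.5, 8.0, 16)):
    """Smooth the raw complex map and pick the width maximizing correlation
    with the proxy map (both 2-D complex arrays of the same shape)."""
    candidates = [
        (gaussian_filter(raw_map.real, s) + 1j * gaussian_filter(raw_map.imag, s), s)
        for s in sigmas
    ]
    return max(candidates, key=lambda ms: correlation(ms[0], proxy_map))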
68 The GP map outperformed the smoothing map consistently: For 18 out of 20 different splits into training and test data, the correlation of the GP map was higher (p = 2 × 10−4 , average correlations c = 0. [sent-180, score-1.036]
69 The same held true when we smoothed maps with a Difference of Gaussians filter rather than a Gaussian (19 out of 20, average correlation c = 0. [sent-185, score-0.415]
70 B) Smoothed map estimated from other 80 stimuli, filter width obtained by maximizing the correlation to map A. [sent-189, score-0.624]
71 The GP has a correlation with the map shown in A) of 0. [sent-191, score-0.335]
72 The analysis above shows that, even when the filter width for smoothing is optimized (which would not be possible in a real experiment), the GP still outperforms the approach of smoothing with a Gaussian window. [sent-195, score-0.31]
73 In addition, it is important to keep in mind that using the posterior mean as a clean estimate of the map is only one feature of our model. [sent-196, score-0.374]
74 In the following, we will use the GP model to optimally interpolate a sparsely sampled map, and use the posterior distribution to obtain error bars on the pinwheel counts and locations of the map. [sent-197, score-0.286]
75 Interpolating the map: The posterior mean µ(x) of the model can be evaluated for any x. [sent-199, score-0.374]
76 This makes it possible to extend the map to locations at which no data was recorded. [sent-200, score-0.326]
77 We explored this scenario by cutting out a region of the map described above (inside of ellipse in Fig. [sent-204, score-0.284]
78 The correlation between the true map and the GP map in the filled-in region was 0. [sent-206, score-0.581]
79 Provided that the electrode spacing is small enough, it should be possible to reconstruct at least a rough estimate of the map from such discrete measurements. [sent-211, score-0.278]
80 Then, we attempted to infer the full map using only these 49 measurements, and our prior knowledge about OPMs encoded in the prior covariance. [sent-213, score-0.386]
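A sketch of how the posterior mean interpolates from a handful of measurement sites to arbitrary locations: all that is needed is the prior covariance between query points and measured sites. The difference-of-Gaussians covariance below is only a stand-in with a roughly bandpass spectrum, not the prior actually fitted to the data, the noise is taken as i.i.d., and all function names and defaults are illustrative.

```python
import numpy as np

def dog_cov(X1, X2, s1=3.0, s2=6.0, b=0.2):
    """Difference-of-Gaussians covariance (bandpass-like stand-in for the OPM
    prior); a valid covariance as long as b * s2**2 <= s1**2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s1 ** 2)) - b * np.exp(-d2 / (2 * s2 ** 2))

def interpolate_map(X_obs, m_obs, X_query, noise_var=0.1):
    """Posterior mean of the complex map at X_query given noisy complex
    measurements m_obs at the (e.g. electrode-like) sites X_obs; each map
    component is treated independently, matching the I_2 (x) K_c prior."""
    K = dog_cov(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_star = dog_cov(X_query, X_obs)
    mean_r = k_star @ np.linalg.solve(K, m_obs.real)
    mean_i = k_star @ np.linalg.solve(K, m_obs.imag)
    return mean_r + 1j * mean_i
```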
81 As before, the GP map outperforms the smoothing approach (c = 0. [sent-216, score-0.401]
82 Posterior uncertainty: As both our prior and the likelihood are Gaussian, the posterior distribution is also Gaussian, with mean µ_post and covariance Σ_post. [sent-226, score-0.377]
83 By sampling from this posterior distribution, we can get error bars not only on the preferred orientations in individual pixels (as we did for Fig. [sent-227, score-0.418]
84 For example, the location [10] and total number [3, 4] of pinwheels (singularities at which both map components vanish) have received considerable attention in the past. [sent-229, score-0.329]
85 Figure 5 A) and B) shows two samples from the posterior distribution, which differ both in their pinwheel locations and counts (A: 39, B: 28, C:31). [sent-230, score-0.445]
86 To evaluate our certainty in the pinwheel locations, we calculate a two-dimensional histogram of pinwheel locations across samples (Fig. [sent-231, score-0.554]
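Pinwheels can be located in each posterior sample as phase singularities of the complex map: summing the wrapped phase differences around each elementary 2 x 2 plaquette gives ±2π at a singularity and roughly zero elsewhere. The sketch below uses this standard detection rule (not necessarily the exact procedure of the paper) and accumulates counts and a 2-D histogram of detected locations across samples; the spread of counts across samples gives the error bar on the pinwheel count mentioned in the abstract.

```python
import numpy as np

def pinwheel_mask(m):
    """Plaquettes of the 2-D complex map m that contain a phase singularity."""
    phi = np.angle(m)
    wrap = lambda d: (d + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    circulation = (wrap(phi[:-1, 1:] - phi[:-1, :-1])
                   + wrap(phi[1:, 1:] - phi[:-1, 1:])
                   + wrap(phi[1:, :-1] - phi[1:, 1:])
                   + wrap(phi[:-1, :-1] - phi[1:, :-1]))
    return np.abs(circulation) > np.pi                    # ~ +/- 2*pi at pinwheels

def pinwheel_statistics(samples):
    """Pinwheel counts per posterior sample and a 2-D histogram of locations."""
    counts = [int(pinwheel_mask(m).sum()) for m in samples]
    histogram = np.sum([pinwheel_mask(m) for m in samples], axis=0)
    return counts, histogram
```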
87 Discussion: We introduced Gaussian process methods for estimating orientation preference maps from noisy imaging data. [sent-235, score-0.778]
88 By integrating prior knowledge about the spatial structure of OPMs with a flexible noise model, we aimed to combine the strengths of classical analysis methods with discriminative approaches. [sent-236, score-0.284]
89 While we focused on the analysis of intrinsic signal imaging data, our methods are expected to also be applicable to other kinds of imaging data. [sent-237, score-0.594]
90 Figure 5: Posterior uncertainty (x-axis: stimulus presentations): A, B, C) Three samples from the posterior distribution, using 80 stimuli (zoomed in for better visibility). [sent-243, score-0.301]
91 D E) Density-plot of pinwheel locations when map is estimated with 40 and 800 stimuli, respectively. [sent-244, score-0.563]
92 Functional magnetic resonance imaging is widely used as a non-invasive means of measuring brain activity, and has been reported to be able to estimate orientation preference maps in human subjects [6]. [sent-246, score-0.778]
93 assumed the higher-order correlations of the maps to be minimal. [sent-252, score-0.282]
94 However, it is unclear how general noise correlation structures can be integrated in these models in a flexible manner, and whether the additional complexity of using a more involved noise model would lead to a substantial increase in performance. [sent-256, score-0.257]
95 Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. [sent-262, score-0.289]
96 Self-organization and the selection of pinwheel density in visual cortical development. [sent-271, score-0.35]
97 Principal component analysis and blind separation of sources for optical imaging of intrinsic signals. [sent-291, score-0.44]
98 Physical limits to spatial resolution of optical recording: clarifying the spatial structure of cortical hypercolumns. [sent-294, score-0.278]
99 New paradigm for optical imaging: temporally encoded maps of intrinsic signal. [sent-307, score-0.436]
100 Neuronal selectivity and local map structure in visual cortex. [sent-362, score-0.341]
wordName wordTfidf (topN-words)
[('gp', 0.361), ('opms', 0.284), ('map', 0.246), ('pinwheel', 0.237), ('imaging', 0.232), ('maps', 0.228), ('orientation', 0.19), ('covariance', 0.179), ('kc', 0.16), ('smoothing', 0.155), ('post', 0.139), ('stimulus', 0.135), ('posterior', 0.128), ('preference', 0.128), ('presentations', 0.115), ('preferred', 0.114), ('optical', 0.108), ('bingen', 0.1), ('intrinsic', 0.1), ('smoothed', 0.098), ('ferret', 0.095), ('vi', 0.09), ('correlation', 0.089), ('noise', 0.084), ('topographic', 0.084), ('pixels', 0.081), ('locations', 0.08), ('ri', 0.074), ('reconstruction', 0.071), ('mar', 0.071), ('kaschube', 0.071), ('prior', 0.07), ('interpolations', 0.062), ('km', 0.061), ('stimuli', 0.058), ('spemannstrasse', 0.057), ('gaussian', 0.057), ('spatial', 0.057), ('visual', 0.057), ('cortical', 0.056), ('correlations', 0.054), ('white', 0.052), ('raw', 0.051), ('gratings', 0.05), ('ground', 0.05), ('rank', 0.049), ('pixel', 0.049), ('orientations', 0.049), ('acad', 0.048), ('natl', 0.048), ('cybernetics', 0.048), ('pinwheels', 0.047), ('unsmoothed', 0.047), ('sci', 0.046), ('bars', 0.046), ('functional', 0.045), ('ltering', 0.045), ('store', 0.044), ('width', 0.043), ('variances', 0.043), ('dye', 0.041), ('jakob', 0.041), ('knight', 0.041), ('vki', 0.041), ('activity', 0.04), ('uncorrelated', 0.04), ('response', 0.04), ('lter', 0.039), ('mpi', 0.039), ('selectivity', 0.038), ('aimed', 0.038), ('calcium', 0.038), ('ellipse', 0.038), ('voltage', 0.038), ('kernel', 0.038), ('neuroimage', 0.037), ('location', 0.036), ('signaling', 0.036), ('strengths', 0.035), ('dec', 0.034), ('bandpass', 0.034), ('leonard', 0.034), ('duke', 0.032), ('electrode', 0.032), ('artifact', 0.032), ('integrative', 0.032), ('cov', 0.032), ('biological', 0.032), ('optimally', 0.032), ('mk', 0.032), ('entropy', 0.031), ('ever', 0.031), ('neurophysiol', 0.031), ('organization', 0.03), ('signal', 0.03), ('wolf', 0.03), ('proxy', 0.03), ('isotropic', 0.03), ('monkey', 0.03), ('nx', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 43 nips-2009-Bayesian estimation of orientation preference maps
Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke
Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. 1
2 0.19461085 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
Author: Jaakko Luttinen, Alexander T. Ihler
Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.
3 0.17233352 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features
Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla
Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency.
4 0.13458622 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge
Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint secondorder statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing. 1
5 0.12212279 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
Author: Lei Shi, Thomas L. Griffiths
Abstract: The goal of perception is to infer the hidden states in the hierarchical process by which sensory data are generated. Human behavior is consistent with the optimal statistical solution to this problem in many tasks, including cue combination and orientation detection. Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and the oblique effect. 1
6 0.11896416 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing
7 0.11264502 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
8 0.11110144 100 nips-2009-Gaussian process regression with Student-t likelihood
9 0.10806063 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
10 0.1060544 163 nips-2009-Neurometric function analysis of population codes
11 0.10498799 200 nips-2009-Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)
12 0.10338512 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
13 0.10253958 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
14 0.099219874 70 nips-2009-Discriminative Network Models of Schizophrenia
15 0.095840752 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes
16 0.089023657 164 nips-2009-No evidence for active sparsification in the visual cortex
17 0.08467181 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection
18 0.082989834 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
19 0.0797306 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
20 0.077617317 47 nips-2009-Boosting with Spatial Regularization
topicId topicWeight
[(0, -0.232), (1, -0.165), (2, 0.091), (3, 0.122), (4, 0.049), (5, -0.013), (6, 0.092), (7, -0.068), (8, -0.022), (9, -0.124), (10, -0.035), (11, -0.027), (12, 0.01), (13, 0.079), (14, -0.002), (15, -0.012), (16, -0.117), (17, 0.247), (18, -0.051), (19, -0.164), (20, -0.068), (21, -0.014), (22, -0.03), (23, -0.094), (24, 0.029), (25, 0.024), (26, -0.118), (27, -0.103), (28, 0.09), (29, 0.009), (30, -0.055), (31, 0.114), (32, -0.004), (33, 0.165), (34, 0.003), (35, -0.038), (36, -0.034), (37, 0.048), (38, -0.015), (39, 0.085), (40, 0.05), (41, -0.002), (42, -0.071), (43, -0.001), (44, 0.05), (45, -0.03), (46, 0.0), (47, -0.072), (48, 0.015), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.97178203 43 nips-2009-Bayesian estimation of orientation preference maps
Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke
Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. 1
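As a rough illustration of the posterior-mean smoothing described in this abstract, the sketch below conditions a bivariate (here complex-valued) Gaussian process on noisy simulated measurements and reads out the preferred orientation from the posterior mean. The squared-exponential covariance is only a generic stand-in for the structured OPM prior discussed in the paper, the data are simulated, and a one-dimensional grid is used so the linear algebra stays small.

    import numpy as np

    rng = np.random.default_rng(2)

    # 1-D grid of cortical positions (a 2-D pixel grid works the same with vectorised distances).
    x = np.linspace(0.0, 10.0, 80)[:, None]
    d2 = (x - x.T) ** 2

    def se_cov(d2, amp=1.0, ell=1.5):
        # Stand-in prior covariance; the OPM prior in the paper is more structured.
        return amp * np.exp(-0.5 * d2 / ell**2)

    K = se_cov(d2)
    noise_var = 0.5

    # Simulate the two map components (real and imaginary parts of the complex map).
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
    true_map = L @ rng.normal(size=(len(x), 2))
    y = true_map + np.sqrt(noise_var) * rng.normal(size=true_map.shape)

    # GP posterior mean ("optimally smoothed" map estimate), one component at a time.
    post_mean = K @ np.linalg.solve(K + noise_var * np.eye(len(x)), y)

    # Preferred orientation in [0, pi) from the angle of the posterior mean.
    pref_ori = 0.5 * np.arctan2(post_mean[:, 1], post_mean[:, 0]) % np.pi
    print(pref_ori[:5])

Posterior samples (for error bars on quantities such as pinwheel statistics) would be drawn analogously, by adding Cholesky-factored posterior-covariance noise to this mean.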
2 0.82181364 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
Author: Jaakko Luttinen, Alexander T. Ihler
Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.
3 0.76951486 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features
Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla
Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency. 1 Introduction and previous work Along the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: They achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc. Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n3 ). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]). These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudoinputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance. In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows to remove the constraint that the pseudo-inputs must remain within the same domain as input data. This added flexibility results in an increased performance and allows to encode prior knowledge about other domains where data can be represented more compactly. 1 2 Review of GPs for regression We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review. Assume we are given a training set with n samples D ≡ {xj , yj }n , where each D-dimensional j=1 input xj is associated to a scalar output yj . The regression task goal is, given a new input x∗ , predict the corresponding output y∗ based on D. 
The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f (x)+ε, and then sets a zero-mean1 GP prior on f (x), with covariance k(x, x ), and a zero-mean Gaussian prior on ε, with variance σ 2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f (x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE): 2 k(x, x ) = σ0 exp − 1 2 D d=1 (xd − xd )2 2 d , (1) 2 with hyperparameters σ0 (the latent function power) and { d }D (the length-scales, defining how d=1 rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP are 2 θ = {σ 2 , σ0 , { d }D }. We will omit the dependence on θ for the sake of clarity. d=1 If we evaluate the latent function at X = {xj }n , we obtain a set of latent variables following a j=1 joint Gaussian distribution p(f |X) = N (f |0, Kff ), where [Kff ]ij = k(xi , xj ). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case pGP (y∗ |x∗ , D) = N (y∗ |kf ∗ (Kff + σ 2 In )−1 y, σ 2 + k∗∗ − kf ∗ (Kff + σ 2 In )−1 kf ∗ ), (2) where y = [y1 , . . . , yn ] , kf ∗ = [k(x1 , x∗ ), . . . , k(xn , x∗ )] , and k∗∗ = k(x∗ , x∗ ). In is used to denote the identity matrix of size n. The O(n3 ) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n2 ) time each. These costs make standard GPs impractical for large data sets. To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence. 3 Inter-domain GPs In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones. 3.1 Definition Consider a real-valued GP f (x) with x ∈ RD and some deterministic real function g(x, z), with z ∈ RH . We define the following transformation: u(z) = f (x)g(x, z)dx. (3) RD There are many examples of transformations that take on this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of GP f (x), 1 We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model. 2 it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z). IDGPs arise when we jointly consider f (x) and u(z) as a single, “extended” GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. 
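As a concrete reference point for the equations above, the following sketch implements the ARD SE covariance (1) and the standard GP predictive mean and variance (2) with NumPy. The data and hyperparameter values are arbitrary; the point is only to show the O(n^3) computation that the sparse models below approximate.

    import numpy as np

    def ard_se(X1, X2, sigma0_sq, lengthscales):
        # k(x, x') = sigma0^2 exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)
        diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
        return sigma0_sq * np.exp(-0.5 * np.sum(diff**2, axis=-1))

    def gp_predict(X, y, Xstar, sigma0_sq, lengthscales, noise_var):
        Kff = ard_se(X, X, sigma0_sq, lengthscales)
        Kfs = ard_se(X, Xstar, sigma0_sq, lengthscales)
        Kss = ard_se(Xstar, Xstar, sigma0_sq, lengthscales)
        A = Kff + noise_var * np.eye(len(X))          # (Kff + sigma^2 I): the O(n^3) bottleneck
        alpha = np.linalg.solve(A, y)
        mean = Kfs.T @ alpha
        var = noise_var + np.diag(Kss) - np.einsum('ij,ij->j', Kfs, np.linalg.solve(A, Kfs))
        return mean, var

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, size=(50, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
    Xstar = rng.uniform(-3, 3, size=(5, 2))
    print(gp_predict(X, y, Xstar, sigma0_sq=1.0, lengthscales=np.array([1.0, 1.0]), noise_var=0.01))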
If the distribution of the original GP is f (x) ∼ GP(m(x), k(x, x )), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is m(z) = E[u(z)] = E[f (x)]g(x, z)dx = m(x)g(x, z)dx. RD RD The inter-domain and transformed-domain instances of the covariance function are: k(x, z ) = E[f (x)u(z )] = E f (x) f (x )g(x , z )dx = RD k(z, z ) = E[u(z)u(z )] = E f (x)g(x, z)dx RD = k(x, x )g(x , z )dx f (x )g(x , z )dx RD k(x, x )g(x, z)g(x , z )dxdx . RD (4) RD (5) RD Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows to pre-specify the desired input-domain covariance. The transformation is also more general: Any g(x, z) can be used. We can sample an IDGP at n input-domain points f = [f1 , f2 , . . . , fn ] (with fj = f (xj )) and m transformed-domain points u = [u1 , u2 , . . . , um ] (with ui = u(zi )). With the usual assumption of f (x) being a zero mean GP and defining Z = {zi }m , the joint distribution of these samples is: i=1 Kff Kfu f f 0, X, Z = N p , (6) u u Kfu Kuu with [Kff ]pq = k(xp , xq ), [Kfu ]pq = k(xp , zq ), [Kuu ]pq = k(zp , zq ), which allows to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains. 3.2 Sparse regression using inducing features In the standard regression setting, we are asked to perform inference about the latent function f (x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference. Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: D = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m n and assume that f (x) is well-described by the pseudo data set D, so that any two samples (either from the training or test set) fp and fq with p = q will be independent given xp , xq and D. With this simplifying assumption2 , the prior over f can be factorised as a product of marginals: n p(f |X, Z, u) ≈ p(fj |xj , Z, u). (7) j=1 2 Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conn ditional p(f |X, Z, u) ≈ q(f |X, Z, u) = q (f |xj , Z, u) and then choosing the set of funcj=1 j j tions {qj (·)}n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior j=1 KL(p(f |X, Z, u)p(u|Z)||q(f |X, Z, u)p(u|Z)), as noted in [9], Section 2.3.6. 3 Marginals are in turn obtained from (6): p(fj |xj , Z, u) = N (fj |kj K−1 u, λj ), where kj is the j-th uu row of Kfu and λj is the j-th element of the diagonal of matrix Λf = diag(Kf f − Kfu K−1 Kuf ). uu Operator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix. 
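For intuition about the inter-domain and transformed-domain covariances in (4)–(5), they can be approximated by simple quadrature in one dimension. The Gaussian feature-extraction function g(x, z) below is just an example choice to make the integrals concrete; it is not one of the specific models derived later in the text.

    import numpy as np

    grid = np.linspace(-10.0, 10.0, 2001)   # quadrature grid over the input domain
    dx = grid[1] - grid[0]

    def k_input(x1, x2, ell=1.0):
        # SE covariance in the input domain.
        return np.exp(-0.5 * (x1 - x2) ** 2 / ell**2)

    def g(x, z, width=0.7):
        # Example feature-extraction function: a Gaussian bump centred at the feature z.
        return np.exp(-0.5 * (x - z) ** 2 / width**2)

    def k_cross(x, z):
        # Eq. (4): k(x, z) = integral of k(x, x') g(x', z) dx'.
        return np.sum(k_input(x, grid) * g(grid, z)) * dx

    def k_transformed(z1, z2):
        # Eq. (5): double integral of k(x, x') g(x, z1) g(x', z2) dx dx'.
        K = k_input(grid[:, None], grid[None, :])
        return g(grid, z1) @ K @ g(grid, z2) * dx * dx

    print(k_cross(0.0, 0.5), k_transformed(0.0, 0.5))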
Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f (x): n p(f |X, Z) = p(fj |xj , Z, u)p(u|Z)du = N (f |0, Kfu K−1 Kuf + Λf ) uu p(f , u|X, Z)du ≈ j=1 Using this approximate prior, the posterior distribution for a test case is: pIDGP (y∗ |x∗ , D, Z) = N (y∗ |ku∗ Q−1 Kfu Λ−1 y, σ 2 + k∗∗ + ku∗ (Q−1 − K−1 )ku∗ ), y uu (8) Kfu Λ−1 Kfu y where we have defined Q = Kuu + and Λy = Λf + σ 2 In . The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m2 ) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance and can be interpreted as a degenerate GP plus heteroscedastic white noise. The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ can be expressed as 1 log p(y|X, Z, θ) = − [y Λ−1 y−y Λ−1 Kfu Q−1 Kfu Λ−1 y+log(|Q||Λy |/|Kuu |)+n log(2π)] y y y 2 which is also computable in O(m2 n) time. Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m2 n) cost per step. 3.3 On the choice of g(x, z) The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f (x) on the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f (x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either using prior knowledge about data to select {g(x, zi )}m or using a very general family of functions and letting model i=1 selection automatically choose the appropriate set. Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of “basis functions”. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function. In this work we use the same functional form for every feature, i.e. our function set is {g(x, zi )}m , i=1 but it is also possible to use sets with different functional forms for each inducing feature, i.e. {gi (x, zi )}m where each zi may even have a different size (dimension). In the sections below i=1 we will discuss different possible choices for g(x, z). 
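The sparse predictive distribution (8) above translates into a few lines of linear algebra. The sketch below takes precomputed covariance blocks as inputs so that any feature-extraction choice can be plugged in, and the small demo at the end uses a hypothetical one-dimensional SE kernel purely to exercise the shapes; it is meant to show the O(m^2 n) structure, not to be a tuned implementation.

    import numpy as np

    def sparse_predict(Kuu, Kfu, kdiag_f, ku_star, k_star_star, y, noise_var):
        # Kuu: m x m, Kfu: n x m, kdiag_f: diagonal of Kff (length n),
        # ku_star: k(Z, x*) (length m), k_star_star: k(x*, x*) (scalar).
        Kuu_inv = np.linalg.inv(Kuu)
        lam_f = kdiag_f - np.einsum('ij,jk,ik->i', Kfu, Kuu_inv, Kfu)  # diag(Kff - Kfu Kuu^-1 Kuf)
        lam_y = lam_f + noise_var                                      # Lambda_y (diagonal)
        Q = Kuu + (Kfu / lam_y[:, None]).T @ Kfu                       # Q = Kuu + Kuf Lambda_y^-1 Kfu
        Q_inv = np.linalg.inv(Q)
        mean = ku_star @ Q_inv @ (Kfu.T @ (y / lam_y))
        var = noise_var + k_star_star + ku_star @ (Q_inv - Kuu_inv) @ ku_star
        return mean, var

    # Hypothetical demo with a 1-D SE kernel and inducing features placed on a grid.
    rng = np.random.default_rng(4)
    se = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
    X = rng.uniform(-2, 2, size=(200, 1)); Z = np.linspace(-2, 2, 10)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
    xs = np.array([[0.3]])
    print(sparse_predict(se(Z, Z) + 1e-8 * np.eye(10), se(X, Z), np.ones(200),
                         se(Z, xs)[:, 0], 1.0, y, noise_var=0.01))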
3.3.1 Relation with Sparse GPs using pseudo-inputs The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to Fully Independent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since 4 the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. IDGP innovation with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set gSPGP (x, z) ≡ δ(x − z) where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of IDGP play the role of SPGP’s pseudo-inputs. 3.3.2 Relation with Sparse Multiscale GPs Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs: D 1 gSMGP (x, z) ≡ D d=1 2π(c2 d D kSMGP (x, z ) = exp − d=1 D kSMGP (z, z ) = exp − d=1 − 2) d exp − d=1 (xd − µd )2 2cd2 D µ c with z = 2 d cd2 d=1 (µd − µd )2 2(c2 + cd2 − 2 ) d d (xd − µd )2 2(c2 − 2 ) d d (9) (10) D c2 + d d=1 2 d cd2 − 2. d (11) With this approximation, each basis function has its own centre µ = [µ1 , µ2 , . . . , µd ] and its own length-scales c = [c1 , c2 , . . . , cd ] , whereas global length-scales { d }D are shared by all d=1 inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining kSMGP (·, ·) converge if and only if c2 ≥ 2 , ∀d , which suggests that other values, d d even if permitted in [11], should be avoided for the model to remain well defined. 3.3.3 Frequency Inducing Features GP If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP). Directly applying the Fourier transform is not possible because the target function is not square 2 integrable (it has constant power σ0 everywhere, so (5) does not converge). We will workaround this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window3 . Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by: gFIF (x, z) ≡ D 1 D d=1 2πc2 d D kFIF (x, z ) = exp − d=1 D kFIF (z, z ) = exp − d=1 exp − d=1 x2 d cos ω0 + 2c2 d x2 + c2 ωd2 d d cos ω0 + 2(c2 + 2 ) d d D d=1 exp − d=1 D + exp 3 c4 (ωd + ωd )2 d − 2(2c2 + 2 ) d d d=1 x d ωd with z = ω D 2 d d=1 c2 + d 2 d (13) c4 (ωd − ωd )2 d cos(ω0 − ω0 ) 2(2c2 + 2 ) d d D cos(ω0 + ω0 ) d=1 2 d 2c2 + d 2. d A mixture of m Gaussians could also be used as window without increasing the complexity order. 
5 (12) d=1 c2 ωd xd d c2 + 2 d d D 2 c2 (ωd + ωd2 ) d 2(2c2 + 2 ) d d D (14) The inducing features are ω = [ω0 , ω1 , . . . , ωd ] , where ω0 is the phase and the remaining components are frequencies along each dimension. In this model, both global length-scales { d }D and d=1 window length-scales {cd }D are shared, thus cd = cd . Instances (13) and (14) are induced by (12) d=1 using (4) and (5). 3.3.4 Time-Frequency Inducing Features GP Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it Time-Frequency Inducing Features GP (TFIFGP). It is defined by gTFIF (x, z) ≡ gFIF (x − µ, ω), with z = [µ ω ] . The implied inter-domain and transformed-domain instances of the covariance function are: D kTFIF (x, z ) = kFIF (x − µ , ω ) , kTFIF (z, z ) = kFIF (z, z ) exp − d=1 (µd − µd )2 2(2c2 + 2 ) d d FIFGP is trivially obtained by setting every centre to zero {µi = 0}m , whereas SPGP is obtained i=1 by setting window length-scales c, frequencies and phases {ω i }m to zero. If the window lengthi=1 scales were individually adjusted, SMGP would be obtained. While FIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice due to it having roughly twice as many hyperparameters, thus making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here. 4 Experiments In this section we will compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP on some large data sets, for the same number of inducing features/inputs and therefore, roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically big amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1). We use four large data sets: Kin-40k, Pumadyn-32nm4 (describing the dynamics of a robot arm, used with SPGP in [1]), Elevators and Pole Telecomm5 (related to the control of the elevators of an F16 aircraft and a telecommunications problem, and used in [12, 13, 14]). Input dimensions that remained constant throughout the training set were removed. Input data was additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually take discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the corresponding quantization noise variance (102 /12) as lower bound for the noise hyperparameter6 . n 1 2 2 2 Hyperparameters are initialised as follows: σ0 = n j=1 yj , σ 2 = σ0 /4, { d }D to one half of d=1 the range spanned by training data along each dimension. For SPGP, pseudo-inputs are initialised to a random subset of the training data, for FIFGP window size c is initialised to the standard deviation of input data, frequencies are randomly chosen from a zero-mean −2 -variance Gaussian d distribution, and phases are obtained from a uniform distribution in [0 . . . 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation. 
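The hyperparameter initialisation described in this paragraph maps directly onto code. The sketch below assumes generic X (n x D) and y arrays and returns only the starting values that evidence maximisation would subsequently refine.

    import numpy as np

    def init_hyperparameters(X, y, m, rng):
        sigma0_sq = np.mean(y ** 2)                           # sigma_0^2 = (1/n) sum_j y_j^2
        noise_var = sigma0_sq / 4.0                           # sigma^2 = sigma_0^2 / 4
        lengthscales = 0.5 * (X.max(axis=0) - X.min(axis=0))  # half the range per dimension
        # SPGP: pseudo-inputs initialised to a random subset of the training inputs.
        pseudo_inputs = X[rng.choice(len(X), size=m, replace=False)]
        # FIFGP: window size from the input standard deviation, random frequencies and phases.
        window = X.std(axis=0)
        freqs = rng.normal(scale=1.0 / lengthscales, size=(m, X.shape[1]))
        phases = rng.uniform(0.0, 2.0 * np.pi, size=m)
        return dict(sigma0_sq=sigma0_sq, noise_var=noise_var, lengthscales=lengthscales,
                    pseudo_inputs=pseudo_inputs, window=window, freqs=freqs, phases=phases)

    rng = np.random.default_rng(5)
    X = rng.uniform(-1, 1, size=(500, 3)); y = np.sin(X[:, 0])
    print(init_hyperparameters(X, y, m=20, rng=rng)['lengthscales'])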
Denoting the output average over the training set as y and the predictive mean and variance for test sample y∗l as µ∗l and σ∗l respectively, we define the following quality measures: Normalized Mean Square Error (NMSE) (y∗l − µ∗l )2 / (y∗l − y)2 and Mean Negative Log-Probability (MNLP) 1 2 2 2 2 (y∗l − µ∗l ) /σ∗l + log σ∗l + log 2π , where · averages over the test set. 4 Kin-40k: 8 input dimensions, 10000/30000 samples for train/test, Pumadyn-32nm: 32 input dimensions, 7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that their error measure is actually one half of the Normalized Mean Square Error defined here. 5 Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators: 17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from http://www.liaad.up.pt/∼ltorgo/Regression/datasets.html 6 If unconstrained, similar plots are obtained; in particular, no overfitting is observed. 6 For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out the 32 input dimensions are relevant to the regression task, so it can be used as an ARD capabilities test. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to properly find a good solution, FIFGP and especially TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic. Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1]. Mean Negative Log−Probability Normalized Mean Squared Error 2.5 0.5 0.1 0.05 0.01 0.005 SPGP FIFGP TFIFGP Full GP on 10000 data points 0.001 25 50 100 200 300 500 750 SPGP FIFGP TFIFGP Full GP on 10000 data points 2 1.5 1 0.5 0 −0.5 −1 −1.5 1250 25 50 100 200 300 500 750 1250 Inducing features / pseudo−inputs (b) Kin-40k MNLP (semilog plot) Inducing features / pseudo−inputs (a) Kin-40k NMSE (log-log plot) 0.05 0.04 10 25 50 75 Mean Negative Log−Probability Normalized Mean Squared Error 0.2 SPGP FIFGP TFIFGP Full GP on 7168 data points 0.1 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2 100 Inducing features / pseudo−inputs (c) Pumadyn-32nm NMSE (log-log plot) SPGP FIFGP TFIFGP Full GP on 7168 data points 0.15 10 25 50 75 100 Inducing features / pseudo−inputs (d) Pumadyn-32nm MNLP (semilog plot) Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. 5 Conclusions and extensions In this work we have introduced IDGPs, which are able combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. 
Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially for high sparsity regimes. 7 0.25 0.2 0.15 0.1 10 25 50 100 250 Mean Negative Log−Probability Normalized Mean Squared Error −3.8 SPGP FIFGP TFIFGP Full GP on 8752 data points SPGP FIFGP TFIFGP Full GP on 8752 data points −4 −4.2 −4.4 −4.6 −4.8 500 750 1000 10 Inducing features / pseudo−inputs (a) Elevators NMSE (log-log plot) 25 50 100 250 500 750 1000 Inducing features / pseudo−inputs (b) Elevators MNLP (semilog plot) 0.2 0.15 0.1 0.05 0.04 0.03 0.02 0.01 10 25 50 100 250 500 Mean Negative Log−Probability Normalized Mean Squared Error 5.5 SPGP FIFGP TFIFGP Full GP on 10000 data points 4.5 4 3.5 3 2.5 1000 SPGP FIFGP TFIFGP Full GP on 10000 data points 5 10 25 50 100 250 500 1000 Inducing features / pseudo−inputs (d) Pole Telecomm MNLP (semilog plot) Inducing features / pseudo−inputs (c) Pole Telecomm NMSE (log-log plot) Figure 2: Performance of the compared methods on Elevators and Pole Telecomm. Choosing a transformed space for the inducing features enables to use domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates as a detaching of the functional form of the input-domain covariance and the set of basis functions used to express the posterior mean. IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to data are given preference over models that might approximate the full GP more closely. This, though rarely, might lead to harmful overfitting. To more faithfully approximate the full GP and avoid overfitting altogether, our proposal can be combined with the variational approach from [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance. We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains or even imposing constraints on the target function. Acknowledgments We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008- 02473/TEC, and by the Madrid Community under grant S-505/TIC/0223. 8 References [1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, 2006. [2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001. [3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003. [4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000. [5] L. Csat´ and M. Opper. 
Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002. [6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001. [7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. [8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64, 2009. [9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007. [10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005. [11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008. [12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513–1532, 2007. [13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426–436. Springer, 2000. [14] G. Potgieter and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002. [15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009. [16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.
4 0.61010957 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient from a computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
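The robustness argument in this abstract can be seen directly from the negative log-likelihoods of the two observation models: the Gaussian penalty grows quadratically in the residual, while the Student-t penalty grows only logarithmically, so a single outlier has bounded influence. The degrees-of-freedom and scale values below are arbitrary.

    import numpy as np
    from scipy.special import gammaln

    def gauss_nll(r, sigma=1.0):
        return 0.5 * (r / sigma) ** 2 + 0.5 * np.log(2.0 * np.pi * sigma**2)

    def student_t_nll(r, nu=4.0, sigma=1.0):
        c = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(nu * np.pi * sigma**2)
        return -(c - (nu + 1) / 2 * np.log1p((r / sigma) ** 2 / nu))

    residuals = np.array([0.5, 2.0, 10.0])   # the last residual plays the role of an outlier
    print(gauss_nll(residuals))              # grows quadratically with the residual
    print(student_t_nll(residuals))          # grows only logarithmically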
5 0.52838123 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing
Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz
Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor-like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non-linear correlations between the responses of spatially distributed Gabor-like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1
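The "divisive gain modulation from local surrounds" mentioned in this abstract is commonly written as each filter's squared response divided by a constant plus the pooled squared responses of its surround. The sketch below uses that textbook form with arbitrary numbers; it is not the specific model fitted in the paper.

    import numpy as np

    def divisive_normalization(centre_responses, surround_responses, sigma=1.0):
        # Normalise each squared centre response by sigma^2 plus the pooled
        # squared responses of the surround (the normalisation pool).
        pool = sigma**2 + np.sum(surround_responses ** 2)
        return centre_responses ** 2 / pool

    centre = np.array([2.0, 0.5, 1.0])                 # filter outputs at the centre location
    surround = np.array([[1.5, 0.2, 0.8],              # filter outputs at surround locations
                         [2.5, 0.1, 0.9]])
    print(divisive_normalization(centre, surround))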
6 0.51020485 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
7 0.47068864 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
8 0.46239677 163 nips-2009-Neurometric function analysis of population codes
9 0.46041363 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
10 0.44952893 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
11 0.44790369 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
12 0.44740239 164 nips-2009-No evidence for active sparsification in the visual cortex
13 0.42199349 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
14 0.42032129 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
15 0.41711071 70 nips-2009-Discriminative Network Models of Schizophrenia
16 0.40638238 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
17 0.40294126 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
18 0.39995345 13 nips-2009-A Neural Implementation of the Kalman Filter
19 0.39523077 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes
20 0.3905023 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data
topicId topicWeight
[(7, 0.019), (24, 0.025), (25, 0.057), (35, 0.035), (36, 0.061), (39, 0.028), (58, 0.572), (61, 0.011), (62, 0.013), (71, 0.028), (81, 0.037), (86, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.98189735 43 nips-2009-Bayesian estimation of orientation preference maps
Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke
Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. 1
Author: John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, Yi Ma
Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision. 1
3 0.94796956 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont
Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.
4 0.94528019 67 nips-2009-Directed Regression
Author: Yi-hao Kao, Benjamin V. Roy, Xiang Yan
Abstract: When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. When there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. Empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. We propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. We demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. We also develop a theory that motivates the algorithm. 1
5 0.93409348 223 nips-2009-Sparse Metric Learning via Smooth Optimization
Author: Yiming Ying, Kaizhu Huang, Colin Campbell
Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.
6 0.92432946 81 nips-2009-Ensemble Nystrom Method
7 0.74904317 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
8 0.74796659 147 nips-2009-Matrix Completion from Noisy Entries
9 0.72411489 177 nips-2009-On Learning Rotations
10 0.70648009 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering
11 0.69973445 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
12 0.6797576 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
13 0.67420733 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
14 0.66889888 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
15 0.66673076 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
16 0.66336334 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning
17 0.66117698 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
18 0.66095477 163 nips-2009-Neurometric function analysis of population codes
19 0.65721554 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints
20 0.6554451 124 nips-2009-Lattice Regression