nips nips2009 nips2009-163 knowledge-graph by maker-knowledge-mining

163 nips-2009-Neurometric function analysis of population codes


Source: pdf

Author: Philipp Berens, Sebastian Gerwinn, Alexander Ecker, Matthias Bethge

Abstract: The relative merits of different population coding schemes have mostly been analyzed in the framework of stimulus reconstruction using Fisher Information. Here, we consider the case of stimulus discrimination in a two alternative forced choice paradigm and compute neurometric functions in terms of the minimal discrimination error and the Jensen-Shannon information to study neural population codes. We first explore the relationship between minimum discrimination error, Jensen-Shannon Information and Fisher Information and show that the discrimination framework is more informative about the coding accuracy than Fisher Information as it defines an error for any pair of possible stimuli. In particular, it includes Fisher Information as a special case. Second, we use the framework to study population codes of angular variables. Specifically, we assess the impact of different noise correlation structures on coding accuracy in long versus short decoding time windows. That is, for long time windows we use the common Gaussian noise approximation. To address the case of short time windows we analyze the Ising model with identical noise correlation structure. In this way, we provide a new rigorous framework for assessing the functional consequences of noise correlation structures for the representational accuracy of neural population codes that is in particular applicable to short-time population coding. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Neurometric function analysis of population codes Philipp Berens, Sebastian Gerwinn, Alexander S. [sent-1, score-0.577]

2 Abstract The relative merits of different population coding schemes have mostly been analyzed in the framework of stimulus reconstruction using Fisher Information. [sent-5, score-0.951]

3 Here, we consider the case of stimulus discrimination in a two alternative forced choice paradigm and compute neurometric functions in terms of the minimal discrimination error and the Jensen-Shannon information to study neural population codes. [sent-6, score-1.635]

4 We first explore the relationship between minimum discrimination error, Jensen-Shannon Information and Fisher Information and show that the discrimination framework is more informative about the coding accuracy than Fisher Information as it defines an error for any pair of possible stimuli. [sent-7, score-0.89]

5 Second, we use the framework to study population codes of angular variables. [sent-9, score-0.646]

6 Specifically, we assess the impact of different noise correlation structures on coding accuracy in long versus short decoding time windows. [sent-10, score-0.57]

7 To address the case of short time windows we analyze the Ising model with identical noise correlation structure. [sent-12, score-0.167]

8 In this way, we provide a new rigorous framework for assessing the functional consequences of noise correlation structures for the representational accuracy of neural population codes that is in particular applicable to short-time population coding. [sent-13, score-1.271]

9 1 Introduction The relative merits of different population coding schemes have mostly been studied (e. [sent-14, score-0.756]

10 [1, 12], for a review see [2]) in the framework of stimulus reconstruction (figure 1a), where the performance of a code is judged on the basis of the mean squared error E[(θ̂ − θ)²]. [sent-16, score-0.291]

11 That is, if a stimulus θ is encoded by a population of N neurons with tuning curves fi, we ask how well, on average, an estimator can reconstruct the true value of the presented stimulus based on the neural responses r, which were generated by the density p(r|θ). [sent-17, score-1.141]

12 Figure 1: Illustration of the two frameworks for studying population codes (stimulus reconstruction vs. stimulus discrimination). [sent-23, score-0.755]

13 In stimulus reconstruction, an estimator tries to reconstruct the orientation of a stimulus based on a noisy neural response. [sent-25, score-0.391]

14 In stimulus discrimination, an ideal observer needs to choose one of two possible stimuli based on a noisy neural response (2AFC task). [sent-28, score-0.25]

15 A neurometric function shows the error E as a function of ∆θ, the difference between a reference direction θ1 and a second direction θ2 . [sent-30, score-0.374]

16 For the comparison of different coding schemes, it is important that an estimator exists which can actually attain this lower bound. [sent-32, score-0.268]

17 For short time windows and certain types of tuning functions, this may not always be the case [4]. [sent-33, score-0.197]

18 In particular, it is unclear how different population coding schemes affect the fidelity with which a population of binary neurons can encode a stimulus variable. [sent-34, score-1.375]

19 The minimal discrimination error is achieved by the Bayes optimal classifier θ̂ = argmax_s p(s|r), where s ∈ {θ1, θ2} and the prior distribution is p(s) = 1/2. [sent-37, score-0.395]
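
To make the ideal-observer definition above concrete, here is a minimal Python sketch (not from the paper) that simulates the 2AFC task with equal priors. The helpers sample_r (draws a population response r ~ p(r|s)) and log_lik (evaluates log p(r|s)) are hypothetical, user-supplied callables.

    import numpy as np

    def empirical_discrimination_error(sample_r, log_lik, theta1, theta2, n_trials=10000, seed=0):
        # Simulate the Bayes-optimal observer: with p(s) = 1/2, argmax_s p(s|r) reduces to
        # comparing the two likelihoods; the error rate estimates E for this stimulus pair.
        rng = np.random.default_rng(seed)
        errors = 0
        for _ in range(n_trials):
            s_true = theta1 if rng.random() < 0.5 else theta2
            r = sample_r(s_true, rng)
            s_hat = theta1 if log_lik(r, theta1) >= log_lik(r, theta2) else theta2
            errors += int(s_hat != s_true)
        return errors / n_trials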

20 IJS is an interesting measure of coding accuracy since it directly measures the mutual information between the neural responses and the ‘class label’, i. [sent-40, score-0.315]

21 By observing a population response pattern r, the uncertainty (in terms of entropy) about the stimulus is reduced by MI(r, s) = Σ_s p(s) ∫ p(r|s) log[ p(r|s) / Σ_s′ p(r|s′)p(s′) ] dr = IJS, with prior distribution as above. [sent-43, score-0.67]
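
A hedged sketch of how this quantity could be estimated by Monte Carlo, reusing the same hypothetical sample_r and log_lik helpers; the result is reported in bits.

    import numpy as np

    def js_information(sample_r, log_lik, theta1, theta2, n_samples=5000, seed=0):
        # I_JS = MI(r, s) with p(s) = 1/2: average log2[p(r|s) / p(r)] over samples from each class,
        # where p(r) = 1/2 (p(r|theta1) + p(r|theta2)) is the equal-weight mixture.
        rng = np.random.default_rng(seed)
        total = 0.0
        for s, other in ((theta1, theta2), (theta2, theta1)):
            # each class contributes with prior weight 1/2
            for _ in range(n_samples):
                r = sample_r(s, rng)
                lp_s, lp_o = log_lik(r, s), log_lik(r, other)
                log_mix = np.logaddexp(lp_s, lp_o) - np.log(2.0)
                total += (lp_s - log_mix) / np.log(2.0)
        return 0.5 * total / n_samples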

22 In the following, we will restrict our analysis to the special case of shift-invariant population codes for angular variables and compute neurometric functions E(∆θ) and IJS (∆θ) (figure 1c) by setting θ1 = θ and θ2 = θ + ∆θ. [sent-44, score-0.984]

23 Illustration of the connections between the proposed measures of coding accuracy. [sent-72, score-0.265]

24 Minimal discrimination error E(∆θ) (red) is shown as a neurometric curve as a function of ∆θ and is bounded in terms of the Jensen-Shannon information IJS (∆θ) via equations 4 and 5 (black). [sent-73, score-0.659]

25 The computations have been carried out for a population of N = 50 neurons, with average correlations ρ = . [sent-76, score-0.563]

26 For the minimal discrimination error, we use E(∆θ) = (1/2) ∫ min(p(r|θ), p(r|θ + ∆θ)) dr ≈ (1/(2M)) Σ_{i=1}^{M} min(p(r^(i)|θ), p(r^(i)|θ + ∆θ)) / p(r^(i)), where r^(i) is one of M samples drawn from the mixture distribution p(r) = (1/2)(p(r|θ) + p(r|θ + ∆θ)). [sent-80, score-0.4]
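
The estimator above translates directly into code; the sketch below (again using the hypothetical sample_r and log_lik helpers) draws the M samples from the mixture and also indicates how a neurometric curve E(∆θ) would be traced out over a grid of ∆θ values.

    import numpy as np

    def minimal_discrimination_error(sample_r, log_lik, theta, dtheta, n_samples=5000, seed=0):
        # E(dtheta) = 1/2 * integral min(p(r|theta), p(r|theta+dtheta)) dr, estimated by
        # importance sampling from the mixture p(r) = 1/2 (p(r|theta) + p(r|theta+dtheta)).
        rng = np.random.default_rng(seed)
        acc = 0.0
        for _ in range(n_samples):
            s = theta if rng.random() < 0.5 else theta + dtheta   # r^(i) ~ p(r)
            r = sample_r(s, rng)
            lp1, lp2 = log_lik(r, theta), log_lik(r, theta + dtheta)
            log_mix = np.logaddexp(lp1, lp2) - np.log(2.0)
            acc += np.exp(min(lp1, lp2) - log_mix)                # min(p1, p2) / p(r^(i))
        return 0.5 * acc / n_samples

    # Neurometric curve: evaluate the estimator on a grid of Delta-theta values, e.g.
    # deltas = np.deg2rad(np.linspace(0.5, 20, 20))
    # curve = [minimal_discrimination_error(sample_r, log_lik, 0.0, d) for d in deltas]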

27 2 Links between the proposed measures In this section, we link the Fisher Information Jθ of a population code p(r|θ) to the minimum discrimination error E(∆θ) and the Jensen-Shannon Information IJS (∆θ) in the 2AFC paradigm. [sent-85, score-0.817]

28 Second, we bound the minimum discrimination error in terms of the Jensen-Shannon information. [sent-87, score-0.365]

29 Cosine-type tuning functions with rates between 5 and 50 Hz. [sent-92, score-0.202]

30 Box-like tuning function with matched minimal and maximal firing rates. [sent-94, score-0.236]

31 Cosine tuning function resembles the orientation tuning functions of many cortical neurons. [sent-95, score-0.425]

32 Box-like tuning functions, in contrast, have non-constant Fisher Information due to their steep non-linearity. [sent-97, score-0.159]

33 They have been shown to exhibit superior performance over cosine-like tuning functions with respect to the mean squared error [4]. [sent-98, score-0.26]

34 2 From Jensen-Shannon Information to Minimal Discrimination Error The minimal discrimination error E(∆θ) of an ideal observer is bounded from above and below in terms of IJS (∆θ). [sent-105, score-0.427]

35 In figure 2c we show the minimal discrimination error for a population code (red) together with the upper and lower bound (black) obtained by inserting IJS (∆θ) into equations 4 and 5. [sent-114, score-0.942]

36 cosine (black) tuning functions in short-term population codes of a. [sent-128, score-0.867]

37 Although box-like tuning functions are much broader than cosine tuning functions, Ebox usually lies below Ecos. [sent-132, score-0.449]

38 For the cosine case, FI (dashed, approximation as in figure 2c) and Ed (grey) provide accurate accounts of coding accuracy. [sent-133, score-0.329]

39 In contrast, FI grossly overestimates the discrimination error for box-like tuning functions in small and medium sized populations. [sent-134, score-0.616]

40 3 Previous work Only a small number of studies on neural population coding have used other measures than Fisher Information [18, 3, 6, 4]. [sent-139, score-0.757]

41 Two approaches are most closely related to ours: Snippe and Koenderink [18] and Averbeck and Lee [3] used a measure analogous to the sensitivity index d′, defined by (d′)² = ∆µᵀ Σ⁻¹ ∆µ with ∆µ := f(θ + ∆θ) − f(θ) (6), as a measure of coding accuracy. [sent-140, score-0.241]

42 While Snippe and Koenderink have considered only the limit ∆θ → 0, Averbeck and Lee evaluated equation 6 for finite ∆θ using Σ = (1/2)(Σθ + Σθ+∆θ) and converted d′ to a discrimination error Ed = 1 − erf(d′/2). [sent-141, score-0.341]
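
A sketch of this d′-based approximation: f and Sigma are assumed callables returning the mean-rate vector f(θ) and covariance Σ(θ), and the conversion to Ed follows the formula exactly as stated in the text.

    import numpy as np
    from scipy.special import erf

    def dprime_error(f, Sigma, theta, dtheta):
        # (d')^2 = dmu^T S^{-1} dmu with dmu = f(theta+dtheta) - f(theta)
        # and S = 1/2 (Sigma_theta + Sigma_{theta+dtheta})  (equation 6 in the text)
        dmu = f(theta + dtheta) - f(theta)
        S = 0.5 * (Sigma(theta) + Sigma(theta + dtheta))
        d = np.sqrt(dmu @ np.linalg.solve(S, dmu))
        return d, 1.0 - erf(d / 2.0)   # E_d = 1 - erf(d'/2), as given in the text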

43 In that particular case, the entire neurometric function is fully determined by the Fisher Information [9]: d′ = ∆θ √Jθ = ∆θ √Jmean, where Jmean is the linear part of the Fisher Information (cf. [sent-143, score-0.316]

44 In the general case, it is not obvious what aspects of the quality of a population code are captured by the above measure. [sent-145, score-0.475]

45 Fisher Information, on the other hand, can be quite uninformative about the coding accuracy of the population, especially when the tuning functions are highly nonlinear (see figure 3) or noise is large, as in these cases it is not certain whether the Cramer-Rao bound can actually be attained [4]. [sent-148, score-0.587]

46 The examples studied in the next section demonstrate how these shortcomings can be overcome using the minimal discrimination error (equation 1). [sent-149, score-0.419]

47 3 Results After describing the population model used in this study, we will illustrate in a simple example how our proposed framework is more informative than previous approaches. [sent-150, score-0.48]

48 Second, we will investigate how different noise correlation structures impact population coding on different timescales. [sent-151, score-0.941]

49 1 The population model In this section, we describe in detail the population model used in the remainder of the study. [sent-153, score-0.874]

50 We consider a population of N neurons tuned to orientation, where the firing rate of neuron i follows an average tuning profile fi (θ) with (a) a cosine-like shape fi (θ) = λ1 + λ2 ak (θ − φi ) with k = 1 in section 3. [sent-156, score-0.975]

51 That is, for short-term population coding, we assume the population activity to be binary with each neuron either emitting one spike or none. [sent-164, score-0.95]

52 [12], we model the stimulus-dependent covariance matrix as Σij(θ) = δij vi(θ) + (1 − δij) ρij(θ) √(vi(θ)vj(θ)), where vi(θ) is the variance of cell i and ρij(θ) the correlation coefficient. [sent-167, score-0.178]

53 For long-term coding, we set vi (θ) = fi (θ) and for short-term coding, we set vi (θ) = fi (θ)(1 − fi (θ)). [sent-168, score-0.394]

54 We allow for both stimulus and spatial influences on ρ by setting ρij (θ) = σij (θ)c(φi − φj ), where φi is the preferred orientation of neuron i. [sent-169, score-0.296]
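
A minimal sketch of this covariance construction; the spatial-decay kernel c(·) is assumed to be an exponential in the wrapped difference of preferred orientations and σij(θ) is taken as a constant peak correlation rho_max — both illustrative choices, not the paper's exact settings.

    import numpy as np

    def population_covariance(f_theta, phi, rho_max=0.1, decay=1.0, short_term=False):
        # Sigma_ij(theta) = delta_ij v_i + (1 - delta_ij) rho_ij sqrt(v_i v_j),
        # with v_i = f_i for long-term and v_i = f_i (1 - f_i) for short-term coding
        # (for short-term coding, f_theta are spike probabilities in [0, 1]).
        f_theta = np.asarray(f_theta, dtype=float)
        v = f_theta * (1.0 - f_theta) if short_term else f_theta
        dphi = np.angle(np.exp(1j * (phi[:, None] - phi[None, :])))   # wrapped phi_i - phi_j
        rho = rho_max * np.exp(-np.abs(dphi) / decay)                 # assumed limited-range structure c(.)
        Sigma = rho * np.sqrt(np.outer(v, v))
        np.fill_diagonal(Sigma, v)
        return Sigma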

55 2 Minimum discrimination error is more informative than Fisher Information As has been pointed out in [4], the shape of unimodal tuning functions can strongly influence the coding accuracy of population codes of angular variables. [sent-175, score-1.467]

56 In particular, box-like tuning functions can be superior to cosine tuning functions. [sent-176, score-0.449]

57 However, numerical evaluation of the minimum mean squared error for angular variables is much more difficult than the evaluation of the minimal discrimination error proposed here, and the above claim has only been verified up to N = 20 neurons. [sent-177, score-0.501]

58 Here we compute the full neurometric functions for N = 10, 50, 250 binary neurons (figure 4). [sent-178, score-0.444]

59 In this way, we show that the advantage of box-like tuning functions also holds for large numbers of neurons (compare red and black curves in figure 4 a-c). [sent-179, score-0.371]

60 In addition, we note that Fisher Information does not provide an accurate account of the performance of box-like tuning functions: it fails as soon as the nonlinearity in the tuning functions becomes effective and overestimates the true minimal discrimination error E. [sent-180, score-0.781]

61 Similarly, the approximate neurometric functions Ed (∆θ) obtained from equation 6 do not capture the shape of neurometric functions E(∆θ) but underestimate the minimal discrimination error. [sent-181, score-1.109]

62 In contrast, the deviation between both curves stays rather small for cosine tuning functions. [sent-182, score-0.268]

63 3 Stimulus-dependent correlations have opposite effects for long- and short-term population coding The shape of the noise covariance matrix Σθ can strongly influence the coding fidelity of a neural population. [sent-184, score-1.219]

64 In this section, we will use our new framework to study different noise correlation structures for short- and long-term population coding. [sent-186, score-0.621]

65 Previous studies have so far investigated the effect of noise correlations in the long-term case: most studies assumed p(r|θ) to follow a multivariate Gaussian distribution, so that firing rates r|θ ∼ N(f(θ), Σ(θ)) (for a detailed description of the population model see section 3. [sent-187, score-0.698]
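
For the long-term Gaussian case, the pieces above can be combined as sketched below; the tuning-curve parameters (50 neurons, rates between 5 and 50 Hz) are illustrative assumptions, and population_covariance refers to the hypothetical helper given earlier. The resulting sample_r and log_lik plug directly into the minimal_discrimination_error and js_information sketches.

    import numpy as np

    N = 50
    phi = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    f = lambda th: 5.0 + 22.5 * (1.0 + np.cos(th - phi))          # cosine tuning, rates between 5 and 50 Hz
    Sigma = lambda th: population_covariance(f(th), phi)

    def sample_r(th, rng):
        # r | theta ~ N(f(theta), Sigma(theta))
        return rng.multivariate_normal(f(th), Sigma(th))

    def log_lik(r, th):
        m, S = f(th), Sigma(th)
        diff = r - m
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (diff @ np.linalg.solve(S, diff) + logdet + len(r) * np.log(2.0 * np.pi))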

66 Figure 5: Neurometric functions E(∆θ) (a-c) and IJS (∆θ) (d-f) for four different noise correlation structures (x-axes: ∆θ in deg). [sent-219, score-0.172]

67 Medium sized population (N = 15) and long-term coding. [sent-225, score-0.472]

68 Medium sized population (N = 15) and short-term coding. [sent-229, score-0.472]

69 The impact of stimulus-dependent noise correlations in the absence of limited range correlations changes from b/e to c/f (red line). [sent-230, score-0.441]

70 While they are beneficial in long-term coding, they are beneficial in short-term coding only for close angles. [sent-231, score-0.241]

71 FI of the population takes a particularly simple form. [sent-235, score-0.437]

72 For this case, various studies have investigated noise structures where correlations were either uniform across the population (figure 3c) or their magnitude decayed with difference in preferred orientations (figure 3d), ‘limited range structure’ or ‘spatial decay’, see e. [sent-238, score-0.754]

73 find that in the absence of limited range correlations, stimulus-dependent noise correlations (figure 3e) are beneficial for a population code, while in their presence (figure 3f), they are detrimental. [sent-243, score-0.718]

74 We first compute the neurometric functions E(∆θ) and IJS (∆θ) for a population of 100 neurons in the case of long-term coding with a Gaussian noise model for the four possible noise correlation structures (figure 5a). [sent-244, score-1.354]

75 in that we find that the lowest E or the highest IJS is achieved for a population with stimulus-dependent noise correlations and no limited range structure, while a population with stimulus-dependent noise correlations in the presence of spatial decay performs worst. [sent-246, score-1.355]

76 Spatially uniform correlations (figure 3c) provide almost as good a code as the best coding scheme. [sent-247, score-0.405]

7 Next, we directly compare long- and short-term population coding in a population of 15 neurons. [sent-248, score-1.115]

78 For short-term coding, we assume that the population activity is of binary nature, i. [sent-249, score-0.437]

79 Again, we compute neurometric functions E(∆θ) and IJS (∆θ) for all four possible correlation structures. [sent-252, score-0.419]

80 The results for long-term coding do not differ between large and small populations (figure 5b), although relative differences between different coding schemes are less prominent. [sent-253, score-0.547]

81 In contrast, we find that the beneficial impact of stimulus-dependent correlations in the absence of limited range structure reverses in short-term codes for large ∆θ (figure 5c). [sent-254, score-0.386]

82 4 Discussion In this paper, we introduce the computation of neurometric functions as a new framework for studying the representational accuracy of neural population codes. [sent-255, score-0.913]

83 Importantly, it allows for a rigorous treatment of nonlinear population codes (e. [sent-256, score-0.577]

84 box-like tuning functions) and noise correlations for non-Gaussian noise models. [sent-258, score-0.423]

85 This is particularly important for binary population codes on timescales where neurons fire at most one spike. [sent-259, score-0.702]

86 Such codes are of special interest since psychophysical experiments have demonstrated that efficient computations can be performed in cortex on short time scales [19]. [sent-260, score-0.218]

87 Previous studies have mostly focussed on long-term population codes, since in this case it is possible to study many questions analytically using Fisher Information. [sent-261, score-0.49]

88 Although the structure of neural population activity on short timescales has recently attracted much interest [16, 17, 15], population codes for binary population activity and, in particular, the impact of different noise correlation structures on such codes are not well understood. [sent-262, score-1.925]

89 In contrast to previous work [14], neurometric function analysis allows for a comprehensive treatment of both short- and long-term population codes in a single framework. [sent-263, score-0.893]

90 3, we have started to study population codes on short timescales and found important differences in the effect of noise correlations between short- and long-term population codes. [sent-265, score-1.287]

91 2 demonstrates that neurometric functions can provide additional information compared to Fisher Information: While Fisher Information is a single number for each potential population code, neurometric functions in terms of E or IJS assess the coding quality for each pair of stimuli. [sent-268, score-1.396]

92 This also enables us to detect effects like the dependence of the relative performance of different population codes on ∆θ as shown in figure 5 c and f. [sent-269, score-0.598]

93 The framework of stimulus discrimination in a 2AFC task has long been used in psychophysical and neurophysiological studies for measuring the accuracy of orientation coding in the visual system (e. [sent-273, score-0.851]

94 It is therefore appealing to use the same framework also in theoretical investigations on neural population coding since this facilitates the comparison with experimental data. [sent-276, score-0.721]

95 Furthermore, it allows studying population codes for categorical variables since, in contrast to Fisher Information, it does not require the variable of interest to be continuous. [sent-277, score-0.6]

96 The effect of correlated variability on the accuracy of a population code. [sent-288, score-0.465]

97 Effects of noise correlations on information encoding and decoding. [sent-303, score-0.195]

98 Visual orientation and spatial frequency discrimination: a comparison of single neurons and behavior. [sent-320, score-0.176]

99 On decoding the responses of a population of neurons from short time windows. [sent-385, score-0.56]

100 Neuronal population coding of continuous and discrete quantity in the primate posterior parietal cortex. [sent-431, score-0.699]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ijs', 0.483), ('population', 0.437), ('neurometric', 0.316), ('discrimination', 0.26), ('coding', 0.241), ('fisher', 0.208), ('tuning', 0.159), ('codes', 0.14), ('stimulus', 0.139), ('correlations', 0.126), ('fi', 0.112), ('deg', 0.097), ('gure', 0.093), ('jmean', 0.093), ('josic', 0.093), ('cosine', 0.088), ('neurons', 0.085), ('averbeck', 0.081), ('minimal', 0.077), ('jcov', 0.074), ('dkl', 0.073), ('noise', 0.069), ('ising', 0.065), ('orientation', 0.064), ('dr', 0.063), ('correlation', 0.06), ('error', 0.058), ('grey', 0.056), ('snippe', 0.056), ('sher', 0.048), ('angular', 0.048), ('bound', 0.047), ('sd', 0.045), ('functions', 0.043), ('ij', 0.04), ('timescales', 0.04), ('psychophysical', 0.04), ('neuron', 0.039), ('code', 0.038), ('short', 0.038), ('acitivity', 0.037), ('medium', 0.036), ('neurophysiol', 0.036), ('limited', 0.036), ('var', 0.036), ('schemes', 0.036), ('reconstruction', 0.035), ('sized', 0.035), ('black', 0.034), ('ring', 0.034), ('impact', 0.034), ('structures', 0.034), ('studies', 0.033), ('koenderink', 0.033), ('illustration', 0.032), ('observer', 0.032), ('response', 0.031), ('shape', 0.031), ('covariance', 0.031), ('entropy', 0.031), ('vi', 0.029), ('populations', 0.029), ('red', 0.029), ('accuracy', 0.028), ('range', 0.028), ('fano', 0.028), ('estimator', 0.027), ('preferred', 0.027), ('spatial', 0.027), ('bethge', 0.026), ('stimuli', 0.026), ('cos', 0.026), ('box', 0.026), ('neurophysiological', 0.025), ('overestimates', 0.025), ('equations', 0.025), ('bounds', 0.025), ('si', 0.024), ('measures', 0.024), ('shortcomings', 0.024), ('forced', 0.023), ('representational', 0.023), ('studying', 0.023), ('lin', 0.023), ('equation', 0.023), ('neuroscience', 0.023), ('bingen', 0.022), ('neural', 0.022), ('informative', 0.022), ('delity', 0.022), ('merits', 0.022), ('absence', 0.022), ('framework', 0.021), ('effects', 0.021), ('primate', 0.021), ('curves', 0.021), ('mostly', 0.02), ('clarity', 0.02), ('bits', 0.02), ('hz', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 163 nips-2009-Neurometric function analysis of population codes

Author: Philipp Berens, Sebastian Gerwinn, Alexander Ecker, Matthias Bethge

Abstract: The relative merits of different population coding schemes have mostly been analyzed in the framework of stimulus reconstruction using Fisher Information. Here, we consider the case of stimulus discrimination in a two alternative forced choice paradigm and compute neurometric functions in terms of the minimal discrimination error and the Jensen-Shannon information to study neural population codes. We first explore the relationship between minimum discrimination error, Jensen-Shannon Information and Fisher Information and show that the discrimination framework is more informative about the coding accuracy than Fisher Information as it defines an error for any pair of possible stimuli. In particular, it includes Fisher Information as a special case. Second, we use the framework to study population codes of angular variables. Specifically, we assess the impact of different noise correlation structures on coding accuracy in long versus short decoding time windows. That is, for long time windows we use the common Gaussian noise approximation. To address the case of short time windows we analyze the Ising model with identical noise correlation structure. In this way, we provide a new rigorous framework for assessing the functional consequences of noise correlation structures for the representational accuracy of neural population codes that is in particular applicable to short-time population coding. 1

2 0.22306554 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge

Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing. 1

3 0.19194488 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons

Author: Romain Brasselet, Roland Johansson, Angelo Arleo

Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve fibers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efficiency of the haptic discrimination process is quantified by a novel definition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the first afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1

4 0.17398392 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

Author: Kai Yu, Tong Zhang, Yihong Gong

Abstract: This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1

5 0.15986733 164 nips-2009-No evidence for active sparsification in the visual cortex

Author: Pietro Berkes, Ben White, Jozsef Fiser

Abstract: The proposal that cortical activity in the visual cortex is optimized for sparse neural activity is one of the most established ideas in computational neuroscience. However, direct experimental evidence for optimal sparse coding remains inconclusive, mostly due to the lack of reference values on which to judge the measured sparseness. Here we analyze neural responses to natural movies in the primary visual cortex of ferrets at different stages of development and of rats while awake and under different levels of anesthesia. In contrast with prediction from a sparse coding model, our data shows that population and lifetime sparseness decrease with visual experience, and increase from the awake to anesthetized state. These results suggest that the representation in the primary visual cortex is not actively optimized to maximize sparseness. 1

6 0.14354061 52 nips-2009-Code-specific policy gradient rules for spiking neurons

7 0.1394213 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

8 0.1060544 43 nips-2009-Bayesian estimation of orientation preference maps

9 0.10530438 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

10 0.10128989 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

11 0.096668132 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

12 0.086475566 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

13 0.083800487 104 nips-2009-Group Sparse Coding

14 0.081616007 13 nips-2009-A Neural Implementation of the Kalman Filter

15 0.078006193 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

16 0.069694288 200 nips-2009-Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)

17 0.053895406 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

18 0.05355601 88 nips-2009-Extending Phase Mechanism to Differential Motion Opponency for Motion Pop-out

19 0.051871717 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording

20 0.050790656 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, -0.153), (2, 0.222), (3, 0.167), (4, 0.052), (5, -0.065), (6, -0.078), (7, 0.035), (8, 0.049), (9, 0.035), (10, -0.043), (11, -0.009), (12, 0.036), (13, -0.029), (14, 0.015), (15, -0.034), (16, 0.009), (17, 0.098), (18, 0.018), (19, -0.081), (20, -0.046), (21, 0.06), (22, 0.015), (23, 0.013), (24, 0.01), (25, -0.076), (26, 0.04), (27, 0.061), (28, -0.047), (29, 0.022), (30, 0.228), (31, -0.064), (32, -0.156), (33, 0.191), (34, 0.011), (35, 0.057), (36, -0.108), (37, -0.021), (38, -0.185), (39, 0.095), (40, 0.022), (41, 0.075), (42, 0.002), (43, -0.086), (44, 0.081), (45, -0.108), (46, -0.013), (47, -0.051), (48, 0.012), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97558117 163 nips-2009-Neurometric function analysis of population codes

Author: Philipp Berens, Sebastian Gerwinn, Alexander Ecker, Matthias Bethge

Abstract: The relative merits of different population coding schemes have mostly been analyzed in the framework of stimulus reconstruction using Fisher Information. Here, we consider the case of stimulus discrimination in a two alternative forced choice paradigm and compute neurometric functions in terms of the minimal discrimination error and the Jensen-Shannon information to study neural population codes. We first explore the relationship between minimum discrimination error, Jensen-Shannon Information and Fisher Information and show that the discrimination framework is more informative about the coding accuracy than Fisher Information as it defines an error for any pair of possible stimuli. In particular, it includes Fisher Information as a special case. Second, we use the framework to study population codes of angular variables. Specifically, we assess the impact of different noise correlation structures on coding accuracy in long versus short decoding time windows. That is, for long time windows we use the common Gaussian noise approximation. To address the case of short time windows we analyze the Ising model with identical noise correlation structure. In this way, we provide a new rigorous framework for assessing the functional consequences of noise correlation structures for the representational accuracy of neural population codes that is in particular applicable to short-time population coding. 1

2 0.76204395 164 nips-2009-No evidence for active sparsification in the visual cortex

Author: Pietro Berkes, Ben White, Jozsef Fiser

Abstract: The proposal that cortical activity in the visual cortex is optimized for sparse neural activity is one of the most established ideas in computational neuroscience. However, direct experimental evidence for optimal sparse coding remains inconclusive, mostly due to the lack of reference values on which to judge the measured sparseness. Here we analyze neural responses to natural movies in the primary visual cortex of ferrets at different stages of development and of rats while awake and under different levels of anesthesia. In contrast with prediction from a sparse coding model, our data shows that population and lifetime sparseness decrease with visual experience, and increase from the awake to anesthetized state. These results suggest that the representation in the primary visual cortex is not actively optimized to maximize sparseness. 1

3 0.71458423 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons

Author: Romain Brasselet, Roland Johansson, Angelo Arleo

Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve fibers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efficiency of the haptic discrimination process is quantified by a novel definition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the first afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1

4 0.70786601 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge

Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing. 1

5 0.63080215 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

Author: Kai Yu, Tong Zhang, Yihong Gong

Abstract: This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1

6 0.54072118 52 nips-2009-Code-specific policy gradient rules for spiking neurons

7 0.53625029 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

8 0.45826584 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

9 0.38782978 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models

10 0.38695872 43 nips-2009-Bayesian estimation of orientation preference maps

11 0.3702749 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

12 0.35997409 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

13 0.35908622 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

14 0.34379786 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

15 0.32063809 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

16 0.2896831 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording

17 0.28685513 138 nips-2009-Learning with Compressible Priors

18 0.28029108 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

19 0.27781337 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

20 0.27563986 152 nips-2009-Measuring model complexity with the prior predictive


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.241), (24, 0.049), (25, 0.074), (35, 0.068), (36, 0.064), (39, 0.033), (58, 0.124), (61, 0.028), (62, 0.011), (71, 0.048), (81, 0.034), (86, 0.062), (91, 0.04), (92, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8255083 163 nips-2009-Neurometric function analysis of population codes

Author: Philipp Berens, Sebastian Gerwinn, Alexander Ecker, Matthias Bethge

Abstract: The relative merits of different population coding schemes have mostly been analyzed in the framework of stimulus reconstruction using Fisher Information. Here, we consider the case of stimulus discrimination in a two alternative forced choice paradigm and compute neurometric functions in terms of the minimal discrimination error and the Jensen-Shannon information to study neural population codes. We first explore the relationship between minimum discrimination error, Jensen-Shannon Information and Fisher Information and show that the discrimination framework is more informative about the coding accuracy than Fisher Information as it defines an error for any pair of possible stimuli. In particular, it includes Fisher Information as a special case. Second, we use the framework to study population codes of angular variables. Specifically, we assess the impact of different noise correlation structures on coding accuracy in long versus short decoding time windows. That is, for long time windows we use the common Gaussian noise approximation. To address the case of short time windows we analyze the Ising model with identical noise correlation structure. In this way, we provide a new rigorous framework for assessing the functional consequences of noise correlation structures for the representational accuracy of neural population codes that is in particular applicable to short-time population coding. 1

2 0.63205427 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge

Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint second-order statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing. 1

3 0.61337179 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

Author: Arno Onken, Steffen Grünewälder, Klaus Obermayer

Abstract: The linear correlation coefficient is typically used to characterize and analyze dependencies of neural spike counts. Here, we show that the correlation coefficient is in general insufficient to characterize these dependencies. We construct two neuron spike count models with Poisson-like marginals and vary their dependence structure using copulas. To this end, we construct a copula that allows to keep the spike counts uncorrelated while varying their dependence strength. Moreover, we employ a network of leaky integrate-and-fire neurons to investigate whether weakly correlated spike counts with strong dependencies are likely to occur in real networks. We find that the entropy of uncorrelated but dependent spike count distributions can deviate from the corresponding distribution with independent components by more than 25 % and that weakly correlated but strongly dependent spike counts are very likely to occur in biological networks. Finally, we introduce a test for deciding whether the dependence structure of distributions with Poissonlike marginals is well characterized by the linear correlation coefficient and verify it for different copula-based models. 1

4 0.61234939 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

Author: Jaakko Luttinen, Alexander T. Ihler

Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.

5 0.60540354 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla

Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency. 1 Introduction and previous work Along the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: They achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc. Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n3 ). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]). These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudoinputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance. In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows to remove the constraint that the pseudo-inputs must remain within the same domain as input data. This added flexibility results in an increased performance and allows to encode prior knowledge about other domains where data can be represented more compactly. 1 2 Review of GPs for regression We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review. Assume we are given a training set with n samples D ≡ {xj , yj }n , where each D-dimensional j=1 input xj is associated to a scalar output yj . The regression task goal is, given a new input x∗ , predict the corresponding output y∗ based on D. 
The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f (x)+ε, and then sets a zero-mean1 GP prior on f (x), with covariance k(x, x ), and a zero-mean Gaussian prior on ε, with variance σ 2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f (x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE): 2 k(x, x ) = σ0 exp − 1 2 D d=1 (xd − xd )2 2 d , (1) 2 with hyperparameters σ0 (the latent function power) and { d }D (the length-scales, defining how d=1 rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP are 2 θ = {σ 2 , σ0 , { d }D }. We will omit the dependence on θ for the sake of clarity. d=1 If we evaluate the latent function at X = {xj }n , we obtain a set of latent variables following a j=1 joint Gaussian distribution p(f |X) = N (f |0, Kff ), where [Kff ]ij = k(xi , xj ). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case pGP (y∗ |x∗ , D) = N (y∗ |kf ∗ (Kff + σ 2 In )−1 y, σ 2 + k∗∗ − kf ∗ (Kff + σ 2 In )−1 kf ∗ ), (2) where y = [y1 , . . . , yn ] , kf ∗ = [k(x1 , x∗ ), . . . , k(xn , x∗ )] , and k∗∗ = k(x∗ , x∗ ). In is used to denote the identity matrix of size n. The O(n3 ) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n2 ) time each. These costs make standard GPs impractical for large data sets. To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence. 3 Inter-domain GPs In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones. 3.1 Definition Consider a real-valued GP f (x) with x ∈ RD and some deterministic real function g(x, z), with z ∈ RH . We define the following transformation: u(z) = f (x)g(x, z)dx. (3) RD There are many examples of transformations that take on this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of GP f (x), 1 We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model. 2 it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z). IDGPs arise when we jointly consider f (x) and u(z) as a single, “extended” GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. 
If the distribution of the original GP is f (x) ∼ GP(m(x), k(x, x )), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is m(z) = E[u(z)] = E[f (x)]g(x, z)dx = m(x)g(x, z)dx. RD RD The inter-domain and transformed-domain instances of the covariance function are: k(x, z ) = E[f (x)u(z )] = E f (x) f (x )g(x , z )dx = RD k(z, z ) = E[u(z)u(z )] = E f (x)g(x, z)dx RD = k(x, x )g(x , z )dx f (x )g(x , z )dx RD k(x, x )g(x, z)g(x , z )dxdx . RD (4) RD (5) RD Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows to pre-specify the desired input-domain covariance. The transformation is also more general: Any g(x, z) can be used. We can sample an IDGP at n input-domain points f = [f1 , f2 , . . . , fn ] (with fj = f (xj )) and m transformed-domain points u = [u1 , u2 , . . . , um ] (with ui = u(zi )). With the usual assumption of f (x) being a zero mean GP and defining Z = {zi }m , the joint distribution of these samples is: i=1 Kff Kfu f f 0, X, Z = N p , (6) u u Kfu Kuu with [Kff ]pq = k(xp , xq ), [Kfu ]pq = k(xp , zq ), [Kuu ]pq = k(zp , zq ), which allows to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains. 3.2 Sparse regression using inducing features In the standard regression setting, we are asked to perform inference about the latent function f (x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference. Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: D = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m n and assume that f (x) is well-described by the pseudo data set D, so that any two samples (either from the training or test set) fp and fq with p = q will be independent given xp , xq and D. With this simplifying assumption2 , the prior over f can be factorised as a product of marginals: n p(f |X, Z, u) ≈ p(fj |xj , Z, u). (7) j=1 2 Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conn ditional p(f |X, Z, u) ≈ q(f |X, Z, u) = q (f |xj , Z, u) and then choosing the set of funcj=1 j j tions {qj (·)}n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior j=1 KL(p(f |X, Z, u)p(u|Z)||q(f |X, Z, u)p(u|Z)), as noted in [9], Section 2.3.6. 3 Marginals are in turn obtained from (6): p(fj |xj , Z, u) = N (fj |kj K−1 u, λj ), where kj is the j-th uu row of Kfu and λj is the j-th element of the diagonal of matrix Λf = diag(Kf f − Kfu K−1 Kuf ). uu Operator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix. 
Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f (x): n p(f |X, Z) = p(fj |xj , Z, u)p(u|Z)du = N (f |0, Kfu K−1 Kuf + Λf ) uu p(f , u|X, Z)du ≈ j=1 Using this approximate prior, the posterior distribution for a test case is: pIDGP (y∗ |x∗ , D, Z) = N (y∗ |ku∗ Q−1 Kfu Λ−1 y, σ 2 + k∗∗ + ku∗ (Q−1 − K−1 )ku∗ ), y uu (8) Kfu Λ−1 Kfu y where we have defined Q = Kuu + and Λy = Λf + σ 2 In . The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m2 ) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance and can be interpreted as a degenerate GP plus heteroscedastic white noise. The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ can be expressed as 1 log p(y|X, Z, θ) = − [y Λ−1 y−y Λ−1 Kfu Q−1 Kfu Λ−1 y+log(|Q||Λy |/|Kuu |)+n log(2π)] y y y 2 which is also computable in O(m2 n) time. Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m2 n) cost per step. 3.3 On the choice of g(x, z) The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f (x) on the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f (x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either using prior knowledge about data to select {g(x, zi )}m or using a very general family of functions and letting model i=1 selection automatically choose the appropriate set. Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of “basis functions”. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function. In this work we use the same functional form for every feature, i.e. our function set is {g(x, zi )}m , i=1 but it is also possible to use sets with different functional forms for each inducing feature, i.e. {gi (x, zi )}m where each zi may even have a different size (dimension). In the sections below i=1 we will discuss different possible choices for g(x, z). 
3.3.1 Relation with Sparse GPs using pseudo-inputs

The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to the Fully Independent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. The innovation of IDGPs with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set g_SPGP(x, z) ≡ δ(x − z), where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of the IDGP play the role of SPGP's pseudo-inputs.

3.3.2 Relation with Sparse Multiscale GPs

Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with the ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs:

g_SMGP(x, z) ≡ [ 1 / Π_{d=1}^D √(2π(c_d^2 − ℓ_d^2)) ] exp( − Σ_{d=1}^D (x_d − μ_d)^2 / (2(c_d^2 − ℓ_d^2)) ),  with z = [μ^T c^T]^T    (9)

k_SMGP(x, z') = exp( − Σ_{d=1}^D (x_d − μ'_d)^2 / (2 c'_d^2) ) Π_{d=1}^D √( ℓ_d^2 / c'_d^2 )    (10)

k_SMGP(z, z') = exp( − Σ_{d=1}^D (μ_d − μ'_d)^2 / (2(c_d^2 + c'_d^2 − ℓ_d^2)) ) Π_{d=1}^D √( ℓ_d^2 / (c_d^2 + c'_d^2 − ℓ_d^2) ).    (11)

With this approximation, each basis function has its own centre μ = [μ_1, μ_2, ..., μ_D]^T and its own length-scales c = [c_1, c_2, ..., c_D]^T, whereas the global length-scales {ℓ_d}_{d=1}^D are shared by all inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining k_SMGP(·, ·) converge if and only if c_d^2 ≥ ℓ_d^2, ∀d, which suggests that other values, even if permitted in [11], should be avoided for the model to remain well defined.
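As a quick sanity check of (9)-(11), here is a one-dimensional sketch of our own (unit signal variance; the numerical values are arbitrary assumptions) that compares the closed-form SMGP instances against direct quadrature of the defining integral (4):

```python
import numpy as np

ell = 0.5                                          # global input-domain length-scale (assumed)
k_se = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2 / ell ** 2)

def g_smgp(x, mu, c):
    v = c ** 2 - ell ** 2                          # Eq. (9); requires c^2 > ell^2
    return np.exp(-0.5 * (x - mu) ** 2 / v) / np.sqrt(2 * np.pi * v)

def k_smgp_xz(x, mu, c):                           # Eq. (10)
    return np.sqrt(ell ** 2 / c ** 2) * np.exp(-0.5 * (x - mu) ** 2 / c ** 2)

def k_smgp_zz(mu, c, mup, cp):                     # Eq. (11)
    v = c ** 2 + cp ** 2 - ell ** 2
    return np.sqrt(ell ** 2 / v) * np.exp(-0.5 * (mu - mup) ** 2 / v)

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]
x, mu, c = 0.3, -0.4, 0.9
print(k_smgp_xz(x, mu, c),
      np.sum(k_se(x, xs) * g_smgp(xs, mu, c)) * dx)    # (10) vs quadrature of (4)
print(k_smgp_zz(mu, c, 0.6, 1.2))                      # closed-form evaluation of (11)
```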
3.3.3 Frequency Inducing Features GP

If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input-domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP).

Directly applying the Fourier transform is not possible because the target function is not square integrable (it has constant power σ_0^2 everywhere, so (5) does not converge). We will work around this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window (footnote 3). Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by:

g_FIF(x, z) ≡ [ 1 / Π_{d=1}^D √(2π c_d^2) ] exp( − Σ_{d=1}^D x_d^2 / (2 c_d^2) ) cos( ω_0 + Σ_{d=1}^D x_d ω_d ),  with z = ω    (12)

k_FIF(x, z') = exp( − Σ_{d=1}^D (x_d^2 + c_d^2 ℓ_d^2 ω'_d^2) / (2(c_d^2 + ℓ_d^2)) ) cos( ω'_0 + Σ_{d=1}^D c_d^2 ω'_d x_d / (c_d^2 + ℓ_d^2) ) Π_{d=1}^D √( ℓ_d^2 / (c_d^2 + ℓ_d^2) )    (13)

k_FIF(z, z') = (1/2) exp( − Σ_{d=1}^D c_d^2 ℓ_d^2 (ω_d^2 + ω'_d^2) / (2(2c_d^2 + ℓ_d^2)) ) [ exp( − Σ_{d=1}^D c_d^4 (ω_d − ω'_d)^2 / (2(2c_d^2 + ℓ_d^2)) ) cos(ω_0 − ω'_0) + exp( − Σ_{d=1}^D c_d^4 (ω_d + ω'_d)^2 / (2(2c_d^2 + ℓ_d^2)) ) cos(ω_0 + ω'_0) ] Π_{d=1}^D √( ℓ_d^2 / (2c_d^2 + ℓ_d^2) ).    (14)

Footnote 3: A mixture of m Gaussians could also be used as the window without increasing the complexity order.

The inducing features are ω = [ω_0, ω_1, ..., ω_D]^T, where ω_0 is a phase and the remaining components are frequencies along each dimension. In this model, both the global length-scales {ℓ_d}_{d=1}^D and the window length-scales {c_d}_{d=1}^D are shared, thus c'_d = c_d. Instances (13) and (14) are induced by (12) using (4) and (5).

3.3.4 Time-Frequency Inducing Features GP

Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it the Time-Frequency Inducing Features GP (TFIFGP). It is defined by g_TFIF(x, z) ≡ g_FIF(x − μ, ω), with z = [μ^T ω^T]^T. The implied inter-domain and transformed-domain instances of the covariance function are:

k_TFIF(x, z') = k_FIF(x − μ', ω'),   k_TFIF(z, z') = k_FIF(z, z') exp( − Σ_{d=1}^D (μ_d − μ'_d)^2 / (2(2c_d^2 + ℓ_d^2)) ).

FIFGP is trivially obtained by setting every centre to zero, {μ_i = 0}_{i=1}^m, whereas SPGP is obtained by setting the window length-scales c, the frequencies and the phases {ω_i}_{i=1}^m to zero. If the window length-scales were individually adjusted, SMGP would be obtained. While TFIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice because it has roughly twice as many hyperparameters, making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here.

4 Experiments

In this section we compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP, on several large data sets, using the same number of inducing features/inputs and therefore roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically large amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1).

We use four large data sets: Kin-40k and Pumadyn-32nm (footnote 4), describing the dynamics of a robot arm and used with SPGP in [1], and Elevators and Pole Telecomm (footnote 5), related to the control of the elevators of an F16 aircraft and a telecommunications problem, and used in [12, 13, 14]. Input dimensions that remained constant throughout the training set were removed. Input data were additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually take discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the corresponding quantization noise variance (10^2/12) as a lower bound for the noise hyperparameter (footnote 6).

Hyperparameters are initialised as follows: σ_0^2 = (1/n) Σ_{j=1}^n y_j^2, σ^2 = σ_0^2/4, and {ℓ_d}_{d=1}^D to one half of the range spanned by the training data along each dimension. For SPGP, pseudo-inputs are initialised to a random subset of the training data; for FIFGP, the window size c is initialised to the standard deviation of the input data, frequencies are randomly chosen from a zero-mean Gaussian distribution with variance ℓ_d^{-2}, and phases are obtained from a uniform distribution in [0, 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation.
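A compact sketch of this initialisation scheme follows. This is our own code, not the authors' implementation; the function name, the shapes assumed for X (n × D) and y (n,), and the per-dimension treatment of the window size are our own reading of the text.

```python
import numpy as np

def init_hyperparameters(X, y, m, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    n, D = X.shape
    sigma0_sq = np.mean(y ** 2)                        # σ0² = (1/n) Σ y_j²
    sigma_sq = sigma0_sq / 4.0                         # σ² = σ0² / 4
    ell = 0.5 * (X.max(axis=0) - X.min(axis=0))        # ℓ_d: half the data range per dimension
    pseudo_inputs = X[rng.choice(n, size=m, replace=False)]   # SPGP initialisation
    c = X.std(axis=0)                                  # FIFGP window length-scales
    freqs = rng.normal(scale=1.0 / ell, size=(m, D))   # ω_d ~ N(0, ℓ_d^{-2})
    phases = rng.uniform(0.0, 2.0 * np.pi, size=m)     # ω_0 ~ U[0, 2π)
    centres = np.zeros((m, D))                         # TFIFGP window centres set to zero
    return dict(sigma0_sq=sigma0_sq, sigma_sq=sigma_sq, ell=ell,
                pseudo_inputs=pseudo_inputs, c=c, freqs=freqs,
                phases=phases, centres=centres)
```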
Denoting the output average over the training set as ȳ, and the predictive mean and variance for test sample y_*l as μ_*l and σ_*l^2 respectively, we define the following quality measures: the Normalized Mean Square Error (NMSE), ⟨(y_*l − μ_*l)^2⟩ / ⟨(y_*l − ȳ)^2⟩, and the Mean Negative Log-Probability (MNLP), (1/2) ⟨ (y_*l − μ_*l)^2/σ_*l^2 + log σ_*l^2 + log 2π ⟩, where ⟨·⟩ averages over the test set.

Footnote 4: Kin-40k: 8 input dimensions, 10000/30000 samples for train/test. Pumadyn-32nm: 32 input dimensions, 7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that their error measure is actually one half of the Normalized Mean Square Error defined here.

Footnote 5: Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators: 17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from http://www.liaad.up.pt/~ltorgo/Regression/datasets.html

Footnote 6: If unconstrained, similar plots are obtained; in particular, no overfitting is observed.

For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out of the 32 input dimensions are relevant to the regression task, so it can be used as a test of ARD capabilities. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to find a good solution, FIFGP and especially TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic.

Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1].

[Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. Panels: (a) Kin-40k NMSE (log-log plot), (b) Kin-40k MNLP (semilog plot), (c) Pumadyn-32nm NMSE (log-log plot), (d) Pumadyn-32nm MNLP (semilog plot). Each panel plots the quality measure against the number of inducing features / pseudo-inputs for SPGP, FIFGP, TFIFGP and a full GP trained on the full training set.]
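For reference, a minimal sketch (our own code, with assumed argument names) of the two quality measures reported in Figures 1 and 2:

```python
import numpy as np

def nmse(y_test, mu_pred, y_train_mean):
    # Normalized Mean Square Error: <(y* - mu*)^2> / <(y* - ybar)^2>
    return np.mean((y_test - mu_pred) ** 2) / np.mean((y_test - y_train_mean) ** 2)

def mnlp(y_test, mu_pred, var_pred):
    # Mean Negative Log-Probability: 0.5 <(y* - mu*)^2 / var* + log var* + log 2*pi>
    return 0.5 * np.mean((y_test - mu_pred) ** 2 / var_pred
                         + np.log(var_pred) + np.log(2 * np.pi))
```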
5 Conclusions and extensions

In this work we have introduced IDGPs, which are able to combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially in high-sparsity regimes.

[Figure 2: Performance of the compared methods on Elevators and Pole Telecomm. Panels: (a) Elevators NMSE (log-log plot), (b) Elevators MNLP (semilog plot), (c) Pole Telecomm NMSE (log-log plot), (d) Pole Telecomm MNLP (semilog plot). Each panel plots the quality measure against the number of inducing features / pseudo-inputs for SPGP, FIFGP, TFIFGP and a full GP trained on the full training set.]

Choosing a transformed space for the inducing features makes it possible to use domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates into a decoupling of the functional form of the input-domain covariance from the set of basis functions used to express the posterior mean.

IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to the data are given preference over models that might approximate the full GP more closely. This, though rarely, might lead to harmful overfitting. To more faithfully approximate the full GP and avoid overfitting altogether, our proposal can be combined with the variational approach from [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance.

We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains, or even imposing constraints on the target function.

Acknowledgments

We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008-02473/TEC, and by the Madrid Community under grant S-505/TIC/0223.

References

[1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, 2006.
[2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001.
[3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003.
[4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[5] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.
[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64, 2009.
[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.
[12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513–1532, 2007.
[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426–436. Springer, 2000.
[14] G. Potgieter and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002.
[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009.
[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.

6 0.60187602 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

7 0.60120481 100 nips-2009-Gaussian process regression with Student-t likelihood

8 0.60090071 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

9 0.59936994 113 nips-2009-Improving Existing Fault Recovery Policies

10 0.59410155 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

11 0.59381545 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

12 0.59372687 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

13 0.59335768 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

14 0.59259188 97 nips-2009-Free energy score space

15 0.59254426 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

16 0.59185302 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

17 0.59157199 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

18 0.59149432 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

19 0.58990067 3 nips-2009-AUC optimization and the two-sample problem

20 0.58931905 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models