nips nips2005 nips2005-179 knowledge-graph by maker-knowledge-mining

179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs


Source: pdf

Author: Edward Snelson, Zoubin Ghahramani

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 uk Abstract We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. [sent-4, score-0.398]

2 We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. [sent-5, score-0.304]

3 We also find hyperparameters of the covariance function in the same joint optimization. [sent-6, score-0.381]

4 The method can be viewed as a Bayesian regression model with particular input dependent noise. [sent-7, score-0.17]

5 The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. [sent-8, score-0.166]

6 In recent years there have been many attempts to make sparse approximations to the full GP in order to bring this scaling down to M 2 N where M N [1, 2, 3, 4, 5, 6, 7, 8, 9]. [sent-15, score-0.202]

7 Most of these methods involve selecting a subset of the training points of size M (active set) on which to base computation. [sent-16, score-0.146]

8 [7] employ a very fast approximate information gain criterion, which they use to greedily select points into the active set. [sent-19, score-0.181]

9 A major common problem to these methods is that they lack a reliable way of learning kernel hyperparameters, because the active set selection interferes with this learning procedure. [sent-20, score-0.158]

10 [7] construct an approximation to the full GP marginal likelihood, which they try to maximize to find the hyperparameters. [sent-22, score-0.294]

11 The reason for this is that reselecting the active set causes non-smooth fluctuations in the marginal likelihood and its gradients, meaning that they cannot get smooth convergence. [sent-24, score-0.359]

12 Inappropriately learned hyperparameters will adversely affect the quality of solution, especially if one is trying to use them for automatic relevance determination (ARD) [10]. [sent-26, score-0.297]

13 In this paper we circumvent this problem by constructing a GP regression model that enables us to find active set point locations and hyperparameters in one smooth joint optimization. [sent-27, score-0.521]

14 The covariance function of our GP is parameterized by the locations of pseudo-inputs — an active set not constrained to be a subset of the data, found by a continuous optimization. [sent-28, score-0.318]

15 Our model is closely related to several sparse GP approximations, in particular Seeger’s method of projected latent variables (PLV) [7, 8]. [sent-30, score-0.186]

16 In principle we could also apply our technique of moving active set points off data points to approximations such as PLV. [sent-32, score-0.266]

17 We therefore have a multivariate Gaussian distribution on any finite subset of latent variables; in particular, at X: p(f |X) = N (f |0, KN ), where N (f |m, V) is a Gaussian distribution with mean m and covariance V. [sent-38, score-0.18]

18 In a Gaussian process the covariance matrix is constructed from a covariance function, or kernel, K which expresses some prior notion of smoothness of the underlying function: [KN ]nn = K(xn , xn ). [sent-39, score-0.227]

19 Usually the covariance function depends on a small number of hyperparameters θ, which control these smoothness properties. [sent-40, score-0.381]

20 For our experiments later on we will use the standard Gaussian covariance with ARD hyperparameters: 1 K(xn , xn ) = c exp − 2 D d=1 (d) 2 bd x(d) − xn n , θ = {c, b} . [sent-41, score-0.202]

21 (1) In standard GP regression we also assume a Gaussian noise model or likelihood p(y|f ) = N (y|f , σ 2 I). [sent-42, score-0.239]

22 Integrating out the latent function values we obtain the marginal likelihood: p(y|X, θ) = N (y|0, KN + σ 2 I) , (2) which is typically used to train the GP by finding a (local) maximum with respect to the hyperparameters θ and σ 2 . [sent-43, score-0.457]

23 The distribution of the target value at the new point is then: p(y|x, D, θ) = N y kx (KN + σ 2 I)−1 y, Kxx − kx (KN + σ 2 I)−1 kx + σ 2 , (3) where [kx ]n = K(xn , x) and Kxx = K(x, x). [sent-45, score-0.446]

24 The GP is a non-parametric model, because the training data are explicitly required at test time in order to construct the predictive distribution, as is clear from the above expression. [sent-46, score-0.163]

25 GPs are prohibitive for large data sets because training requires O(N 3 ) time due to the inversion of the covariance matrix. [sent-47, score-0.158]

26 Once the inversion is done, prediction is O(N ) for the predictive mean and O(N 2 ) for the predictive variance per new test case. [sent-48, score-0.315]

27 2 Sparse Pseudo-input Gaussian processes (SPGPs) In order to derive a sparse model that is computationally tractable for large data sets, which still preserves the desirable properties of the full GP, we examine in detail the GP predictive distribution (3). [sent-49, score-0.329]

28 Regarding the hyperparameters as known and fixed for now, these functions are effectively parameterized by the locations of the N training input and target pairs, X and y. [sent-51, score-0.504]

29 In this paper we consider a model with likelihood given by the GP predictive distribution, and parameterized by a pseudo data set. [sent-52, score-0.619]

30 The sparsity in the model will arise ¯ because we will generally consider a pseudo data set D of size M < N : pseudo-inputs ¯ ¯ = {¯ m }M and pseudo targets ¯ = {fm }M . [sent-53, score-0.728]

31 We have denoted the pseudo targets X x m=1 f m=1 ¯ instead of y because as they are not real observations, it does not make much sense ¯ f to include a noise variance for them. [sent-54, score-0.457]

32 These assumptions therefore lead to the following single data point likelihood: ¯ f p(y|x, X, ¯) = N y kx K−1 ¯, Kxx − kx K−1 kx + σ 2 , M f M (4) where [KM ]mm = K(¯ m , xm ) and [kx ]m = K(¯ m , x), for m = 1, . [sent-57, score-0.447]

33 x ¯ x This can be viewed as a standard regression model with a particular form of parameterized mean function and input-dependent noise model. [sent-61, score-0.195]

34 given the inputs, giving the complete data likelihood: ¯ f p(y|X, X, ¯) = N n=1 ¯ f p(yn |xn , X, ¯) = N (y|KNM K−1 ¯, Λ + σ 2 I) , M f (5) ¯ where Λ = diag(λ), λn = Knn − kn K−1 kn , and [KNM ]nm = K(xn , xm ). [sent-65, score-0.225]

35 M Learning in the model involves finding a suitable setting of the parameters – an appropriate pseudo data set that explains the real data well. [sent-66, score-0.361]

36 However rather than simply maximize the ¯ likelihood with respect to X and ¯ it turns out that we can integrate out the pseudo targets f ¯. [sent-67, score-0.556]

37 We place a Gaussian prior on the pseudo targets: f p(¯|X) = N (¯|0, KM ) . [sent-68, score-0.352]

38 f ¯ f (6) This is a very reasonable prior because we expect the pseudo data to be distributed in a very similar manner to the real data, if they are to model them well. [sent-69, score-0.361]

39 It is not easy to place a prior on the pseudo-inputs and still remain with a tractable model, so we will find these by maximum likelihood (ML). [sent-70, score-0.159]

40 We find the posterior distribution over pseudo targets ¯ using Bayes rule on (5) and (6): f ¯ p(¯|D, X) = N ¯|KM Q−1 KMN (Λ + σ 2 I)−1 y, KM Q−1 KM , f f M M (7) where QM = KM + KMN (Λ + σ 2 I)−1 KNM . [sent-72, score-0.397]

41 Given a new input x∗ , the predictive distribution is then obtained by integrating the likelihood (4) with the posterior (7): ¯ p(y∗ |x∗ , D, X) = where 2 ¯ f ¯ d¯ p(y∗ |x∗ , X, ¯) p(¯|D, X) = N (y∗ |µ∗ , σ∗ ) , f f (8) µ∗ = k∗ Q−1 KMN (Λ + σ 2 I)−1 y M 2 σ∗ = K∗∗ − k∗ (K−1 − Q−1 )k∗ + σ 2 . [sent-73, score-0.269]

42 (a) y (b) (c) y y x x x Figure 1: Predictive distributions (mean and two standard deviation lines) for: (a) full GP, (b) SPGP trained using gradient ascent on (9), (c) SPGP trained using gradient ascent on (10). [sent-77, score-0.477]

43 Initial pseudo point positions are shown at the top as red crosses; final pseudo point positions are shown at the bottom as blue crosses (the y location on the plots of these crosses is not meaningful). [sent-78, score-0.992]

44 ¯ We are left with the problem of finding the pseudo-input locations X and hyperparameters 2 Θ = {θ, σ }. [sent-79, score-0.368]

45 We can do this by computing the marginal likelihood from (5) and (6): ¯ p(y|X, X, Θ) = ¯ f d¯ p(y|X, X, ¯) p(¯|X) f f ¯ (9) = N (y|0, KNM K−1 KMN + Λ + σ 2 I) . [sent-80, score-0.257]

46 M The marginal likelihood can then be maximized with respect to all these parameters ¯ {X, Θ} by gradient ascent. [sent-81, score-0.352]

47 They closely follow the derivations of hyperparameter gradients of Seeger et al. [sent-83, score-0.149]

48 The exact form of the gradients will of course depend on the functional form of the covariance function chosen, but our method will apply to any covariance that is differentiable with respect to the input points. [sent-86, score-0.28]

49 It is worth saying that the SPGP can be viewed as a standard GP with a particular non-stationary covariance function parameterized by the pseudo-inputs. [sent-87, score-0.177]

50 At this point the marginal likelihood is equal to that of a full GP (2). [sent-90, score-0.35]

51 Moreover the predictive distribution (8) also collapses to the full GP predictive distribution (3). [sent-92, score-0.293]

52 3 Relation to other methods It turns out that Seeger’s method of PLV [7, 8] uses a very similar marginal likelihood approximation and predictive distribution. [sent-95, score-0.449]

53 In particular the marginal likelihood they use is: ¯ p(y|X, X, Θ) = N (y|0, KNM K−1 KMN + σ 2 I) , (10) M which has also been used elsewhere before [1, 4, 5]. [sent-97, score-0.257]

54 They have derived this expression from a somewhat different route, as a direct approximation to the full GP marginal likelihood. [sent-98, score-0.233]

55 (a) y (b) y x (c) y x x Figure 2: Sample data drawn from the marginal likelihood of: (a) a full GP, (b) SPGP, (c) PLV. [sent-99, score-0.35]

56 As discussed earlier, the major difference between our method and these other methods, is that they do not use this marginal likelihood to learn locations of active set input points – only the hyperparameters are learnt from (10). [sent-101, score-0.903]

57 This begged the question of what would happen if we tried to use their marginal likelihood approximation (10) instead of (9) to try to learn pseudo-input locations by gradient ascent. [sent-102, score-0.557]

58 We show that the Λ that appears in the SPGP marginal likelihood (9) is crucial for finding pseudo-input points by gradients. [sent-103, score-0.315]

59 Figure 1 shows what happens when we try to optimize these two likelihoods using gradient ascent with respect to the pseudo inputs, on a simple 1D data set. [sent-104, score-0.591]

60 Plotted are the predictive distributions, initial and final locations of the pseudo inputs. [sent-105, score-0.502]

61 Using the SPGP likelihood, the pseudo-inputs spread themselves along the extent of the training data, and the predictive distribution matches the full GP very closely (Figure 1(b)). [sent-108, score-0.282]

62 Using the PLV likelihood, the points begin to spread, but very quickly become stuck as the gradient pushing the points towards the right becomes tiny (Figure 1(c)). [sent-109, score-0.239]

63 Figure 2 compares data sampled from the marginal likelihoods (9) and (10), given a particular setting of the hyperparameters and a small number of pseudo-input points. [sent-110, score-0.456]

64 The major difference between the two is that the SPGP likelihood has a constant marginal variance of Knn + σ 2 , whereas the PLV decreases to σ 2 away from the pseudo-inputs. [sent-111, score-0.284]

65 Alternatively, the noise component of the PLV likelihood is a constant σ 2 , whereas the SPGP noise grows to Knn + σ 2 away from the pseudo-inputs. [sent-112, score-0.198]

66 Therefore the points become stuck, and we believe this effect accounts for the failure of the PLV likelihood in Figure 1(c). [sent-116, score-0.196]

67 It should be emphasised that the global optimum of the PLV likelihood (10) may well be a good solution, but it is going to be difficult to find with gradients. [sent-117, score-0.138]

68 The SPGP likelihood (9) also suffers from local optima of course, but not so catastrophically. [sent-118, score-0.179]

69 For kin-40k the squares show SPGP with hyperparameters obtained from a full GP and fixed. [sent-125, score-0.424]

70 For pumadyn-32nm the squares show hyperparameters initialized from a full GP. [sent-126, score-0.424]

71 The horizontal lines are a full GP trained on a subset of the data. [sent-128, score-0.18]

72 random involves picking an active set of size M randomly from among training data. [sent-135, score-0.144]

73 info-gain is their own greedy subset selection method, which is extremely cheap to train – barely more expensive than random. [sent-136, score-0.154]

74 Also shown with horizontal lines is the test error for a full GP trained on a subset of the data of size 2000 for data set kin-40k and 1024 for pumadyn-32nm. [sent-138, score-0.208]

75 For these learning curves, they do not actually learn hyperparameters by maximizing their approximation to the marginal likelihood (10). [sent-139, score-0.602]

76 ’s procedure of setting the hyperparameters from the full GP on a subset. [sent-142, score-0.39]

77 We rapidly approach the error of a full GP trained on 2000 points, using a pseudo set of only a few hundred points. [sent-145, score-0.457]

78 We then try the harder task of also finding the hyperparameters at the same time as the pseudo-inputs. [sent-146, score-0.358]

79 have a separate section testing their likelihood approximation (10) to learn hyperparameters, in conjunction with the active set selection methods. [sent-157, score-0.295]

80 They show that it can be used to reliably learn hyperparameters with info-gain for active set sizes of 100 and above. [sent-158, score-0.406]

81 They have more trouble reliably learning hyperparameters for very small active sets. [sent-159, score-0.379]

82 y x x behaviour for large M which seems to be caused by the noise hyperparameter being driven too small (the blue circles have higher likelihood than the red squares below them). [sent-166, score-0.34]

83 For data set pumadyn-32nm, we again try to jointly find hyperparameters and pseudoinputs. [sent-167, score-0.358]

84 Again Figure 3 shows SPGP with extremely low error for small pseudo set size – with just 10 pseudo-inputs we are already close to the error of a full GP trained on 1024 points. [sent-168, score-0.493]

85 However, in this case increasing the pseudo set size does not decrease our error. [sent-169, score-0.331]

86 Although the hyperparameters learnt by our method are reasonable (2 out of the 4 relevant dimensions are found), they are not good enough to get down to the error of the full GP. [sent-171, score-0.443]

87 However if we initialize our gradient algorithm with the hyperparameters of the full GP, we get the points plotted as squares (this time red likelihoods > blue likelihoods, so it is a problem of local optima not overfitting). [sent-172, score-0.761]

88 Now with only a pseudo set of size 25 we reach the performance of the full GP, and significantly outperform the other methods (which also had their hyperparameters set from the full GP). [sent-173, score-0.836]

89 However all these methods must be combined with obtaining hyperparameters in some way – either by a full GP on a subset (generally expensive), or by gradient ascent on an approximation to the likelihood. [sent-177, score-0.623]

90 When you consider this combined task, and that all methods involve some kind of gradient based procedure, then none of the methods are particularly cheap. [sent-178, score-0.139]

91 We believe that the gain in accuracy achieved by our method can often be worth the extra training time associated with optimizing in a larger parameter space. [sent-179, score-0.155]

92 5 Conclusions, extensions and future work Although GPs are very flexible regression models, they are still limited by the form of the covariance function. [sent-180, score-0.155]

93 For example it is difficult to model non-stationary processes with a GP because it is hard to construct sensible non-stationary covariance functions. [sent-181, score-0.157]

94 Although the SPGP is not specifically designed to model non-stationarity, the extra flexibility associated with moving pseudo inputs around can actually achieve this to a certain extent. [sent-182, score-0.378]

95 All runs had higher likelihood than the GP; the one with the highest likelihood is plotted. [sent-190, score-0.276]

96 pseudo set size and/or high dimensional input spaces, because the space in which we are optimizing becomes impractically big. [sent-191, score-0.39]

97 For example we can try optimizing subsets of variables iteratively (chunking), or stochastic gradient ascent, or we could make a hybrid by picking some points randomly and optimizing others. [sent-193, score-0.297]

98 In conclusion, we have presented a new method for sparse GP regression, which shows a significant performance gain over other methods especially when searching for an extremely sparse solution. [sent-197, score-0.303]

99 We have shown that the added flexibility of moving pseudo-input points which are not constrained to lie on the true data points leads to better solutions, and even some non-stationary effects can be modelled. [sent-198, score-0.163]

100 Finally we have shown that hyperparameters can be jointly learned with pseudo-input points with reasonable success. [sent-199, score-0.355]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spgp', 0.448), ('gp', 0.429), ('pseudo', 0.331), ('hyperparameters', 0.297), ('kmn', 0.179), ('plv', 0.175), ('seeger', 0.15), ('kx', 0.142), ('likelihood', 0.138), ('knm', 0.134), ('km', 0.125), ('marginal', 0.119), ('kn', 0.102), ('predictive', 0.1), ('gradient', 0.095), ('full', 0.093), ('crosses', 0.089), ('sparse', 0.088), ('covariance', 0.084), ('active', 0.082), ('regression', 0.071), ('locations', 0.071), ('kxx', 0.067), ('knn', 0.066), ('targets', 0.066), ('gaussian', 0.064), ('ascent', 0.064), ('hyperparameter', 0.062), ('try', 0.061), ('xn', 0.059), ('points', 0.058), ('gps', 0.051), ('parameterized', 0.05), ('processes', 0.048), ('moving', 0.047), ('info', 0.047), ('adversarially', 0.045), ('bart', 0.045), ('qm', 0.045), ('phd', 0.043), ('gain', 0.041), ('latent', 0.041), ('optima', 0.041), ('likelihoods', 0.04), ('inversion', 0.039), ('blue', 0.039), ('thesis', 0.038), ('red', 0.037), ('extremely', 0.036), ('attributes', 0.036), ('csat', 0.036), ('training', 0.035), ('expensive', 0.034), ('squares', 0.034), ('trained', 0.033), ('smo', 0.033), ('ard', 0.033), ('williams', 0.031), ('gradients', 0.031), ('subset', 0.031), ('input', 0.031), ('noise', 0.03), ('plots', 0.03), ('reproduced', 0.03), ('real', 0.03), ('closely', 0.029), ('stuck', 0.028), ('optimizing', 0.028), ('test', 0.028), ('method', 0.028), ('learn', 0.027), ('picking', 0.027), ('bishop', 0.027), ('plotted', 0.027), ('selection', 0.027), ('major', 0.027), ('et', 0.027), ('greedy', 0.026), ('sensible', 0.025), ('learnt', 0.025), ('tried', 0.025), ('spread', 0.025), ('mean', 0.024), ('prediction', 0.024), ('positions', 0.023), ('lines', 0.023), ('bayesian', 0.023), ('worth', 0.023), ('course', 0.022), ('nding', 0.022), ('methods', 0.022), ('approximation', 0.021), ('turns', 0.021), ('xm', 0.021), ('place', 0.021), ('approximations', 0.021), ('causes', 0.02), ('dependent', 0.02), ('viewed', 0.02), ('target', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

Author: Edward Snelson, Zoubin Ghahramani

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1

2 0.31106418 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1

3 0.2108234 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1

4 0.17571107 30 nips-2005-Assessing Approximations for Gaussian Process Classification

Author: Malte Kuss, Carl E. Rasmussen

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihoods p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1. Comparing Laplace’s method to EP the latter shows to be more accurate both in terms of error rate and information. While the error rates are relatively similar the predictive distribution obtained by EP shows to be more informative about the test targets. Note that for GPC the error rate only depends of the sign of the mean µ∗ of the approximated posterior over latent functions and not the entire posterior predictive distribution. As to be expected, the length of the mean vector m shows much larger values for the EP approximations. Comparing EP and SVMs the results are mixed. For the Crabs data set all methods show the same error rate but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty. 5 Summary & Conclusions Our experiments reveal serious differences between Laplace’s method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small amplitude, although the sign of the mean function will be mostly correct. As an effect LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been show empirically on several real world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note, the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we’ve seen the the sign of the latent function tends to be correct (at least at the training locations). Laplace EP SVM Data Set m n E% I m E% I m E% I Ionosphere 351 34 8.84 0.591 49.96 7.99 0.661 124.94 5.69 0.681 Wisconsin 683 9 3.21 0.804 62.62 3.21 0.805 84.95 3.21 0.795 Pima Indians 768 8 22.77 0.252 29.05 22.63 0.253 47.49 23.01 0.232 Crabs 200 7 2.0 0.682 112.34 2.0 0.908 2552.97 2.0 0.047 Sonar 208 60 15.36 0.439 26.86 13.85 0.537 15678.55 11.14 0.567 USPS 3 vs 5 1540 256 2.27 0.849 163.05 2.21 0.902 22011.70 2.01 0.918 Table 1: Results for benchmark data sets. The first three columns give the name of the data set, number of observations m and dimension of inputs n. For Laplace’s method and EP the table reports the average error rate E%, the average information I (2) and the average length m of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label. The EP approximation has shown to give results very close to MCMC both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models, when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2]. Acknowledgements Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. This publication only reflects the authors’ views. References [1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996. [2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005. [3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998. [4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001. [5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998. [6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. [7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003. [8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.

5 0.12967639 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1

6 0.10471889 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

7 0.094638355 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

8 0.090695806 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

9 0.087176345 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

10 0.084618807 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

11 0.084459402 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

12 0.075962633 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

13 0.06568341 105 nips-2005-Large-Scale Multiclass Transduction

14 0.065361552 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

15 0.062907591 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

16 0.062457718 19 nips-2005-Active Learning for Misspecified Models

17 0.062015865 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

18 0.061943047 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

19 0.061116088 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

20 0.060924146 48 nips-2005-Context as Filtering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.046), (2, -0.022), (3, 0.072), (4, 0.165), (5, -0.222), (6, 0.102), (7, -0.114), (8, 0.094), (9, 0.245), (10, 0.023), (11, 0.078), (12, 0.213), (13, 0.201), (14, 0.015), (15, -0.013), (16, 0.074), (17, -0.211), (18, -0.046), (19, -0.107), (20, -0.186), (21, -0.01), (22, -0.139), (23, 0.032), (24, -0.126), (25, -0.103), (26, 0.032), (27, 0.14), (28, 0.019), (29, 0.063), (30, 0.033), (31, 0.019), (32, -0.082), (33, -0.018), (34, 0.019), (35, 0.072), (36, -0.014), (37, 0.181), (38, -0.107), (39, 0.058), (40, 0.054), (41, -0.085), (42, 0.014), (43, 0.004), (44, 0.081), (45, -0.025), (46, 0.007), (47, 0.017), (48, 0.058), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95475674 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

Author: Edward Snelson, Zoubin Ghahramani

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1

2 0.82885128 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1

3 0.81126666 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1

4 0.56745458 30 nips-2005-Assessing Approximations for Gaussian Process Classification

Author: Malte Kuss, Carl E. Rasmussen

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihoods p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1. Comparing Laplace’s method to EP the latter shows to be more accurate both in terms of error rate and information. While the error rates are relatively similar the predictive distribution obtained by EP shows to be more informative about the test targets. Note that for GPC the error rate only depends of the sign of the mean µ∗ of the approximated posterior over latent functions and not the entire posterior predictive distribution. As to be expected, the length of the mean vector m shows much larger values for the EP approximations. Comparing EP and SVMs the results are mixed. For the Crabs data set all methods show the same error rate but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty. 5 Summary & Conclusions Our experiments reveal serious differences between Laplace’s method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small amplitude, although the sign of the mean function will be mostly correct. As an effect LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been show empirically on several real world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note, the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we’ve seen the the sign of the latent function tends to be correct (at least at the training locations). Laplace EP SVM Data Set m n E% I m E% I m E% I Ionosphere 351 34 8.84 0.591 49.96 7.99 0.661 124.94 5.69 0.681 Wisconsin 683 9 3.21 0.804 62.62 3.21 0.805 84.95 3.21 0.795 Pima Indians 768 8 22.77 0.252 29.05 22.63 0.253 47.49 23.01 0.232 Crabs 200 7 2.0 0.682 112.34 2.0 0.908 2552.97 2.0 0.047 Sonar 208 60 15.36 0.439 26.86 13.85 0.537 15678.55 11.14 0.567 USPS 3 vs 5 1540 256 2.27 0.849 163.05 2.21 0.902 22011.70 2.01 0.918 Table 1: Results for benchmark data sets. The first three columns give the name of the data set, number of observations m and dimension of inputs n. For Laplace’s method and EP the table reports the average error rate E%, the average information I (2) and the average length m of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label. The EP approximation has shown to give results very close to MCMC both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models, when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2]. Acknowledgements Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. This publication only reflects the authors’ views. References [1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996. [2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005. [3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998. [4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001. [5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998. [6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. [7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003. [8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.

5 0.38446301 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers

Author: Juan J. Murillo-fuentes, Sebastian Caro, Fernando Pérez-Cruz

Abstract: In this paper we propose a new receiver for digital communications. We focus on the application of Gaussian Processes (GPs) to the multiuser detection (MUD) in code division multiple access (CDMA) systems to solve the near-far problem. Hence, we aim to reduce the interference from other users sharing the same frequency band. While usual approaches minimize the mean square error (MMSE) to linearly retrieve the user of interest, we exploit the same criteria but in the design of a nonlinear MUD. Since the optimal solution is known to be nonlinear, the performance of this novel method clearly improves that of the MMSE detectors. Furthermore, the GP based MUD achieves excellent interference suppression even for short training sequences. We also include some experiments to illustrate that other nonlinear detectors such as those based on Support Vector Machines (SVMs) exhibit a worse performance. 1

6 0.36574781 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

7 0.36531165 19 nips-2005-Active Learning for Misspecified Models

8 0.36397943 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

9 0.32981855 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

10 0.32132921 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

11 0.31832296 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

12 0.31053972 105 nips-2005-Large-Scale Multiclass Transduction

13 0.29724041 80 nips-2005-Gaussian Process Dynamical Models

14 0.28879395 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

15 0.27985436 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

16 0.27201736 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

17 0.26734942 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

18 0.26651478 62 nips-2005-Efficient Estimation of OOMs

19 0.25482491 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

20 0.24950521 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.044), (10, 0.038), (18, 0.276), (27, 0.036), (31, 0.041), (34, 0.079), (39, 0.01), (41, 0.014), (55, 0.033), (69, 0.092), (73, 0.036), (88, 0.137), (91, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92676914 125 nips-2005-Message passing for task redistribution on sparse graphs

Author: K. Wong, Zhuo Gao, David Tax

Abstract: The problem of resource allocation in sparse graphs with real variables is studied using methods of statistical physics. An efficient distributed algorithm is devised on the basis of insight gained from the analysis and is examined using numerical simulations, showing excellent performance and full agreement with the theoretical results.

2 0.8664881 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

Author: Laurent Zwald, Gilles Blanchard

Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.

same-paper 3 0.77444726 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

Author: Edward Snelson, Zoubin Ghahramani

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1

4 0.60735857 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

Author: Manfred Opper

Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1

5 0.60670811 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1

6 0.58575189 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

7 0.58031112 30 nips-2005-Assessing Approximations for Gaussian Process Classification

8 0.57797396 136 nips-2005-Noise and the two-thirds power Law

9 0.57681435 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

10 0.57451195 45 nips-2005-Conditional Visual Tracking in Kernel Space

11 0.57282043 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

12 0.57214636 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

13 0.57003611 74 nips-2005-Faster Rates in Regression via Active Learning

14 0.5691039 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

15 0.56815523 48 nips-2005-Context as Filtering

16 0.56741709 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

17 0.56724918 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

18 0.56674498 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

19 0.56600904 62 nips-2005-Efficient Estimation of OOMs

20 0.56278688 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models