jmlr jmlr2010 jmlr2010-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
Reference: text
sentIndex sentText sentNum sentScore
1 The key novel idea is to sparsify the spectral representation of the GP. [sent-12, score-0.15]
2 We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. [sent-14, score-0.114]
3 We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. [sent-15, score-0.09]
4 In this paper we introduce a stationary trigonometric Bayesian model for regression that retains the computational efficiency of the aforementioned approaches, while improving performance. [sent-37, score-0.172]
5 The model consists of a linear combination of trigonometric functions where both weights and phases are integrated out. [sent-38, score-0.127]
6 This model is a stationary sparse GP that can approximate any desired stationary full GP. [sent-40, score-0.211]
7 FITC, SMGP, and the model introduced in this paper focus on predictive accuracy at low computational cost, rather than on faithfully converging towards the full GP as the number of basis functions grows. [sent-44, score-0.176]
8 Note that SMGP only extends FITC in the specific case of the anisotropic squared exponential covariance function, whereas FITC can be applied to any covariance function. [sent-58, score-0.214]
9 The properties of the GP prior over functions are governed by the covariance function [K_ff]_{ij} = k(x_i, x_j) = E[f(x_i) f(x_j)] (1), which determines how the similarity between a pair of function values varies as a function of the corresponding pair of inputs. [sent-65, score-0.105]
10 A covariance function is stationary if it only depends on the difference between its inputs, k(x_i, x_j) = k(x_i − x_j) = k(τ). [sent-66, score-0.184]
11 The elegance of the GP framework is that the properties of the function are conveniently expressed directly in terms of the covariance function, rather than implicitly via basis functions. [sent-67, score-0.158]
12 The covariance function is parameterized by hyperparameters. [sent-74, score-0.09]
13 Consider for example the stationary anisotropic squared exponential covariance function k_ARD(τ) = σ_0² exp(−τ^⊤ Λ^{−1} τ / 2), where Λ = diag([ℓ_1², ℓ_2², ..., ℓ_D²]). (3) [sent-75, score-0.19]
14 The hyperparameters are the prior variance σ_0² and the lengthscales {ℓ_d} that determine how rapidly the covariance decays with the distance between inputs. [sent-79, score-0.177]
15 This covariance function is also known as the ARD (Automatic Relevance Determination) squared exponential, because it can effectively prune input dimensions by growing the corresponding lengthscales. [sent-80, score-0.139]
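As a concrete illustration of Equation (3), here is a minimal NumPy sketch of the ARD squared exponential covariance; the function and variable names are ours, not the paper's. Growing the lengthscale of a dimension flattens the kernel along that dimension, which is the pruning effect described above.

```python
import numpy as np

def k_ard(X1, X2, sigma0_sq=1.0, lengthscales=1.0):
    """ARD squared exponential: k(tau) = sigma0^2 * exp(-0.5 * tau' Lambda^{-1} tau),
    with Lambda = diag(lengthscales**2)."""
    A = X1 / lengthscales                      # scale each dimension by its lengthscale
    B = X2 / lengthscales
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return sigma0_sq * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

# Example: covariance matrix for five 2-D inputs; a very large lengthscale in the
# second dimension effectively prunes that dimension.
X = np.random.default_rng(0).standard_normal((5, 2))
K = k_ard(X, X, sigma0_sq=1.0, lengthscales=np.array([0.7, 50.0]))
```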
16 Provided there exist analytic forms for the gradients of the covariance function with respect to the hyperparameters, the evidence (4) can be maximized by using a gradient-based search. [sent-83, score-0.09]
17 Unfortunately, computing the evidence and the gradients requires the inversion of the covariance matrix K_ff + σ_n² I_n at a cost of O(n³) operations, which is prohibitive for large data sets. [sent-84, score-0.09]
18 Trigonometric Bayesian Regression: In this section we present a Bayesian linear regression model with trigonometric basis functions, and relate it to a full GP in the next section. [sent-88, score-0.202]
19 Consider the model f(x) = Σ_{r=1}^{m} [a_r cos(2π s_r^⊤ x) + b_r sin(2π s_r^⊤ x)] (5), where each of the m pairs of basis functions is parametrized by a D-dimensional vector s_r of spectral frequencies. [sent-89, score-0.269]
20 Note that each pair of basis functions shares its frequency, but each has independent amplitude parameters, a_r and b_r. [sent-90, score-0.104]
21 The priors are independent Gaussian, a_r ∼ N(0, σ_0²/m) and b_r ∼ N(0, σ_0²/m), where the variances are scaled down linearly by the number of basis functions. [sent-92, score-0.099]
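A short sketch of the weight-space model in Equation (5) with these scaled priors; the spectral frequencies S are taken as given here (their choice is the subject of the sparse spectrum section below), and all names are illustrative.

```python
import numpy as np

def trig_features(X, S):
    """n x 2m design matrix [cos(2*pi*X S^T), sin(2*pi*X S^T)] for frequencies S (m x D)."""
    P = 2.0 * np.pi * X @ S.T
    return np.hstack([np.cos(P), np.sin(P)])

rng = np.random.default_rng(0)
n, D, m, sigma0_sq = 50, 3, 10, 1.0
X = rng.standard_normal((n, D))
S = rng.standard_normal((m, D))                 # placeholder spectral frequencies s_r
Phi = trig_features(X, S)                       # n x 2m

# One draw from the prior over functions: a_r, b_r ~ N(0, sigma0^2 / m)
w = rng.normal(0.0, np.sqrt(sigma0_sq / m), size=2 * m)
f_prior_sample = Phi @ w
```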
22 In contrast, the covariance function in Equation (6) is stationary, that is, the prior variance is independent of the input and equal to σ_0². [sent-98, score-0.105]
23 This is due to the particular nature of the trigonometric basis functions and implies that the predictive variances cannot be “healed”, as proposed in Rasmussen and Quiñonero-Candela (2005) for the case of the Relevance Vector Machine. [sent-99, score-0.236]
24 Similarly, for the log marginal likelihood: log p(y|θ) = −[y^⊤ y − y^⊤ Φ_f^⊤ A^{−1} Φ_f y]/(2σ_n²) − (1/2) log|A| + m log(m σ_n²/σ_0²) − (n/2) log(2π σ_n²). (8) [sent-105, score-0.113]
25 Both the predictive distribution and the marginal likelihood can be computed in O(nm²). [sent-107, score-0.096]
26 The storage costs are also reduced, since we no longer store the full covariance matrix (of size n × n), but only the design matrix (of size n × 2m). [sent-109, score-0.118]
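The O(nm²) cost comes from working with the 2m × 2m matrix A = Φ_f Φ_f^⊤ + (mσ_n²/σ_0²) I rather than with the n × n covariance. Below is a hedged NumPy sketch of the negative log marginal likelihood along these lines; it is a reconstruction from the model as stated in Equations (5)–(8), not the authors' code, and the names are ours.

```python
import numpy as np

def neg_log_marglik(Phi, y, sigma_n_sq, sigma0_sq):
    """-log p(y | theta) for the trigonometric model; Phi is the n x 2m design matrix.
    Building A and its Cholesky factor costs O(n m^2) + O(m^3), never O(n^3)."""
    n, two_m = Phi.shape
    m = two_m // 2
    A = Phi.T @ Phi + (m * sigma_n_sq / sigma0_sq) * np.eye(two_m)
    L = np.linalg.cholesky(A)
    v = np.linalg.solve(L, Phi.T @ y)           # so that v @ v = y' Phi A^{-1} Phi' y
    nll = (y @ y - v @ v) / (2.0 * sigma_n_sq)
    nll += np.sum(np.log(np.diag(L)))           # 0.5 * log|A|
    nll += -m * np.log(m * sigma_n_sq / sigma0_sq) + 0.5 * n * np.log(2.0 * np.pi * sigma_n_sq)
    return nll
```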
27 Thus, the model based on trigonometric basis functions has practical use for modeling non-periodic functions. [sent-113, score-0.157]
28 Representation: An alternative and equivalent representation of the model in Equation (5), which uses only half the number of trigonometric basis functions, is possible by writing the linear combination of a sine and a cosine as a single cosine with an amplitude and a phase. [sent-117, score-0.255]
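The identity behind this half-size representation is standard trigonometry; written out in our notation:

```latex
a_r \cos(2\pi s_r^\top x) + b_r \sin(2\pi s_r^\top x)
  = c_r \cos\!\left(2\pi s_r^\top x - \varphi_r\right),
\qquad c_r = \sqrt{a_r^2 + b_r^2}, \quad \varphi_r = \operatorname{atan2}(b_r, a_r).
```

So m cosines with explicit amplitudes c_r and phases φ_r span the same function class as the m cosine and sine pairs.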
29 The Sparse Spectrum Gaussian Process: In the previous section we presented an explicit basis function regression model, but we did not discuss how to select the frequencies defining the basis functions. [sent-127, score-0.199]
30 In this section, we present a sparse GP approximation view of this model, which shows how it can be understood as a computationally efficient approximation to any GP with stationary covariance function. [sent-128, score-0.207]
31 We will now take a generic GP with stationary covariance function and sparsify its power spectral density to obtain a sparse GP that approximates the full GP. [sent-130, score-0.408]
32 The power spectral density (or power spectrum) S(s) of a stationary random process expresses how the power is distributed over the frequency domain. [sent-131, score-0.297]
33 For a stationary GP, the power is equal to the prior variance, k(x, x) = k(0) = σ_0². [sent-132, score-0.104]
34 In our case, given that f(·) is drawn from a stationary Gaussian process, the autocorrelation function is equal to the stationary covariance function, and we have S(s) = ∫ e^{−2πi s^⊤ τ} k(τ) dτ. [sent-137, score-0.222]
35 Bochner's theorem states that any stationary covariance function k(τ) can be represented as the Fourier transform of a positive finite measure. [sent-140, score-0.156]
36 The proportionality constant can be directly obtained by evaluating the covariance function in (9) at τ = 0. [sent-142, score-0.09]
37 This last expression is an exact expansion of the covariance function as the expectation of a product of complex exponentials with respect to a particular distribution over their frequencies. [sent-146, score-0.09]
38 This integral can be approximated by simple Monte Carlo by taking an average of a few samples corresponding to a finite set of frequencies, which we call spectral points. [sent-147, score-0.128]
39 Since the power spectrum is symmetric around zero, a valid Monte Carlo procedure is to sample frequencies always as a pair {s_r, −s_r}. [sent-148, score-0.132]
40 Notice that we have recovered exactly the expression for the covariance function induced by the trigonometric basis functions model, Equation (6). [sent-150, score-0.247]
41 Further, we have given an interpretation of the frequencies as spectral Monte Carlo samples, approximating any stationary covariance function. [sent-151, score-0.33]
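A sketch of this Monte Carlo view for the one-dimensional squared exponential covariance: with the e^{−2πi s^⊤ τ} convention used above, its normalized spectral density is Gaussian with standard deviation 1/(2πℓ), which is the assumption baked into the sampler below; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, sigma0_sq, m = 1.0, 1.0, 50        # lengthscale, prior variance, number of spectral points

# Spectral points drawn from the normalized SE power spectrum: s_r ~ N(0, (1/(2*pi*ell))^2)
s = rng.normal(0.0, 1.0 / (2.0 * np.pi * ell), size=m)

tau = np.linspace(-8.0, 8.0, 401)
k_exact = sigma0_sq * np.exp(-0.5 * (tau / ell) ** 2)
# Symmetric-pair Monte Carlo estimate: average of cosines over the sampled frequencies
k_mc = sigma0_sq * np.mean(np.cos(2.0 * np.pi * np.outer(tau, s)), axis=1)

print(np.max(np.abs(k_exact - k_mc)))   # error shrinks roughly as 1/sqrt(m)
```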
42 This convergence result can also be stated as follows: a stationary GP can be seen as a neural network with infinitely many hidden units and trigonometric activations if independent priors following Gaussian and p_S(s) distributions are placed on the output and input weights, respectively. [sent-156, score-0.155]
43 This is analogous to the result of Williams (1997) for the non-stationary multilayer perceptron covariance function. [sent-157, score-0.09]
45 Figure 1: Squared exponential covariance function and its approximation with (a) 10 and (b) 50 random spectral points, respectively (panels plot k(τ) against τ; legend: SE covariance function, spectral points approximation).
46 For illustration purposes, we compare the exact squared exponential covariance function with its spectral approximation in Figure 1, where the spectral points are sampled from Equation (12). [sent-177, score-0.411]
47 The SSGP Algorithm: One of the main goals of sparse approximations is to reduce the computational burden while retaining as much predictive accuracy as possible. [sent-180, score-0.131]
48 Sampling from the spectral density constitutes a way of building a sparse approximation. [sent-181, score-0.179]
49 However, we may suspect that we can obtain much sparser models if the spectral frequencies are learned by optimizing the marginal likelihood, an idea which we pursue in the following. [sent-182, score-0.192]
50 The algorithm we propose uses conjugate gradients to optimize the marginal likelihood (8) with respect to the spectral points {s_r} and the hyperparameters σ_n², σ_0², and {ℓ_1, ℓ_2, ..., ℓ_D}. [sent-183, score-0.216]
51 Optimizing with respect to the lengthscales in addition to the spectral points is effectively an overparametrization, but in our experience this redundancy proves helpful in avoiding undesired local minima. [sent-187, score-0.207]
52 At prediction time, the cost is O(m) for the predictive mean and O(m²) for the predictive variance per test point. [sent-201, score-0.142]
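A heavily simplified sketch of this procedure. The paper uses conjugate gradients with analytic derivatives; the version below folds the earlier pieces into a single objective and hands it to scipy.optimize.minimize with numerical gradients, purely to make the parametrization (log noise and signal variances, log lengthscales, free spectral points) concrete. Treating the lengthscales as a rescaling of the inputs is our assumption, not a statement of the authors' exact parametrization.

```python
import numpy as np
from scipy.optimize import minimize

def trig_features(X, S):
    P = 2.0 * np.pi * X @ S.T
    return np.hstack([np.cos(P), np.sin(P)])

def neg_log_marglik(theta, X, y, m, D):
    # theta = [log sigma_n^2, log sigma_0^2, log ell_1..D, m*D free spectral points]
    sn2, s02 = np.exp(theta[0]), np.exp(theta[1])
    ell = np.exp(theta[2:2 + D])
    S = theta[2 + D:].reshape(m, D)
    Phi = trig_features(X / ell, S)     # assumption: lengthscales act by rescaling the inputs
    n = X.shape[0]
    A = Phi.T @ Phi + (m * sn2 / s02) * np.eye(2 * m)
    L = np.linalg.cholesky(A)
    v = np.linalg.solve(L, Phi.T @ y)
    nll = (y @ y - v @ v) / (2.0 * sn2) + np.sum(np.log(np.diag(L)))
    return nll - m * np.log(m * sn2 / s02) + 0.5 * n * np.log(2.0 * np.pi * sn2)

rng = np.random.default_rng(2)
n, D, m = 200, 2, 10
X = rng.standard_normal((n, D))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

theta0 = np.concatenate([[np.log(0.01), np.log(1.0)], np.zeros(D),
                         rng.standard_normal(m * D) / (2.0 * np.pi)])
res = minimize(neg_log_marglik, theta0, args=(X, y, m, D), method="L-BFGS-B")
```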
53 Learning the spectral frequencies by optimization departs from the original motivation of approximating a full GP. [sent-203, score-0.202]
54 However, the additional flexibility can potentially improve performance since it allows learning a covariance function suitable to the problem at hand. [sent-205, score-0.09]
55 Comparing Predictive Distributions for SSGP and FITC: Whereas SSGP relies on a sparse approximation to the spectrum, the FITC approximation is sparse in a spatial sense: a set of pseudo-inputs is used as an information bottleneck. [sent-230, score-0.102]
56 The only evaluations of the covariance function allowed are those involving a function value at a pseudo-input. [sent-231, score-0.09]
57 For a set of m pseudo-inputs the computational complexity of FITC is of the same order as that of SSGP with m spectral points. [sent-232, score-0.128]
58 The covariance function induced by FITC has a constant prior variance, but it is not stationary. [sent-233, score-0.105]
59 The original covariance of the full GP is only approximated faithfully in the vicinity of the pseudo-inputs, and the covariance between any two function values that are both far apart from any pseudo-input is essentially approximated as zero. [sent-234, score-0.244]
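A sketch of the behaviour described here, using the standard FITC construction (low-rank term through the pseudo-inputs plus a diagonal correction); the exact conventions are assumptions based on the cited FITC literature, not code from this paper.

```python
import numpy as np

def k_se(A, B, ell=1.0, s2=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return s2 * np.exp(-0.5 * d2 / ell**2)

def k_fitc(X1, X2, U, ell=1.0, s2=1.0, jitter=1e-8):
    """Covariance induced by FITC: low-rank term through pseudo-inputs U,
    plus a diagonal correction that restores the exact prior variance."""
    Kuu = k_se(U, U, ell, s2) + jitter * np.eye(len(U))
    Q = k_se(X1, U, ell, s2) @ np.linalg.solve(Kuu, k_se(U, X2, ell, s2))
    if X1 is X2:
        Q = Q + np.diag(np.diag(k_se(X1, X1, ell, s2) - Q))
    return Q

U = np.linspace(-1.0, 1.0, 5)[:, None]       # pseudo-inputs near the origin
far = np.array([[6.0], [7.0]])               # two inputs far from every pseudo-input

print(np.diag(k_fitc(far, far, U)))          # prior variances: still ~s2 (constant)
print(k_fitc(far, far, U)[0, 1])             # cross-covariance: ~0, although k_se gives exp(-0.5)
```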
62 Figure 3: Kin-40k data set. (a) NMSE and (b) MNLP as a function of the number of basis functions (legend: FITC, SMGP, SSGP with fixed spectral points, SSGP, and the full GP on 10000 data points).
64 Figure 4: Pumadyn-32nm data set. (a) NMSE and (b) MNLP as a function of the number of basis functions (legend: FITC, SMGP, SSGP with fixed spectral points, SSGP, and the full GP on 7168 data points).
65 SSGP is given 20 fixed spectral points sampled from the spectrum of a squared exponential covariance function, and FITC is given 40 fixed pseudo-inputs sampled uniformly from the range of the training inputs. [sent-272, score-0.346]
66 Despite the different nature of the approximations, the figure shows that for an equal number of basis functions both predictive distributions are qualitatively very similar: the uncertainty grows away from the training data. [sent-275, score-0.131]
67 For FITC this is equivalent to the number of pseudoinputs, whereas for SSGP a spectral point corresponds to two basis functions. [sent-286, score-0.196]
68 The SSGP with fixed spectral points is inferior, showing that a better sparsity vs. accuracy trade-off can be achieved by optimizing the spectral points. [sent-295, score-0.159]
70 SSGP matches the full GP for a surprisingly small number of basis functions. [sent-301, score-0.096]
71 SSGP is superior in terms of NMSE, getting very close to the full GP for more than 200 basis functions. [sent-315, score-0.096]
72 We also see that SSGP with fixed spectral points is uniformly worse. [sent-318, score-0.159]
73 In practice the spectral points are sampled from the spectral density of a squared exponential covariance function, and scaled as the lengthscales adapt. [sent-320, score-0.459]
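A sketch of this "fixed spectral points" variant as we read it: a set of standard normal draws is kept fixed and rescaled by 1/(2πℓ_d), so the points always follow the squared exponential spectral density for the current lengthscales; the exact scaling convention is an assumption consistent with Equation (12).

```python
import numpy as np

rng = np.random.default_rng(3)
m, D = 20, 4
Z = rng.standard_normal((m, D))      # drawn once, then kept fixed across training

def fixed_spectral_points(Z, lengthscales):
    # s_r = z_r / (2*pi*ell): samples from the ARD-SE spectral density,
    # rescaled deterministically whenever the lengthscales are updated
    return Z / (2.0 * np.pi * np.asarray(lengthscales, dtype=float))

S_initial = fixed_spectral_points(Z, np.ones(D))
S_adapted = fixed_spectral_points(Z, [0.5, 1.0, 2.0, 4.0])   # after the lengthscales adapt
```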
76 Figure 5: Pole Telecomm data set. (a) NMSE and (b) MNLP as a function of the number of basis functions (legend: FITC, SMGP, SSGP with fixed spectral points, SSGP, and the full GP on 10000 data points).
77 SSGP consistently outperforms FITC and SMGP and gets very close to the full GP using a very low number of basis functions. [sent-339, score-0.096]
78 The large NMSE average errors incurred by SSGP with fixed spectral points for small numbers of basis functions are due to outliers that are present in a small number (about 10 out of 7847) of the test inputs, in some of the 10 repeated runs. [sent-340, score-0.227]
79 The predictive variances for these few points are also big, so their impact on the MNLP score is small. [sent-341, score-0.11]
80 Whereas the performance of the two variants is comparable for small numbers of basis functions, the cosine-only representation becomes worse when the number of basis functions gets larger, confirming our suspicion that optimization of the phases increases the risk of overfitting. [sent-348, score-0.22]
83 Figure 6: Elevators data set. (a) NMSE and (b) MNLP as a function of the number of basis functions (legend: FITC, SMGP, SSGP with fixed spectral points, SSGP, and the full GP on 8752 data points).
84 The Pendulum Data Set: So far we have seen data sets where SSGP consistently outperforms FITC and SMGP, and often approaches the performance of a full GP for quite small numbers of basis functions. [sent-361, score-0.096]
85 Note that we use up to 800 basis functions for investigation, although for computational reasons it would make sense to use the full GP rather than an approximation with more than 315 basis functions. [sent-366, score-0.164]
86 A closer inspection shows that while the mean predictions are quite accurate, the predictive variances are excessively small. [sent-368, score-0.095]
87 Note, that the SSGP with fixed spectral points seems to suffer much less from this effect, as would be expected. [sent-370, score-0.159]
88 Figure 7: Pole Telecomm data set.
89 (a) NMSE and (b) MNLP as a function of the number of basis functions, comparing SSGP with the version with cosines only and explicit phases. [sent-385, score-0.093]
90 However, we found a small data set where SSGP badly fails, with good predictive means but with overconfident predictive variances. [sent-393, score-0.126]
91 Other algorithms, such as the variational approach of Titsias (2009), which focus on approaching the full GP in the limit of large numbers of basis functions, are to a large degree safeguarded from overfitting. [sent-395, score-0.096]
93 Figure 8: Pendulum data set. (a) NMSE and (b) MNLP as a function of the number of basis functions (legend: FITC, SMGP, SSGP with fixed spectral points, SSGP, and the full GP on 315 data points).
94 An equivalent view of SSGP is as a sparse Bayesian linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point. [sent-406, score-0.383]
95 The weights are integrated out, and at the price of having two basis functions per frequency, the phases are effectively integrated out as well. [sent-407, score-0.106]
96 We have shown that although a representation in terms of a single basis function per frequency and an explicit phase is possible, learning the phases poses an increased risk of overfitting. [sent-408, score-0.156]
97 If the spectral points are sampled from the power spectrum of a stationary GP, then SSGP approximates its covariance function. [sent-409, score-0.401]
98 However, much sparser solutions can be achieved by learning the spectral points, which effectively implies learning the covariance function. [sent-410, score-0.218]
99 The SSGP model is to the best of our knowledge the only sparse GP approximation that induces a stationary covariance function. [sent-411, score-0.207]
100 The main differences between SSGP and most previous approaches to sparse GP regression are the stationarity of the prior and the non-local nature of the basis functions. [sent-416, score-0.151]
wordName wordTfidf (topN-words)
[('ssgp', 0.725), ('fitc', 0.347), ('gp', 0.286), ('smgp', 0.213), ('spectral', 0.128), ('mnlp', 0.116), ('nmse', 0.107), ('qui', 0.091), ('covariance', 0.09), ('trigonometric', 0.089), ('andela', 0.077), ('idal', 0.077), ('igueiras', 0.077), ('kff', 0.077), ('pectrum', 0.077), ('redilla', 0.077), ('zaro', 0.077), ('basis', 0.068), ('stationary', 0.066), ('rocess', 0.066), ('asmussen', 0.066), ('onero', 0.066), ('predictive', 0.063), ('spectrum', 0.063), ('egression', 0.059), ('sr', 0.058), ('snelson', 0.058), ('fixed', 0.057), ('rasmussen', 0.055), ('ps', 0.052), ('aussian', 0.051), ('sparse', 0.051), ('lengthscales', 0.048), ('frequencies', 0.046), ('madrid', 0.041), ('elevators', 0.039), ('kf', 0.039), ('pendulum', 0.039), ('telecomm', 0.039), ('phases', 0.038), ('gaussian', 0.038), ('gps', 0.037), ('parse', 0.036), ('squared', 0.034), ('frequency', 0.034), ('amplitudes', 0.033), ('points', 0.031), ('cosine', 0.03), ('isr', 0.029), ('kard', 0.029), ('walder', 0.029), ('full', 0.028), ('inputs', 0.028), ('pole', 0.027), ('joaquin', 0.025), ('miguel', 0.025), ('cosines', 0.025), ('universidad', 0.025), ('ui', 0.025), ('cos', 0.024), ('hyperparameters', 0.024), ('power', 0.023), ('rahimi', 0.022), ('sparsify', 0.022), ('fourier', 0.022), ('se', 0.022), ('williams', 0.021), ('seeger', 0.021), ('amplitude', 0.021), ('carlos', 0.021), ('bal', 0.019), ('comunicaciones', 0.019), ('legan', 0.019), ('overcon', 0.019), ('pseudoinputs', 0.019), ('teor', 0.019), ('tsc', 0.019), ('urtasun', 0.019), ('marginal', 0.018), ('targets', 0.017), ('regression', 0.017), ('approximations', 0.017), ('titsias', 0.017), ('csat', 0.017), ('sine', 0.017), ('sinc', 0.017), ('faithfully', 0.017), ('departamento', 0.017), ('variances', 0.016), ('risk', 0.016), ('log', 0.016), ('mean', 0.016), ('dimensions', 0.015), ('bayesian', 0.015), ('likelihood', 0.015), ('br', 0.015), ('sin', 0.015), ('ds', 0.015), ('prior', 0.015), ('asterisk', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
2 0.24435462 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
Author: Carl Edward Rasmussen, Hannes Nickisch
Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series1 models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straight forward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend. ∗. Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 T¨ bingen, Germany. u 1. Note, that here we typically think of GPs with a more general index set than time. ©2010 Carl Edward Rasmussen and Hannes Nickisch. R ASMUSSEN AND N ICKISCH 1. Implementation The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x2 and GNU Octave 3.2.x.3 Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way4 such that Cholesky decompositions are computed of well-conditioned matrices only. 
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached. We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as main interface to the user—it can make inference and predictions and allows the mean, covariance and likelihood function as well as the inference methods to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG)5 and the other one relies on a quasi-Newton scheme.6 ∂ Computing the derivatives w.r.t. hyperparameters ∂θi ln Z with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m. Our documentation comes in two pieces: a hypertext user documentation7 doc/index.html with examples and code browsing and a technical documentation8 doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get his data analysed, however a power user will consult the pdf document once he wants to include his own mean, covariance, likelihood and inference routines or learn about implementation details. 2. 3. 4. 5. 6. 7. 8. Matlab is available from MathWorks, http://www.mathworks.com/. Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/. We do not consider the “blind” addition of a “small ridge” to K a principled way. Carl Rasmussen’s code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/. Peter Carbonetto’s wrapper can be found at http://www.cs.ubc.ca/˜pcarbo/lbfgsb-for-matlab.html. Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html. Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf. 3012 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX 2. The GPML Toolbox We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ∼ GP (mφ (x), kψ (x, x′ )), consists of a mean function m(x) = E[ f (x)], and a covariance function k(x, x) = E[( f (x) − m(x))( f (x′ ) − m(x′ ))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. 
We generally assume independent observations, that is, input/output pairs (xi , yi ) of f with joint likelihood Pρ (y|f) = ∏n Pρ (yi | f (xi )) factorising over cases. Finally, after specification of the prior and i=1 fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases. 1 2 3 4 5 6 7 % 1) SET UP THE GP : COVARIANCE ; MEAN , LIKELIHOOD , INFERENCE METHOD mf = { ’ meanSum ’ ,{ ’ meanLinear ’, @meanConst }}; a = 2; b = 1; % m(x) = a*x+b cf = { ’ covSEiso ’}; sf = 1; ell = 0.7; % squared exponential covariance funct lf = ’ likLaplace ’; sn = 0.2; % assume Laplace noise with variance sn ˆ2 hyp0 . mean = [a;b ]; hyp0 . cov = log([ ell ; sf ]); hyp0 . lik = log( sn ); % hypers inf = ’ infEP ’; % specify expectation propagation as inference method % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt . hyp ; do 50 CG steps Ncg = 50; [hyp , nlZ ] = minimize ( hyp0 , ’gp ’, -Ncg , inf , mf , cf , lf , X , y ); % 3) PREDICT AT UNKNOWN TEST INPUTS [ymu , ys2 ] = gp (hyp , inf , mf , cf , lf , X , y , Xs ); % test input Xs In line 1, we specify the mean mφ (x) = a⊤ x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function, happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): As the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m) it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑ j m j (x), products m(x) = ∏ j m j (x), scaled versions m(x) = αm0 (x) and powers m(x) = m0 (x)d . This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings ’meanConst’ or alternatively function handles @meanConst. The order of components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = mφ (X) and first derivative ∂ computation mi = ∂φi mφ (X) on a data set X. In the same spirit, the squared exponential covariance kψ (x, x′ ) = σ f ² exp(− x − x′ 2 /2ℓ2 ) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σ f } is set up in line 2. Note, that the hyperparameters are represented by the logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Mat´ rn, rational quadratic, polynomial, periodic, neural network and finite support coe variance functions. Composite covariance functions allow for sums k(x, x′ ) = ∑ j k j (x, x′ ), products k(x, x′ ) = ∏ j k j (x, x′ ), positive scaling k(x, x′ ) = σ2 k0 (x, x′ ) and masking of components f k(x, x′ ) = k0 (xI , x′ ) with I ⊆ [1, 2, .., D], x ∈ RD . Again, the interface is simple since only the I ∂ evaluation of the covariance matrix K = kψ (X) and its derivatives ∂i K = ∂ψi kψ (X) on a data set X are required. Furthermore, we need cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction. There are no restrictions on the composition of both mean and covariance functions—any combination is allowed including nested composition. 
3013 R ASMUSSEN AND N ICKISCH √ √ The Laplace (lik/likLaplace.m) likelihood Pρ (y| f ) = exp(− 2/σn |y − f |)/ 2σn with hyperparameters ρ = {ln σn } is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student’s t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments. All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6 allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood ZEP (θ) as done by gp.m. Here, D = (X, y) ≡ (X,y) is the training data where X = {x1 , .., xn } and y ∈ Rn . Under the hood, gp.m computes in every step a Gaussian ∂ posterior approximation and the derivatives ∂θ ln ZEP (θ) of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗ |D , X∗ )] ≡ ymu and the predictive variance V[P(y∗ |D , X∗ )] ≡ ys2. Likelihood \ Inference Gaussian Sech-squared Laplacian Student’s t Error function Logistic function Exact FITC EP Laplace VB Type, Output Domain regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} Alternate Name logistic distribution double exponential probit regression logit regression Table 1: Likelihood ↔ inference compatibility in the GPML toolbox Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student’s t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and LA. If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood where ˜ instead of the exact covariance matrix K, a low-rank plus diagonal matrix K = Q + diag(K − Q) ⊤ K−1 K is used. The matrices K and K contain covariances and cross-covariances where Q = Ku uu u uu u of and between inducing inputs ui and data points x j . Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n. Acknowledgments Thanks to Ed Snelson for assisting with the FITC approximation. 3014 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX References Joaquin Qui˜ onero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gausn sian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005. 
Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000. Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001. Hannes Nickisch and Carl E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 10 2008. Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006. Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2006. Christopher K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(20):1342–1351, 1998. 3015
3 0.057825875 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
4 0.047689829 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.035562623 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Methods based on ℓ1 -relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2 -error of a Lasso estimate decays at the minimax optimal rate k log p , where k is the sparsity of the p-dimensional regression problem with additive n Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1 -relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure
6 0.028836243 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
7 0.028390698 82 jmlr-2010-On Learning with Integral Operators
8 0.026956528 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
9 0.025821202 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
10 0.024475738 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
11 0.02361347 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
12 0.023059426 48 jmlr-2010-How to Explain Individual Classification Decisions
13 0.022820262 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
14 0.022707127 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
15 0.022541037 84 jmlr-2010-On Spectral Learning
16 0.021982932 15 jmlr-2010-Approximate Tree Kernels
17 0.021697551 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
18 0.021355346 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
19 0.020907102 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
20 0.020855097 44 jmlr-2010-Graph Kernels
topicId topicWeight
[(0, -0.114), (1, -0.023), (2, -0.008), (3, -0.035), (4, -0.124), (5, -0.0), (6, 0.108), (7, -0.127), (8, -0.139), (9, -0.003), (10, -0.041), (11, -0.082), (12, -0.233), (13, -0.091), (14, 0.134), (15, 0.245), (16, -0.023), (17, 0.106), (18, 0.34), (19, -0.239), (20, -0.035), (21, 0.088), (22, 0.119), (23, -0.075), (24, 0.209), (25, -0.046), (26, 0.178), (27, -0.193), (28, 0.017), (29, 0.085), (30, -0.035), (31, -0.06), (32, -0.089), (33, -0.007), (34, -0.011), (35, 0.021), (36, -0.011), (37, -0.13), (38, 0.139), (39, -0.025), (40, -0.003), (41, 0.091), (42, -0.031), (43, 0.015), (44, -0.118), (45, -0.011), (46, -0.034), (47, -0.058), (48, -0.054), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.9454934 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
2 0.83178258 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
Author: Carl Edward Rasmussen, Hannes Nickisch
Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series1 models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straight forward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend. ∗. Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 T¨ bingen, Germany. u 1. Note, that here we typically think of GPs with a more general index set than time. ©2010 Carl Edward Rasmussen and Hannes Nickisch. R ASMUSSEN AND N ICKISCH 1. Implementation The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x2 and GNU Octave 3.2.x.3 Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way4 such that Cholesky decompositions are computed of well-conditioned matrices only. 
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached. We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as main interface to the user—it can make inference and predictions and allows the mean, covariance and likelihood function as well as the inference methods to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG)5 and the other one relies on a quasi-Newton scheme.6 ∂ Computing the derivatives w.r.t. hyperparameters ∂θi ln Z with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m. Our documentation comes in two pieces: a hypertext user documentation7 doc/index.html with examples and code browsing and a technical documentation8 doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get his data analysed, however a power user will consult the pdf document once he wants to include his own mean, covariance, likelihood and inference routines or learn about implementation details. 2. 3. 4. 5. 6. 7. 8. Matlab is available from MathWorks, http://www.mathworks.com/. Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/. We do not consider the “blind” addition of a “small ridge” to K a principled way. Carl Rasmussen’s code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/. Peter Carbonetto’s wrapper can be found at http://www.cs.ubc.ca/˜pcarbo/lbfgsb-for-matlab.html. Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html. Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf. 3012 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX 2. The GPML Toolbox We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ∼ GP (mφ (x), kψ (x, x′ )), consists of a mean function m(x) = E[ f (x)], and a covariance function k(x, x) = E[( f (x) − m(x))( f (x′ ) − m(x′ ))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. 
We generally assume independent observations, that is, input/output pairs (xi , yi ) of f with joint likelihood Pρ (y|f) = ∏n Pρ (yi | f (xi )) factorising over cases. Finally, after specification of the prior and i=1 fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases. 1 2 3 4 5 6 7 % 1) SET UP THE GP : COVARIANCE ; MEAN , LIKELIHOOD , INFERENCE METHOD mf = { ’ meanSum ’ ,{ ’ meanLinear ’, @meanConst }}; a = 2; b = 1; % m(x) = a*x+b cf = { ’ covSEiso ’}; sf = 1; ell = 0.7; % squared exponential covariance funct lf = ’ likLaplace ’; sn = 0.2; % assume Laplace noise with variance sn ˆ2 hyp0 . mean = [a;b ]; hyp0 . cov = log([ ell ; sf ]); hyp0 . lik = log( sn ); % hypers inf = ’ infEP ’; % specify expectation propagation as inference method % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt . hyp ; do 50 CG steps Ncg = 50; [hyp , nlZ ] = minimize ( hyp0 , ’gp ’, -Ncg , inf , mf , cf , lf , X , y ); % 3) PREDICT AT UNKNOWN TEST INPUTS [ymu , ys2 ] = gp (hyp , inf , mf , cf , lf , X , y , Xs ); % test input Xs In line 1, we specify the mean mφ (x) = a⊤ x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function, happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): As the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m) it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑ j m j (x), products m(x) = ∏ j m j (x), scaled versions m(x) = αm0 (x) and powers m(x) = m0 (x)d . This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings ’meanConst’ or alternatively function handles @meanConst. The order of components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = mφ (X) and first derivative ∂ computation mi = ∂φi mφ (X) on a data set X. In the same spirit, the squared exponential covariance kψ (x, x′ ) = σ f ² exp(− x − x′ 2 /2ℓ2 ) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σ f } is set up in line 2. Note, that the hyperparameters are represented by the logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Mat´ rn, rational quadratic, polynomial, periodic, neural network and finite support coe variance functions. Composite covariance functions allow for sums k(x, x′ ) = ∑ j k j (x, x′ ), products k(x, x′ ) = ∏ j k j (x, x′ ), positive scaling k(x, x′ ) = σ2 k0 (x, x′ ) and masking of components f k(x, x′ ) = k0 (xI , x′ ) with I ⊆ [1, 2, .., D], x ∈ RD . Again, the interface is simple since only the I ∂ evaluation of the covariance matrix K = kψ (X) and its derivatives ∂i K = ∂ψi kψ (X) on a data set X are required. Furthermore, we need cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction. There are no restrictions on the composition of both mean and covariance functions—any combination is allowed including nested composition. 
3013 R ASMUSSEN AND N ICKISCH √ √ The Laplace (lik/likLaplace.m) likelihood Pρ (y| f ) = exp(− 2/σn |y − f |)/ 2σn with hyperparameters ρ = {ln σn } is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student’s t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments. All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6 allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood ZEP (θ) as done by gp.m. Here, D = (X, y) ≡ (X,y) is the training data where X = {x1 , .., xn } and y ∈ Rn . Under the hood, gp.m computes in every step a Gaussian ∂ posterior approximation and the derivatives ∂θ ln ZEP (θ) of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗ |D , X∗ )] ≡ ymu and the predictive variance V[P(y∗ |D , X∗ )] ≡ ys2. Likelihood \ Inference Gaussian Sech-squared Laplacian Student’s t Error function Logistic function Exact FITC EP Laplace VB Type, Output Domain regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} Alternate Name logistic distribution double exponential probit regression logit regression Table 1: Likelihood ↔ inference compatibility in the GPML toolbox Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student’s t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and LA. If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood where ˜ instead of the exact covariance matrix K, a low-rank plus diagonal matrix K = Q + diag(K − Q) ⊤ K−1 K is used. The matrices K and K contain covariances and cross-covariances where Q = Ku uu u uu u of and between inducing inputs ui and data points x j . Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n. Acknowledgments Thanks to Ed Snelson for assisting with the FITC approximation. 3014 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX References Joaquin Qui˜ onero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gausn sian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005. 
3 0.23291953 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.16889444 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
5 0.16321591 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
6 0.15982413 109 jmlr-2010-Stochastic Composite Likelihood
7 0.14719528 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
8 0.14157903 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
9 0.12094102 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
10 0.11538558 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
11 0.11528078 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
12 0.11040129 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
13 0.10979784 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
14 0.10942154 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scale ℓ1-Regularized Logistic Regression
15 0.10496262 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.10273582 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
17 0.098969959 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
18 0.096330598 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
19 0.093004957 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
20 0.092041686 84 jmlr-2010-On Spectral Learning
topicId topicWeight
[(3, 0.02), (8, 0.012), (21, 0.01), (29, 0.381), (32, 0.041), (36, 0.042), (37, 0.165), (75, 0.112), (81, 0.015), (85, 0.046), (96, 0.013), (97, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.66791028 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
2 0.47954839 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Methods based on ℓ1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2-error of a Lasso estimate decays at the minimax optimal rate (k log p)/n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1-relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure
3 0.4784897 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
4 0.47185645 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin
Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization
5 0.4711144 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
6 0.4641827 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
7 0.40479892 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
8 0.40332711 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
9 0.39999062 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
10 0.38659963 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
11 0.37912601 109 jmlr-2010-Stochastic Composite Likelihood
12 0.37769306 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
13 0.37450483 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
14 0.37417492 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
15 0.3738358 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
16 0.37328875 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.37053248 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
18 0.36760724 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for the ℓq Loss in ℓr Balls
20 0.36700979 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels