Gaussian Processes for Machine Learning (GPML) Toolbox

Carl Edward Rasmussen[*] and Hannes Nickisch
Abstract

The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported, including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace's method for dealing with non-Gaussian likelihoods, and FITC for dealing with large regression tasks.

Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification

Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series[1] models. Predictions of GP models come with a natural confidence measure: predictive error-bars.

Although the implementation of the basic principles in the simplest case is straightforward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace's approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In the case of large training sets, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed.

The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend.

[*] Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 Tübingen, Germany.
[1] Note that here we typically think of GPs with a more general index set than time.

1. Implementation

The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also from http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean and likelihood functions as well as inference methods, we offer full compatibility with both Matlab 7.x[2] and GNU Octave 3.2.x[3]. Special attention has been given to properly disentangling covariance, likelihood and mean hyperparameters. Care has also been taken to avoid numerical inaccuracies, for example, through safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular, making its naive inversion numerically unsafe. We handle these situations in a principled way[4] such that Cholesky decompositions are computed of well-conditioned matrices only.
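As a minimal sketch of this pattern (not the toolbox's internal code), exact GP regression with a zero mean and Gaussian noise can be written entirely in terms of a Cholesky factor of K + sn^2*I: the noise variance keeps the factorised matrix well conditioned and no explicit matrix inverse is ever formed. Here covfun is a hypothetical helper returning the covariance matrix, and X, y and the noise level sn are assumed to be given.

    n     = size(X,1);                      % training inputs X and targets y assumed given
    K     = covfun(X);                      % hypothetical n-by-n covariance matrix
    L     = chol(K + sn^2*eye(n),'lower');  % Cholesky of the well-conditioned K + sn^2*I
    alpha = L'\(L\y);                       % (K + sn^2*I)\y via two triangular solves
    nlZ   = y'*alpha/2 + sum(log(diag(L))) + n*log(2*pi)/2;  % neg. log marginal likelihood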
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached.

We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as the main interface to the user: it can perform inference and prediction and allows the mean, covariance and likelihood functions as well as the inference method to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG)[5] and the other relies on a quasi-Newton scheme[6]. Computing the derivatives ∂ ln Z/∂θi with respect to the hyperparameters using gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m.

Our documentation comes in two pieces: a hypertext user documentation[7] doc/index.html with examples and code browsing, and a technical documentation[8] doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get their data analysed, whereas a power user will consult the pdf document when they want to include their own mean, covariance, likelihood and inference routines or learn about implementation details.

[2] Matlab is available from MathWorks, http://www.mathworks.com/.
[3] Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/.
[4] We do not consider the "blind" addition of a "small ridge" to K a principled way.
[5] Carl Rasmussen's code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/.
[6] Peter Carbonetto's wrapper can be found at http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html.
[7] Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html.
[8] Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf.

2. The GPML Toolbox

We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ∼ GP(mφ(x), kψ(x, x′)) consists of a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data.
We generally assume independent observations, that is, input/output pairs (xi, yi) of f with a joint likelihood Pρ(y|f) = ∏_{i=1}^{n} Pρ(yi|f(xi)) factorising over cases. Finally, after specification of the prior and fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases.

      % 1) SET UP THE GP: COVARIANCE; MEAN, LIKELIHOOD, INFERENCE METHOD
    1 mf = {'meanSum',{'meanLinear',@meanConst}}; a = 2; b = 1;      % m(x) = a*x+b
    2 cf = {'covSEiso'}; sf = 1; ell = 0.7;  % squared exponential covariance funct
    3 lf = 'likLaplace'; sn = 0.2;          % assume Laplace noise with variance sn^2
    4 hyp0.mean = [a;b]; hyp0.cov = log([ell;sf]); hyp0.lik = log(sn);     % hypers
    5 inf = 'infEP';        % specify expectation propagation as inference method
      % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt. hyp; do 50 CG steps
    6 Ncg = 50; [hyp,nlZ] = minimize(hyp0,'gp',-Ncg,inf,mf,cf,lf,X,y);
      % 3) PREDICT AT UNKNOWN TEST INPUTS
    7 [ymu,ys2] = gp(hyp,inf,mf,cf,lf,X,y,Xs);                      % test input Xs

In line 1, we specify the mean mφ(x) = a⊤x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function happens not to exist in the library of mean functions; instead, we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): as the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m), it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑_j mj(x), products m(x) = ∏_j mj(x), scaled versions m(x) = α m0(x) and powers m(x) = m0(x)^d. This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings 'meanConst' or alternatively as function handles @meanConst. The order of the components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = mφ(X) and the computation of the first derivatives mi = ∂mφ(X)/∂φi on a data set X.

In the same spirit, the squared exponential covariance kψ(x, x′) = σf² exp(−‖x − x′‖²/(2ℓ²)) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σf} is set up in line 2. Note that the hyperparameters are represented by their logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Matérn, rational quadratic, polynomial, periodic, neural network and finite support covariance functions. Composite covariance functions allow for sums k(x, x′) = ∑_j kj(x, x′), products k(x, x′) = ∏_j kj(x, x′), positive scaling k(x, x′) = σf² k0(x, x′) and masking of components k(x, x′) = k0(xI, x′I) with I ⊆ [1, 2, .., D], x ∈ R^D. Again, the interface is simple since only the evaluation of the covariance matrix K = kψ(X) and its derivatives ∂iK = ∂kψ(X)/∂ψi on a data set X are required. Furthermore, we need the cross terms k∗ = kψ(X, x∗) and k∗∗ = kψ(x∗, x∗) for prediction. There are no restrictions on the composition of mean and covariance functions; any combination is allowed, including nested composition.
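To make the nested composition concrete, the following is a hedged sketch of a composite covariance k(x,x′) = k1(x,x′)·k2(x,x′) + k3(x,x′) built from three squared exponentials. The composite names covSum and covProd are assumed here by analogy with meanSum above, and the numerical values are purely illustrative; the hyperparameters are stacked in the same order in which the components appear in the nested cell array.

    cfc = {'covSum',{{'covProd',{'covSEiso','covSEiso'}},'covSEiso'}};  % k1*k2 + k3
    hypc.cov = log([0.5;1.0; 2.0;1.0; 5.0;0.3]);  % [ell;sf] for k1, k2, k3, in that order
    % together with matching hypc.mean and hypc.lik fields, cfc can be passed to
    % gp.m in exactly the same way as the simple covariance cf in the listing above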
The Laplace likelihood (lik/likLaplace.m) Pρ(y|f) = exp(−√2|y − f|/σn)/(√2 σn) with hyperparameter ρ = {ln σn} is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student's t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved, as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments.

All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6, allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood ZEP(θ) as computed by gp.m. Here, D = (X, y) ≡ (X,y) is the training data, where X = {x1, .., xn} and y ∈ Rn. Under the hood, gp.m computes in every step a Gaussian posterior approximation and the derivatives ∂ ln ZEP(θ)/∂θ of the marginal likelihood by calling EP.

Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as an additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗|D, X∗)] ≡ ymu and the predictive variance V[P(y∗|D, X∗)] ≡ ys2.

    Likelihood \ Inference   Exact   FITC   EP   Laplace   VB    Type, Output Domain     Alternate Name
    Gaussian                   ✓      ✓     ✓      ✓       ✓     regression, R
    Sech-squared                            ✓      ✓       ✓     regression, R           logistic distribution
    Laplacian                               ✓              ✓     regression, R           double exponential
    Student's t                                    ✓       ✓     regression, R
    Error function                          ✓      ✓       ✓     classification, {±1}    probit regression
    Logistic function                       ✓      ✓       ✓     classification, {±1}    logit regression

            Table 1: Likelihood ↔ inference compatibility in the GPML toolbox

Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student's t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for the Laplace likelihood is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination of Student's t likelihood and LA.

If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood, where instead of the exact covariance matrix K, a low-rank plus diagonal matrix K̃ = Q + diag(K − Q) is used, with Q = Ku⊤ Kuu⁻¹ Ku. The matrices Kuu and Ku contain covariances and cross-covariances of and between inducing inputs ui and data points xj. Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n.
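As a brief illustration of this setup, the following hedged sketch continues the example from the listing above for a large training set: the base covariance cf is wrapped into cov/covFITC.m together with a set of inducing inputs, inference is switched to inf/infFITC.m, and the likelihood is Gaussian (assumed here to be lik/likGauss.m, following the toolbox's naming convention). The exact wrapper syntax, the inducing-input argument and the chosen values are assumptions for illustration only.

    nu  = 20;                                % number of inducing inputs (illustrative)
    idx = randperm(size(X,1));               % pick inducing inputs among the training data
    Xu  = X(idx(1:nu),:);
    cfF = {'covFITC',{cf},Xu};               % assumed wrapper: base covariance cf plus Xu
    lfG = 'likGauss';                        % FITC supports the Gaussian likelihood only
    hypF.mean = [a;b]; hypF.cov = log([ell;sf]); hypF.lik = log(sn);
    [ymu,ys2] = gp(hypF,'infFITC',mf,cfF,lfG,X,y,Xs);   % large-n regression predictions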
Acknowledgments

Thanks to Ed Snelson for assisting with the FITC approximation.

References

Joaquin Quiñonero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005.

Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000.

Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001.

Hannes Nickisch and Carl E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.

Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2006.

Christopher K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.